
Conversation

@ndrluis (Collaborator) commented Jan 5, 2026

Closes #2372

Rationale for this change

PyArrow's pa.uuid() type creates Python Arrow metadata that differs from Java Arrow's UUID metadata, causing incompatibility with Spark. The Python and Rust Arrow implementations don't recognize Java's UUID metadata for filtering: reading works, but filtering returns this error:

ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)

Changes:

  • Change UUIDType Arrow schema conversion from pa.uuid() to pa.binary(16) (see the sketch after this list)
  • Add an integration test verifying the UUID round-trip between PyIceberg and Spark
  • Update existing tests to expect binary(16) instead of pa.uuid()
  • Fix test_uuid_partitioning to properly convert bytes to UUID strings
  • Bump the Iceberg version to 1.10.1, which includes the Java-side UUID fix
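
A minimal, standalone sketch of what the first change amounts to (illustrative only; the actual edit is to the UUID case of the Arrow schema visitor in pyiceberg/io/pyarrow.py):

import pyarrow as pa
from pyiceberg.types import IcebergType, UUIDType

# Illustrative equivalent of the visitor's UUID case: map Iceberg's UUID type
# to a plain 16-byte fixed-size binary instead of pa.uuid(), whose extension
# metadata Java, Python, and Rust Arrow do not agree on.
def arrow_type_for_uuid(iceberg_type: IcebergType) -> pa.DataType:
    assert isinstance(iceberg_type, UUIDType)
    return pa.binary(16)  # previously: pa.uuid()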

Are these changes tested?

Yes

Are there any user-facing changes?

No

@kevinjqliu (Contributor) left a comment

This seems to be a revert of #2007.
I think this is a new issue with Spark, Iceberg, and UUID. The previous fix (apache/iceberg#13324) and its Spark 4 backport (apache/iceberg#13573) are already included in 1.10.0.

I've included the stacktrace from Spark for debugging.

EDIT:
The Java-side UUID fix in 1.10.1 is actually apache/iceberg#14027.
I had to run make test-integration-rebuild to update the Docker image cache.

The new stacktrace is:

    
>       result = table.scan(row_filter=EqualTo("uuid_col", uuid.UUID("00000000-0000-0000-0000-000000000000").bytes)).to_arrow()

tests/integration/test_writes/test_writes.py:2588: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyiceberg/table/__init__.py:2027: in to_arrow
    ).to_table(self.plan_files())
pyiceberg/io/pyarrow.py:1730: in to_table
    first_batch = next(batches)
pyiceberg/io/pyarrow.py:1781: in to_record_batches
    for batches in executor.map(batches_for_task, tasks):
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:456: in result
    return self.__get_result()
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/_base.py:401: in __get_result
    raise self._exception
../../.pyenv/versions/3.12.11/lib/python3.12/concurrent/futures/thread.py:59: in run
    result = self.fn(*self.args, **self.kwargs)
pyiceberg/io/pyarrow.py:1778: in batches_for_task
    return list(self._record_batches_from_scan_tasks_and_deletes([task], deletes_per_file))
pyiceberg/io/pyarrow.py:1818: in _record_batches_from_scan_tasks_and_deletes
    for batch in batches:
pyiceberg/io/pyarrow.py:1600: in _task_to_record_batches
    fragment_scanner = ds.Scanner.from_fragment(
pyarrow/_dataset.pyx:3792: in pyarrow._dataset.Scanner.from_fragment
    ???
pyarrow/_dataset.pyx:3547: in pyarrow._dataset._populate_builder
    ???
pyarrow/_compute.pyx:2884: in pyarrow._compute._bind
    ???
pyarrow/error.pxi:155: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.uuid>, extension<arrow.uuid>)

pyarrow/error.pxi:92: ArrowNotImplementedError

assert len(table.scan().to_arrow()) == 4

result = df.where("uuid_col = '00000000-0000-0000-0000-000000000000'")
assert result.count() == 1
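
For context, a condensed, self-contained sketch of the round-trip these excerpts come from, covering only the PyIceberg side (identifiers, catalog setup, and row counts are assumed; the Spark verification above runs against the same table):

import uuid
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

catalog = load_catalog("default", type="in-memory")
catalog.create_namespace("default")
table = catalog.create_table(
    "default.uuid_roundtrip",
    schema=Schema(NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False)),
)

# Write four UUIDs as raw 16-byte values (binary(16) after this change).
values = [uuid.UUID(int=i) for i in range(4)]
table.append(pa.table({"uuid_col": pa.array([v.bytes for v in values], type=pa.binary(16))}))

assert len(table.scan().to_arrow()) == 4
# Filtering on the raw bytes now works instead of raising ArrowNotImplementedError.
result = table.scan(row_filter=EqualTo("uuid_col", uuid.UUID(int=0).bytes)).to_arrow()
assert len(result) == 1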
Contributor

I ran this test on the current main branch with 1.10.1, and this is the stacktrace. It is different from the stacktrace in #2007.

E                   pyspark.errors.exceptions.connect.SparkException: Job aborted due to stage failure: Task 0 in stage 142.0 failed 1 times, most recent failure: Lost task 0.0 in stage 142.0 (TID 287) (fcaa97ba83c2 executor driver): java.lang.ClassCastException: class java.util.UUID cannot be cast to class java.nio.ByteBuffer (java.util.UUID and java.nio.ByteBuffer are in module java.base of loader 'bootstrap')
E                       at java.base/java.nio.ByteBuffer.compareTo(ByteBuffer.java:267)
E                       at java.base/java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:52)
E                       at java.base/java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:47)
E                       at org.apache.iceberg.types.Comparators$NullSafeChainedComparator.compare(Comparators.java:306)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:352)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:79)
E                       at org.apache.iceberg.expressions.ExpressionVisitors$BoundExpressionVisitor.predicate(ExpressionVisitors.java:162)
E                       at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:390)
E                       at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:409)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eval(ParquetMetricsRowGroupFilter.java:103)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter.shouldRead(ParquetMetricsRowGroupFilter.java:73)
E                       at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:108)
E                       at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
E                       at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
E                       at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:126)
E                       at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:43)
E                       at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:141)
E                       at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:148)
E                       at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:186)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:72)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:72)
E                       at scala.Option.exists(Option.scala:406)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:72)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:103)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:72)
E                       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
E                       at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
E                       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
E                       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
E                       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
E                       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E                       at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
E                       at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
E                       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:143)
E                       at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
E                       at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
E                       at org.apache.spark.scheduler.Task.run(Task.scala:147)
E                       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
E                       at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
E                       at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
E                       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
E                       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
E                       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
E                       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
E                       at java.base/java.lang.Thread.run(Thread.java:840)
E                   
E                   Driver stacktrace:
E                   
E                   JVM stacktrace:
E                   org.apache.spark.SparkException
E                       at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$3(DAGScheduler.scala:2935)
E                       at scala.Option.getOrElse(Option.scala:201)
E                       at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2935)
E                       at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2927)
E                       at scala.collection.immutable.List.foreach(List.scala:334)
E                       at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2927)
E                       at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1295)
E                       at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1295)
E                       at scala.Option.foreach(Option.scala:437)
E                       at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1295)
E                       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3207)
E                       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3141)
E                       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3130)
E                       at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:50)
E                       at org.apache.spark.util.Utils$.getTryWithCallerStacktrace(Utils.scala:1439)
E                       at org.apache.spark.util.LazyTry.get(LazyTry.scala:58)
E                       at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:201)
E                       at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:260)
E                       at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
E                       at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:257)
E                       at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:197)
E                       at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.$anonfun$processAsArrowBatches$2(SparkConnectPlanExecution.scala:155)
E                       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$8(SQLExecution.scala:163)
E                       at org.apache.spark.sql.execution.SQLExecution$.withSessionTagsApplied(SQLExecution.scala:272)
E                       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$7(SQLExecution.scala:125)
E                       at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
E                       at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:112)
E                       at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:186)
E                       at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:102)
E                       at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:111)
E                       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:125)
E                       at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:295)
E                       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:124)
E                       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
E                       at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:78)
E                       at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:237)
E                       at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.processAsArrowBatches(SparkConnectPlanExecution.scala:154)
E                       at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:78)
E                       at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:314)
E                       at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:225)
E                       at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:196)
E                       at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:341)
E                       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
E                       at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:341)
E                       at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
E                       at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:112)
E                       at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:186)
E                       at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:102)
E                       at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:111)
E                       at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:340)
E                       at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:196)
E                       at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:125)
E                       at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:347)
E                   Caused by: java.lang.ClassCastException: class java.util.UUID cannot be cast to class java.nio.ByteBuffer (java.util.UUID and java.nio.ByteBuffer are in module java.base of loader 'bootstrap')
E                       at java.nio.ByteBuffer.compareTo(ByteBuffer.java:267)
E                       at java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:52)
E                       at java.util.Comparators$NaturalOrderComparator.compare(Comparators.java:47)
E                       at org.apache.iceberg.types.Comparators$NullSafeChainedComparator.compare(Comparators.java:306)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:352)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eq(ParquetMetricsRowGroupFilter.java:79)
E                       at org.apache.iceberg.expressions.ExpressionVisitors$BoundExpressionVisitor.predicate(ExpressionVisitors.java:162)
E                       at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:390)
E                       at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:409)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter$MetricsEvalVisitor.eval(ParquetMetricsRowGroupFilter.java:103)
E                       at org.apache.iceberg.parquet.ParquetMetricsRowGroupFilter.shouldRead(ParquetMetricsRowGroupFilter.java:73)
E                       at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:108)
E                       at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
E                       at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
E                       at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:126)
E                       at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:43)
E                       at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:141)
E                       at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:148)
E                       at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:186)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:72)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:72)
E                       at scala.Option.exists(Option.scala:406)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:72)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:103)
E                       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:72)
E                       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
E                       at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
E                       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(:-1)
E                       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(:-1)
E                       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(:-1)
E                       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E                       at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
E                       at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
E                       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:143)
E                       at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
E                       at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
E                       at org.apache.spark.scheduler.Task.run(Task.scala:147)
E                       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
E                       at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
E                       at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
E                       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
E                       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
E                       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
E                       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
E                       at java.lang.Thread.run(Thread.java:840)

.venv/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py:1882: SparkException

Contributor

I also downloaded the two data files:

➜  Downloads parquet schema 00000-0-562a1d32-f5da-4e09-836a-b7d0d4e737e6.parquet
{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "uuid_col",
    "type" : [ "null", {
      "type" : "string",
      "logicalType" : "uuid"
    } ],
    "default" : null
  } ]
}
➜  Downloads parquet schema 00000-284-30c509c4-f8e9-46fb-a1f1-169e1c928e00-0-00001.parquet
{
  "type" : "record",
  "name" : "table",
  "fields" : [ {
    "name" : "uuid_col",
    "type" : {
      "type" : "string",
      "logicalType" : "uuid"
    }
  } ]
}

Collaborator (Author)

The stacktrace has changed because of the fix that I made in the Java implementation. That PR (apache/iceberg#14027) has more details about the problem, and in issue #2372 I explain the problem from the PyIceberg side and why we are changing back to binary(16).

Contributor

The Parquet files look wrong, and I'm not sure what happened there. UUID should annotate FIXED_LEN_BYTE_ARRAY:

[image: excerpt from the Parquet format spec showing the UUID logical type annotating a 16-byte fixed_len_byte_array]
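
A quick way to confirm what a data file actually contains, sketched with pyarrow (file name taken from the dump above):

import pyarrow.parquet as pq

# Per the Parquet spec, a UUID column should be FIXED_LEN_BYTE_ARRAY(16)
# annotated with the UUID logical type; the dumps above instead show a
# string/BYTE_ARRAY column.
schema = pq.ParquetFile("00000-0-562a1d32-f5da-4e09-836a-b7d0d4e737e6.parquet").schema
column = schema.column(0)
print(column.name, column.physical_type, column.logical_type)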

Collaborator (Author)

The pa.binary(16) change fixed the comparison issue but broke Parquet spec compliance by removing the UUID logical type annotation. We could go back to pa.uuid() in the visitor and raise an exception with a better message when the user tries to filter on a UUID column, since PyArrow does not support filtering on the arrow.uuid extension type.
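
A rough sketch of the kind of guard being floated here (the function name, placement, and the expression-API attribute chain are all assumed, not the actual implementation):

from pyiceberg.types import UUIDType

# Hypothetical check before pushing a bound predicate down to PyArrow: fail
# fast with a clearer message, since PyArrow has no comparison kernels for
# the arrow.uuid extension type.
def reject_uuid_filter(bound_predicate) -> None:
    field_type = bound_predicate.term.ref().field.field_type
    if isinstance(field_type, UUIDType):
        raise NotImplementedError(
            "Filtering on UUID columns is not supported: PyArrow has no 'equal' "
            "kernel for the arrow.uuid extension type."
        )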

@kevinjqliu (Contributor)

The UUID support is a gift that keeps on giving.

😄

"""
identifier = "default.test_write_uuid_in_pyiceberg_and_scan"

catalog = load_catalog("default", type="in-memory")
Contributor

nit: do we need these lines with the session catalog?

Collaborator (Author)

No, I fixed it in 1751f0c

@geruh (Contributor) left a comment
LGTM! I just have a small nit on the test. One minor note: this changes how UUID columns are returned from Arrow, as users now get bytes instead of UUID values. Worth noting in the release notes.
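
For users affected by this, converting the returned 16-byte values back to Python UUIDs is straightforward (column name and sample data assumed):

import uuid
import pyarrow as pa

# table.scan(...).to_arrow() now yields UUID columns as fixed-size binary(16);
# rebuild uuid.UUID objects from the raw bytes when needed.
arrow_table = pa.table({"uuid_col": pa.array([uuid.uuid4().bytes for _ in range(3)], type=pa.binary(16))})
uuids = [uuid.UUID(bytes=v.as_py()) for v in arrow_table["uuid_col"]]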

Successfully merging this pull request may close these issues.

Error when filtering by UUID in table scan