Spark, Arrow, Parquet: Add vectorized read support for parquet RLE encoded data pages #14853
Conversation
- Now that RLE boolean data page reads are vectorized, vectorized reads over parquet v2 written files with dictionary encoded columns should work completely
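For context on what this reader decodes: Parquet's RLE/bit-packed hybrid encoding at bit-width 1 (the boolean case) alternates varint-prefixed RLE runs with bit-packed groups of eight values. The sketch below is illustrative only; the PR's actual implementation is VectorizedRunLengthEncodedParquetValuesReader, which decodes into Arrow vectors rather than a plain array.

```java
import java.nio.ByteBuffer;

// Illustrative decoder for Parquet's RLE/bit-packed hybrid at bit-width 1.
// A header's low bit selects the run type: 0 = RLE run, 1 = bit-packed run.
class RleBooleanSketch {
  static boolean[] decode(ByteBuffer in, int valueCount) {
    boolean[] out = new boolean[valueCount];
    int i = 0;
    while (i < valueCount) {
      int header = readUnsignedVarInt(in);
      if ((header & 1) == 0) {
        // RLE run: length in the upper bits, then one repeated value
        // (bit-width 1 rounds up to a single byte).
        int runLen = header >>> 1;
        boolean value = (in.get() & 1) != 0;
        for (int j = 0; j < runLen && i < valueCount; j++) {
          out[i++] = value;
        }
      } else {
        // Bit-packed run: header counts groups of 8 values, one byte per
        // group at bit-width 1, packed least-significant bit first.
        int groups = header >>> 1;
        for (int g = 0; g < groups && i < valueCount; g++) {
          int packed = in.get() & 0xFF;
          for (int b = 0; b < 8 && i < valueCount; b++) {
            out[i++] = ((packed >> b) & 1) != 0;
          }
        }
      }
    }
    return out;
  }

  static int readUnsignedVarInt(ByteBuffer in) {
    int value = 0;
    int shift = 0;
    int b;
    do {
      b = in.get() & 0xFF;
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}
```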
new VectorizedRunLengthEncodedParquetValuesReader(setArrowValidityVector);
break;
}
// fall through
This is quite a serious fall-through, given that the parquet spec limits what RLE can be used for: booleans, repetition and definition levels, and dictionary indices. Is it likely to occur in the wild?
If so, it probably merits a test case verifying that if you create a column whose type != BOOLEAN, you can't init() it with RLE data encoding.
> given that the parquet spec limits what RLE can be used for: booleans, repetition and definition levels, and dictionary indices. Is it likely to occur in the wild?
Yeah, theoretically a malformed parquet writer could produce this in the wild. That said, it wouldn't be to spec, given that bool is the only data page type that can be RLE encoded; we handle the dictionary RLE up in VectorizedDictionaryEncodedParquetValuesReader (directly above here), and the repetition and definition levels are handled via VectorizedParquetDefinitionLevelReader.
All to say, I think this is impossible in practice. If a malformed writer does in fact write a file with a non-bool RLE data page, it wouldn't be to spec, so we'd be correctly throwing here. I can add a negative test case for this, although I'd have to make a corrupt parquet writer implementation to do so. Happy to do it if you think it adds value.
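Something like the following rough sketch; both helper methods are hypothetical placeholders (one for the spec-violating writer, one for wiring the page into the reader), not existing APIs:

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.parquet.column.page.DataPage;
import org.junit.jupiter.api.Test;

public class TestRleEncodedNonBooleanPage {

  @Test
  public void initRejectsRleEncodedNonBooleanDataPage() {
    // Hypothetical helper: would require a deliberately spec-violating
    // writer that emits an RLE-encoded data page for an INT32 column.
    DataPage corruptPage = writeRleEncodedIntPage();

    // Initializing the vectorized reader against that page should hit the
    // unsupported-encoding fall-through and throw.
    assertThatThrownBy(() -> initVectorizedReader(corruptPage))
        .isInstanceOf(UnsupportedOperationException.class);
  }

  private DataPage writeRleEncodedIntPage() {
    throw new UnsupportedOperationException("corrupt parquet writer not implemented");
  }

  private void initVectorizedReader(DataPage page) {
    throw new UnsupportedOperationException("reader wiring not implemented");
  }
}
```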
Also FWIW, the full parquet v2 vectorized impl PR (that this PR was split out from) has quite a few production PBs read under its belt at this point and hasn't hit anything like this in the wild.
Well, nothing is going to handle it, so it's not much of a source of data, is it? The key thing is not to cause damage to the system beyond the specific query failing.
Done in dda0185
What
This PR adds vectorized read support to Iceberg for RLE encoded data pages of the Apache Parquet v2 specification (see #7162). It builds on top of the existing support for reading DELTA_BINARY_PACKED implemented by @eric-maynard in #13391.
This is split out from #14800 to make reviewing the changes easier and to facilitate tighter feedback cycles, per some feedback on the Iceberg Developer Slack Community.
Background
This solves a longstanding issue: the reference Apache Iceberg Spark implementation with the default settings enabled (e.g. spark.sql.iceberg.vectorization.enabled=true) isn't able to read Iceberg tables that may have been written by other compute engines using the (not so new anymore) Apache Parquet v2 writer specification. The widely known workaround is to disable the vectorized reader in Spark when you need to interoperate with other compute engines (as sketched below), or to make all compute engines use the Apache Parquet v1 writer specification when writing parquet files.

Disabling the vectorization flag means clients take a performance hit that we've anecdotally measured to be quite large in some cases/workloads. Forcing all writers of an Iceberg table to write in Apache Parquet v1 format means clients incur additional performance and storage penalties: files written with parquet v2 tend to be smaller than those written with the v1 spec, as the newer encodings tend to save space and are often faster to read/write. So the current setup is a lose-lose for performance and data size, with the additional papercut that Apache Iceberg isn't very portable across engines in its default setup. This PR seeks to solve that by finishing the work of implementing vectorized parquet read support for the v2 format. In the future, we may also consider allowing clients to write Apache Parquet v2 files natively, gated behind an Apache Iceberg setting. Even further down that road, we may consider making that the default.
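For concreteness, today's workaround looks roughly like the sketch below (catalog and table names are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of the current workaround: turn off Iceberg's vectorized reader so
// the non-vectorized path can handle parquet v2 data pages. The catalog and
// table names here are made up.
public class ParquetV2Workaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("v2-workaround").getOrCreate();
    spark.conf().set("spark.sql.iceberg.vectorization.enabled", "false");
    Dataset<Row> rows = spark.sql("SELECT * FROM demo.db.events"); // written by a parquet v2 engine
    rows.show();
  }
}
```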
Previous Work / Thanks
This PR is a revival of and extension to the work that @eric-maynard was doing in #13709. That PR had been inactive for a little while, so I picked up right where Eric left off. Thank you for the great work here @eric-maynard; you made implementing the rest of the changes required for vectorized read support way easier!
Note to Reviewers
This is a split of the vectorized RLE encoded data page read support from #14800.
Testing
I've tested this on a fork of Spark 3.5 with Iceberg 1.10.0 and verified that a Spark job is able to read a table written with the Parquet v2 writer without issues.
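A minimal version of that verification looks roughly like this (catalog and table names are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Rough shape of the verification read: vectorization left enabled (the
// default) against a table whose data files were written with the Parquet v2
// writer by another engine. Catalog and table names are made up.
public class VerifyParquetV2Read {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("verify-v2-read").getOrCreate();
    spark.conf().set("spark.sql.iceberg.vectorization.enabled", "true");
    Dataset<Row> df = spark.sql("SELECT * FROM demo.db.v2_written_table");
    System.out.println(df.count()); // full scan exercises the new RLE read paths
  }
}
```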
Issue: #7162
Split out from: #14800
cc @nastra