Spark, Arrow, Parquet: Add vectorized read support for parquet RLE encoded data pages #14853
Conversation
- Now that RLE boolean data page reads are vectorized, vectorized reads over parquet v2 written files with dictionary encoded columns should work completely
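For context on what this reader decodes: Parquet's RLE/bit-packed hybrid encoding at bit-width 1 (the boolean case) alternates varint-prefixed RLE runs with bit-packed groups of eight values. The sketch below is illustrative only; the PR's actual implementation is VectorizedRunLengthEncodedParquetValuesReader, which decodes into Arrow vectors rather than a plain array.

```java
import java.nio.ByteBuffer;

// Illustrative decoder for Parquet's RLE/bit-packed hybrid at bit-width 1.
// A header's low bit selects the run type: 0 = RLE run, 1 = bit-packed run.
class RleBooleanSketch {
  static boolean[] decode(ByteBuffer in, int valueCount) {
    boolean[] out = new boolean[valueCount];
    int i = 0;
    while (i < valueCount) {
      int header = readUnsignedVarInt(in);
      if ((header & 1) == 0) {
        // RLE run: length in the upper bits, then one repeated value
        // (bit-width 1 rounds up to a single byte).
        int runLen = header >>> 1;
        boolean value = (in.get() & 1) != 0;
        for (int j = 0; j < runLen && i < valueCount; j++) {
          out[i++] = value;
        }
      } else {
        // Bit-packed run: header counts groups of 8 values, one byte per
        // group at bit-width 1, packed least-significant bit first.
        int groups = header >>> 1;
        for (int g = 0; g < groups && i < valueCount; g++) {
          int packed = in.get() & 0xFF;
          for (int b = 0; b < 8 && i < valueCount; b++) {
            out[i++] = ((packed >> b) & 1) != 0;
          }
        }
      }
    }
    return out;
  }

  static int readUnsignedVarInt(ByteBuffer in) {
    int value = 0;
    int shift = 0;
    int b;
    do {
      b = in.get() & 0xFF;
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}
```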
new VectorizedRunLengthEncodedParquetValuesReader(setArrowValidityVector);
break;
}
// fall through
This is quite a serious fall-through, given that the parquet spec limits what RLE can be used for: booleans, repetition and definition levels, and dictionary indices. Is it likely to occur in the wild?
If so, it probably merits a test case verifying that if you create a column whose type != BOOLEAN, you can't init() it with RLE data encoding.
> given that the parquet spec limits what RLE can be used for: booleans, repetition and definition levels, and dictionary indices. Is it likely to occur in the wild?
Yeah, theoretically a malformed parquet writer could produce this in the wild. That said, it wouldn't be to spec, given that bool is the only data page type that can be RLE encoded; we handle the dictionary RLE up in VectorizedDictionaryEncodedParquetValuesReader (directly above here), and the repetition and definition levels are handled via VectorizedParquetDefinitionLevelReader.
All to say, I think this is impossible in practice. If a malformed writer does in fact write a file with a non-bool RLE data page, it wouldn't be to spec, so we'd be correctly throwing here. I can add a negative test case for this, although I'd have to make a corrupt parquet writer implementation to do so. Happy to do it if you think it adds value.
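Something like the following rough sketch; both helper methods are hypothetical placeholders (one for the spec-violating writer, one for wiring the page into the reader), not existing APIs:

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.parquet.column.page.DataPage;
import org.junit.jupiter.api.Test;

public class TestRleEncodedNonBooleanPage {

  @Test
  public void initRejectsRleEncodedNonBooleanDataPage() {
    // Hypothetical helper: would require a deliberately spec-violating
    // writer that emits an RLE-encoded data page for an INT32 column.
    DataPage corruptPage = writeRleEncodedIntPage();

    // Initializing the vectorized reader against that page should hit the
    // unsupported-encoding fall-through and throw.
    assertThatThrownBy(() -> initVectorizedReader(corruptPage))
        .isInstanceOf(UnsupportedOperationException.class);
  }

  private DataPage writeRleEncodedIntPage() {
    throw new UnsupportedOperationException("corrupt parquet writer not implemented");
  }

  private void initVectorizedReader(DataPage page) {
    throw new UnsupportedOperationException("reader wiring not implemented");
  }
}
```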
Also FWIW, the full parquet v2 vectorized impl PR (that this PR was split out from) has quite a few production PBs read under its belt at this point and hasn't hit anything like this in the wild.
Well, nothing is going to handle it, so it's not much of a source of data, is it? The key thing is not to cause damage to the system beyond the specific query failing.
Done in dda0185
What
This PR adds vectorized read support to Iceberg for RLE encoded data pages of the Apache Parquet v2 specification (see #7162). It builds on top of the existing support for reading DELTA_BINARY_PACKED implemented by @eric-maynard in #13391.
This is split out from #14800 to make reviewing the changes easier and to facilitate tighter feedback cycles, per some feedback on the Iceberg Developer Slack Community.
Background
This solves a longstanding issue: the reference Apache Iceberg Spark implementation with the default settings enabled (e.g. spark.sql.iceberg.vectorization.enabled=true) isn't able to read Iceberg tables that may have been written by other compute engines using the (not so new anymore) Apache Parquet v2 writer specification. The widely known workaround is to disable the vectorized reader in Spark when you need to interoperate with other compute engines (as sketched below), or to make all compute engines use the Apache Parquet v1 writer specification when writing parquet files.

Disabling the vectorization flag means clients take a performance hit that we've anecdotally measured to be quite large in some cases/workloads. Forcing all writers of an Iceberg table to write in Apache Parquet v1 format means clients incur additional performance and storage penalties: files written with parquet v2 tend to be smaller than those written with the v1 spec, as the newer encodings tend to save space and are often faster to read/write. So the current setup is a lose-lose for performance and data size, with the additional papercut that Apache Iceberg isn't very portable across engines in its default setup. This PR seeks to solve that by finishing the work of implementing vectorized parquet read support for the v2 format. In the future, we may also consider allowing clients to write Apache Parquet v2 files natively, gated behind an Apache Iceberg setting. Even further down that road, we may consider making that the default.
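For concreteness, today's workaround looks roughly like the sketch below (catalog and table names are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of the current workaround: turn off Iceberg's vectorized reader so
// the non-vectorized path can handle parquet v2 data pages. The catalog and
// table names here are made up.
public class ParquetV2Workaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("v2-workaround").getOrCreate();
    spark.conf().set("spark.sql.iceberg.vectorization.enabled", "false");
    Dataset<Row> rows = spark.sql("SELECT * FROM demo.db.events"); // written by a parquet v2 engine
    rows.show();
  }
}
```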
Previous Work / Thanks
This PR is a revival of and extension to the work that @eric-maynard was doing in #13709. That PR had been inactive for a little while, so I picked up right where Eric left off. Thank you for the great work here @eric-maynard; you made implementing the rest of the changes required for vectorized read support way easier!
Note to Reviewers
This is a split of the vectorized RLE encoded data page read support from #14800.
Testing
I've tested this on a fork of Spark 3.5 with Iceberg 1.10.0 and verified that a Spark job is able to read a table written with the Parquet v2 writer without issues.
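A minimal version of that verification looks roughly like this (catalog and table names are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Rough shape of the verification read: vectorization left enabled (the
// default) against a table whose data files were written with the Parquet v2
// writer by another engine. Catalog and table names are made up.
public class VerifyParquetV2Read {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("verify-v2-read").getOrCreate();
    spark.conf().set("spark.sql.iceberg.vectorization.enabled", "true");
    Dataset<Row> df = spark.sql("SELECT * FROM demo.db.v2_written_table");
    System.out.println(df.count()); // full scan exercises the new RLE read paths
  }
}
```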
Issue: #7162
Split out from: #14800
cc @nastra