[ORC][Spark] - Support selected vector with ORC reader on the row and batch reader #7197
InternalRow row = rows.next();
Assertions.assertEquals(3L, row.getLong(0));
Assertions.assertEquals("c", row.getString(1));
Assertions.assertArrayEquals(new int[] {3, 4, 5}, row.getArray(3).toIntArray());
Hi @pavibhai, there was a recommendation to use AssertJ for assertions: https://github.com/apache/iceberg/blob/master/CONTRIBUTING.md#assertj
Thanks for calling that out
Done.
protected int getRowIndex(int rowId) {
  return vector.isRepeating ? 0 : rowId;
  int row = isSelectedInUse ? selected[rowId] : rowId;
Is it possible for isSelectedInUse to be true while selected is empty?
No, this is not a possibility. Like the other vectors, the selected vector is initialized at the start to the max batch size, and all of the indices in it are expected to be within that size.
Thanks, @pavibhai!
What?
Currently, Iceberg does not support the use of the selected vector when reading ORC files. This requires reads on ORC to be run in compatibility mode by not setting orc.filter.use.selected in the presence of filter processing that is triggered via orc.sarg.to.filter.
Filter processing was introduced as part of ORC-744, where ORC has the ability to filter out records and indicate this status in partially filtered batches using the selected vector in VectorizedRowBatch.
This PR uses the selected vector to determine valid rows when applicable.
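To illustrate what "using the selected vector to determine valid rows" can look like, here is a minimal, self-contained sketch. It does not use the ORC or Iceberg classes: Batch, rowIndex, and readRows are hypothetical stand-ins, and the field names only mirror the ones that appear in the getRowIndex snippet from the review thread. How the row lookup combines with isRepeating is an assumption here, not the PR's confirmed code.

```java
import java.util.ArrayList;
import java.util.List;

public class SelectedVectorSketch {

  // Hypothetical stand-in for a single-column row batch; NOT the ORC VectorizedRowBatch.
  static final class Batch {
    long[] values;          // column data, sized to the max batch size
    boolean isRepeating;    // when true, every row takes values[0]
    boolean selectedInUse;  // when true, only rows listed in `selected` are valid
    int[] selected;         // physical indices of the rows that survived filtering
    int size;               // number of valid rows in this batch
  }

  // Builds a batch for the example; `selected` may be null when it is not in use.
  static Batch batch(long[] values, int[] selected, int size, boolean selectedInUse) {
    Batch b = new Batch();
    b.values = values;
    b.selected = selected;
    b.size = size;
    b.selectedInUse = selectedInUse;
    return b;
  }

  // Maps a logical row position (0..size-1) to a physical index into `values`.
  // Assumption: the selected-vector lookup happens first, then isRepeating wins.
  static int rowIndex(Batch batch, int rowId) {
    int row = batch.selectedInUse ? batch.selected[rowId] : rowId;
    return batch.isRepeating ? 0 : row;
  }

  // Reads only the valid rows of the batch, skipping rows that were filtered out.
  static List<Long> readRows(Batch batch) {
    List<Long> out = new ArrayList<>(batch.size);
    for (int rowId = 0; rowId < batch.size; rowId++) {
      out.add(batch.values[rowIndex(batch, rowId)]);
    }
    return out;
  }

  public static void main(String[] args) {
    long[] values = {10L, 20L, 30L, 40L, 50L};

    // Partially filtered batch: only physical rows 1 and 3 passed the filter.
    Batch filtered = batch(values, new int[] {1, 3, 0, 0, 0}, 2, true);
    System.out.println(readRows(filtered)); // prints [20, 40]

    // Unfiltered batch: the selected vector is ignored and all rows are read.
    Batch full = batch(values, null, 5, false);
    System.out.println(readRows(full)); // prints [10, 20, 30, 40, 50]
  }
}
```

In this sketch the selected array stays allocated at the full batch length and only its first size entries are meaningful, which matches the reply in the review thread that the selected vector is initialized to the max batch size.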
Why?
Without this change, ORC reads can only operate in compatibility mode, with orc.filter.use.selected left unset. Enabling it speeds up row processing further by skipping rows that are already filtered out in the batch.
Tested?
New unit tests have been added to verify the behavior.