@pavibhai pavibhai commented Mar 24, 2023

What?

Iceberg currently does not support the selected vector when reading ORC files. As a result, whenever filter processing is triggered via orc.sarg.to.filter, ORC reads must run in compatibility mode, that is, with orc.filter.use.selected left unset.

Filter processing was introduced in ORC-744. With it, ORC can filter out records and mark the surviving rows of a partially filtered batch using the selected vector in VectorizedRowBatch.

This PR uses the selected vector to determine valid rows when applicable.
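To illustrate how a reader might honor the selected vector, here is a minimal, self-contained sketch. `MiniBatch` is a hypothetical stand-in that mirrors only the `size`, `selectedInUse`, and `selected` fields of VectorizedRowBatch, plus a single long column; it is not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

final class SelectedVectorSketch {
  /** Hypothetical stand-in for the few VectorizedRowBatch fields used here. */
  static final class MiniBatch {
    long[] column;          // one long column, indexed by physical row position
    int size;               // number of valid (surviving) rows in the batch
    boolean selectedInUse;  // true when only the rows listed in 'selected' are valid
    int[] selected;         // physical positions of the valid rows
  }

  /** Collects the column values of the valid rows only. */
  static List<Long> validValues(MiniBatch batch) {
    List<Long> out = new ArrayList<>();
    for (int i = 0; i < batch.size; i++) {
      // Map the logical row i to its physical position when a filter was applied.
      int row = batch.selectedInUse ? batch.selected[i] : i;
      out.add(batch.column[row]);
    }
    return out;
  }

  public static void main(String[] args) {
    MiniBatch batch = new MiniBatch();
    batch.column = new long[] {10L, 20L, 30L, 40L};
    batch.selected = new int[] {1, 3, 0, 0}; // only the first 'size' entries are meaningful
    batch.size = 2;
    batch.selectedInUse = true;
    System.out.println(validValues(batch)); // prints [20, 40]
  }
}
```

When `selectedInUse` is false, the same loop reads the first `size` rows directly, which corresponds to the compatibility-mode behavior.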

Why?

Without this change, ORC can only operate in compatibility mode, with orc.filter.use.selected unset. Enabling that setting speeds up row processing by skipping rows that have already been filtered out of the batch.

Tested?

New unit tests have been added to verify the behavior.

@pavibhai pavibhai changed the title [ORC] - Support selected vector with ORC reader on the row and batch reader [ORC][Spark] - Support selected vector with ORC reader on the row and batch reader Mar 24, 2023
@pavibhai pavibhai changed the title [ORC][Spark] - Support selected vector with ORC reader on the row and batch reader [WIP][ORC][Spark] - Support selected vector with ORC reader on the row and batch reader Mar 24, 2023
@pavibhai pavibhai marked this pull request as draft March 24, 2023 22:44
@pavibhai pavibhai changed the title [WIP][ORC][Spark] - Support selected vector with ORC reader on the row and batch reader [ORC][Spark] - Support selected vector with ORC reader on the row and batch reader Mar 24, 2023
@pavibhai pavibhai marked this pull request as ready for review March 24, 2023 22:47
InternalRow row = rows.next();
Assertions.assertEquals(3L, row.getLong(0));
Assertions.assertEquals("c", row.getString(1));
Assertions.assertArrayEquals(new int[] {3, 4, 5}, row.getArray(3).toIntArray());

Thanks for calling that out


done


protected int getRowIndex(int rowId) {
-  return vector.isRepeating ? 0 : rowId;
+  int row = isSelectedInUse ? selected[rowId] : rowId;

Is it possible for isSelectedInUse to be true while selected is empty?

@pavibhai pavibhai Mar 29, 2023

No, this is not a possibility. Like the other vectors, the selected vector is initialized up front to the maximum batch size, and all of the indices in it are expected to be within that size.
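The index resolution discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the merged code: the way the selected lookup combines with the isRepeating check is inferred from the two lines shown in the diff, and `RowIndexSketch`, `rowIndex`, and `MAX_BATCH_SIZE` are hypothetical names.

```java
final class RowIndexSketch {
  // Hypothetical maximum batch size used to preallocate the selected array.
  static final int MAX_BATCH_SIZE = 1024;

  /**
   * Resolves the physical index to read from a column vector.
   * Assumption: a repeating vector stores its single value at index 0,
   * so the selected lookup only matters for non-repeating vectors.
   */
  static int rowIndex(boolean isRepeating, boolean isSelectedInUse, int[] selected, int rowId) {
    int row = isSelectedInUse ? selected[rowId] : rowId;
    return isRepeating ? 0 : row;
  }

  public static void main(String[] args) {
    // Preallocated to the maximum batch size, so the array itself is never empty.
    int[] selected = new int[MAX_BATCH_SIZE];
    selected[0] = 5; // the first surviving row lives at physical position 5

    System.out.println(rowIndex(false, true, selected, 0));  // prints 5
    System.out.println(rowIndex(true, true, selected, 0));   // prints 0
    System.out.println(rowIndex(false, false, selected, 7)); // prints 7
  }
}
```

Because the array is sized up front, a lookup with a rowId below the batch size cannot go out of bounds, which is the point made in the reply above.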

@github-actions github-actions bot added the build label Mar 29, 2023
@aokolnychyi aokolnychyi merged commit b756a5d into apache:master Apr 15, 2023
@aokolnychyi

Thanks, @pavibhai!

manisin pushed a commit to Snowflake-Labs/iceberg that referenced this pull request May 9, 2023