
@shardulm94
Contributor

@shardulm94 shardulm94 commented Nov 2, 2020

The ORC non-vectorized code path uses the following condition in OrcRowIterator#next() to determine whether a new batch should be read and the row counter reset:
if (batch == null || nextRow >= batch.size) (Note: the code uses current as the variable name instead of batch.)

Since the batch object is reused, #hasNext() can cause a new batch to be loaded if the existing batch has been consumed. In that case, the condition in #next() is supposed to reset the row counter. However, if #hasNext() was called before #next(), the batch variable already holds the data for the next batch, so batch.size gives the size of the newly loaded batch rather than the size of the batch that was just consumed.

In some cases the batch size stays the same across batches, so the current condition works fine. However, when the next batch is larger than the current one, the condition nextRow >= batch.size is false even though a new batch was loaded. As a result, nextRow is not reset and rows from the new batch are skipped.
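To make the failure concrete, here is a simplified model of it. Batch, BatchIterator and BuggyRowIterator are hypothetical stand-ins written for this sketch, not the Iceberg or ORC classes; the only parts taken from the description above are the reused batch object and the nextRow >= batch.size check.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-ins for illustration only; not the Iceberg or ORC classes.
class ReusedBatchSkipDemo {
  // One mutable batch object that is overwritten on every load.
  static class Batch {
    int[] rows = new int[0];
    int size = 0;
  }

  // hasNext() eagerly loads the next batch into the shared Batch object.
  static class BatchIterator {
    private final int[][] source;
    private final Batch batch = new Batch();
    private int nextBatch = 0;
    private boolean pending = false;

    BatchIterator(int[][] source) {
      this.source = source;
    }

    boolean hasNext() {
      if (!pending && nextBatch < source.length) {
        batch.rows = source[nextBatch];
        batch.size = source[nextBatch].length;
        nextBatch++;
        pending = true;
      }
      return pending;
    }

    Batch next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      pending = false;
      return batch;
    }
  }

  // Row iterator that compares nextRow against the live size of the reused batch.
  static class BuggyRowIterator implements Iterator<Integer> {
    private final BatchIterator batches;
    private Batch current = null;
    private int nextRow = 0;

    BuggyRowIterator(BatchIterator batches) {
      this.batches = batches;
    }

    @Override
    public boolean hasNext() {
      return (current != null && nextRow < current.size) || batches.hasNext();
    }

    @Override
    public Integer next() {
      if (current == null || nextRow >= current.size) {
        current = batches.next();
        nextRow = 0;
      }
      return current.rows[nextRow++];
    }
  }

  // Consumes the first batch, then returns the first row read after the batch boundary.
  static int firstRowAfterBoundary(int[][] batches) {
    Iterator<Integer> rows = new BuggyRowIterator(new BatchIterator(batches));
    for (int i = 0; i < batches[0].length; i++) {
      rows.hasNext();
      rows.next();
    }
    rows.hasNext(); // loads the second batch into the shared object
    return rows.next();
  }

  public static void main(String[] args) {
    // Second batch is larger: nextRow (2) < batch.size (4), so nextRow is not reset.
    System.out.println(firstRowAfterBoundary(new int[][] {{1, 2}, {3, 4, 5, 6}}));
    // Equal batch sizes hide the problem.
    System.out.println(firstRowAfterBoundary(new int[][] {{1, 2}, {3, 4}}));
  }
}
```

In this model the first call returns 5 instead of 3, because nextRow is still 2 when the larger batch is read. The second call returns 3, which is why equal batch sizes mask the issue.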

The conditions under which this case can occur:

  1. The first batch of a stripe will in most cases be larger than the last batch of the previous stripe. This is because ORC avoids loading rows from two stripes into the same batch, which makes the last batch of a stripe small.
  2. Filters may cause some row groups to be skipped. ORC only loads rows into a batch if they come from contiguous row groups within a stripe. So, similar to 1, the first batch of a contiguous run of row groups will in most cases be larger than the last batch of the previously read row groups.
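As a rough illustration of case 1, the sketch below computes the batch sizes a reader would produce if it fills batches up to a maximum size but never lets a batch span two stripes. The helper batchSizes, the stripe row counts, and the maximum of 1024 are assumptions chosen for the example, not values taken from ORC or from the failing data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it does not call into ORC.
class StripeBatchSizes {
  // Splits each stripe into batches of at most maxBatchSize rows,
  // never combining rows from two stripes in one batch.
  static List<Integer> batchSizes(int[] stripeRowCounts, int maxBatchSize) {
    List<Integer> sizes = new ArrayList<>();
    for (int stripeRows : stripeRowCounts) {
      int remaining = stripeRows;
      while (remaining > 0) {
        int size = Math.min(maxBatchSize, remaining);
        sizes.add(size);
        remaining -= size;
      }
    }
    return sizes;
  }

  public static void main(String[] args) {
    // Two stripes of 2500 and 3000 rows, read with a maximum batch size of 1024.
    System.out.println(batchSizes(new int[] {2500, 3000}, 1024));
  }
}
```

With these inputs the sizes are 1024, 1024, 452, 1024, 1024, 952. The 452-row batch at the end of the first stripe is followed by a 1024-row batch, which is exactly the small-then-large transition that triggers the skipped rows.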

This PR stores the current batch size in a local variable when the batch is loaded so that calls to #hasNext() do not affect the condition in #next().
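A minimal sketch of that idea, again using hypothetical stand-in classes (Batch, BatchIterator, FixedRowIterator) rather than the actual Iceberg code. The name currentBatchSize and the choice to keep it as a field are assumptions of this sketch; the point is only that the size is captured when the row iterator takes the batch, so a later load into the reused object cannot change it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-ins for illustration only; not the Iceberg or ORC classes.
class TrackedBatchSizeDemo {
  // One mutable batch object that is overwritten on every load.
  static class Batch {
    int[] rows = new int[0];
    int size = 0;
  }

  // hasNext() eagerly loads the next batch into the shared Batch object.
  static class BatchIterator {
    private final int[][] source;
    private final Batch batch = new Batch();
    private int nextBatch = 0;
    private boolean pending = false;

    BatchIterator(int[][] source) {
      this.source = source;
    }

    boolean hasNext() {
      if (!pending && nextBatch < source.length) {
        batch.rows = source[nextBatch];
        batch.size = source[nextBatch].length;
        nextBatch++;
        pending = true;
      }
      return pending;
    }

    Batch next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      pending = false;
      return batch;
    }
  }

  // Row iterator that remembers the size of the batch it is currently reading.
  static class FixedRowIterator implements Iterator<Integer> {
    private final BatchIterator batches;
    private Batch current = null;
    private int currentBatchSize = 0; // captured when the batch is taken
    private int nextRow = 0;

    FixedRowIterator(BatchIterator batches) {
      this.batches = batches;
    }

    @Override
    public boolean hasNext() {
      return (current != null && nextRow < currentBatchSize) || batches.hasNext();
    }

    @Override
    public Integer next() {
      if (current == null || nextRow >= currentBatchSize) {
        current = batches.next();
        currentBatchSize = current.size;
        nextRow = 0;
      }
      return current.rows[nextRow++];
    }
  }

  // Reads every row using the usual hasNext()/next() loop.
  static List<Integer> readAll(int[][] batches) {
    List<Integer> out = new ArrayList<>();
    Iterator<Integer> rows = new FixedRowIterator(new BatchIterator(batches));
    while (rows.hasNext()) {
      out.add(rows.next());
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(readAll(new int[][] {{1, 2}, {3, 4, 5, 6}}));
  }
}
```

Here a small batch followed by a larger one is read back in full and in order, since the comparison uses the remembered size of the consumed batch instead of the size of whatever the reused object currently holds.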

@shardulm94
Contributor Author

cc: @omalley @edgarRd @rdblue

@rdblue
Contributor

rdblue commented Nov 2, 2020

This looks like a serious problem to me. Case 1 above seems very likely. Do we need to create any patch releases for this or stop the 0.10.0 release to get this into 0.10.0? (FYI @aokolnychyi)

@aokolnychyi
Contributor

aokolnychyi commented Nov 2, 2020

I'd be in favor of failing RC2 and including this in 0.10.0.

@rdblue
Contributor

rdblue commented Nov 2, 2020

I ran the tests without the fix and they do break as expected. I'm going to merge this.

@rdblue rdblue merged commit 6c22c24 into apache:master Nov 2, 2020
@shardulm94
Contributor Author

I would also be in favor of including this in 0.10.0. It seems this issue has been here since the very beginning and went unnoticed. Most of our correctness tests use a small number of rows (100) and hence do not trigger this.
