ORC: Fix non-vectorized reader incorrectly skipping rows #1706

shardulm94 · 2020-11-02T09:44:46Z

The ORC non-vectorized code path uses the following condition in OrcRowIterator#next() to decide whether a new batch should be read and the row counter reset:
if (batch == null || nextRow >= batch.size) (Note: the code names the variable current rather than batch.)

Since the batch object is reused, #hasNext() can load a new batch into it once the existing batch has been consumed. In that case, the condition in #next() is expected to reset the row counter. However, if #hasNext() was called first, the batch variable already holds the data for the next batch, so batch.size returns the size of the newly loaded batch rather than the size of the batch that was just consumed.

In some cases the batch size stays the same from one batch to the next, and the current condition works fine. However, when the next batch is larger than the current one, nextRow >= batch.size is false even though a new batch was loaded. nextRow is then not reset, and rows at the start of the new batch are skipped.

The conditions under which this case can occur:

1. The first batch of a stripe will in most cases be larger than the last batch of the previous stripe. This is because ORC avoids loading rows from two stripes into the same batch, which leaves the last batch of a stripe small.
2. Filters may cause some row groups to be skipped. ORC only loads rows into a batch if they come from contiguous row groups within a stripe. So, as in case 1, the first batch of a contiguous run of row groups will in most cases be larger than the last batch of the previously read row group.
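To make the stripe-boundary case concrete, here is a small sketch of how batch sizes fall out when batches never span two stripes. It is only an illustration: the class and method names, the maximum batch size of 1024, and the stripe row counts are assumptions chosen for the example, not values taken from ORC or from this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class StripeBatchSizes {
  /**
   * Splits each stripe into batches of at most maxBatch rows.
   * A batch never crosses a stripe boundary, so the last batch
   * of a stripe holds whatever rows are left over.
   */
  static List<Integer> batchSizes(int maxBatch, int... stripeRowCounts) {
    List<Integer> sizes = new ArrayList<>();
    for (int rows : stripeRowCounts) {
      for (int remaining = rows; remaining > 0; remaining -= maxBatch) {
        sizes.add(Math.min(maxBatch, remaining));
      }
    }
    return sizes;
  }

  public static void main(String[] args) {
    // Two hypothetical stripes of 2500 rows each, read in batches of up to 1024 rows.
    System.out.println(batchSizes(1024, 2500, 2500));
    // prints [1024, 1024, 452, 1024, 1024, 452]
  }
}
```

The 452-row batch followed by a 1024-row batch is the small-then-large pattern under which the old condition fails to reset the row counter.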

This PR stores the current batch size in a local variable when the batch is loaded so that calls to #hasNext() do not affect the condition in #next().
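The following is a simplified, self-contained model of the problem and of the approach described above. It is not the Iceberg code: Batch, BatchIterator, RowIterator, the currentBatchSize field name, and the useFix switch are hypothetical stand-ins, and the batch iterator is assumed to refill one shared batch object as soon as its hasNext() is called. Rows are numbered 0, 1, 2, ... across two batches of sizes 3 and 5.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BatchSkipDemo {
  /** Toy stand-in for a reused row batch: one object, refilled on every load. */
  static final class Batch {
    int[] rows = new int[0];
    int size = 0;
  }

  /** Toy batch iterator (assumption): hasNext() eagerly refills the shared Batch. */
  static final class BatchIterator {
    private final int[] sizes;
    private final Batch batch = new Batch();
    private int nextBatch = 0;
    private int nextId = 0;
    private boolean advanced = false;

    BatchIterator(int... sizes) {
      this.sizes = sizes;
    }

    private void advance() {
      if (advanced) {
        return;
      }
      int n = nextBatch < sizes.length ? sizes[nextBatch++] : 0;
      batch.rows = new int[n];
      for (int i = 0; i < n; i++) {
        batch.rows[i] = nextId++;
      }
      batch.size = n; // overwrites the size of the batch that was just consumed
      advanced = true;
    }

    boolean hasNext() {
      advance();
      return batch.size > 0;
    }

    Batch next() {
      advance();
      advanced = false;
      return batch;
    }
  }

  /** Row iterator with the old condition (useFix = false) or the remembered size (useFix = true). */
  static final class RowIterator implements Iterator<Integer> {
    private final BatchIterator batches;
    private final boolean useFix;
    private Batch current;
    private int currentBatchSize; // hypothetical name: size captured when next() took the batch
    private int nextRow;

    RowIterator(BatchIterator batches, boolean useFix) {
      this.batches = batches;
      this.useFix = useFix;
    }

    private int sizeForCheck() {
      // Old behavior reads the live size, which hasNext() may already have changed.
      return useFix ? currentBatchSize : current.size;
    }

    @Override
    public boolean hasNext() {
      return (current != null && nextRow < sizeForCheck()) || batches.hasNext();
    }

    @Override
    public Integer next() {
      if (current == null || nextRow >= sizeForCheck()) {
        current = batches.next();
        currentBatchSize = current.size;
        nextRow = 0;
      }
      return current.rows[nextRow++];
    }
  }

  /** Reads up to limit rows using the usual hasNext()/next() loop. */
  static List<Integer> read(boolean useFix, int limit, int... sizes) {
    Iterator<Integer> it = new RowIterator(new BatchIterator(sizes), useFix);
    List<Integer> out = new ArrayList<>();
    while (out.size() < limit && it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(read(false, 4, 3, 5)); // prints [0, 1, 2, 6]
    System.out.println(read(true, 4, 3, 5));  // prints [0, 1, 2, 3]
  }
}
```

In this model, the fourth row returned with the old condition is row 6 instead of row 3: nextRow is still 3 when the 5-row batch has been loaded, so 3 >= 5 is false and the counter is not reset. Comparing against the size remembered at load time makes the check independent of any intervening hasNext() call.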

shardulm94 · 2020-11-02T09:45:05Z

cc: @omalley @edgarRd @rdblue

rdblue · 2020-11-02T21:50:18Z

This looks like a serious problem to me. Case 1 above seems very likely. Do we need to create any patch releases for this or stop the 0.10.0 release to get this into 0.10.0? (FYI @aokolnychyi)

aokolnychyi · 2020-11-02T21:53:24Z

I'd be in favor of failing RC2 and including this in 0.10.0

rdblue · 2020-11-02T22:00:05Z

I ran the tests without the fix and they do break as expected. I'm going to merge this.

shardulm94 · 2020-11-03T02:35:22Z

I would also be in favor of including this in 0.10.0. It seems this issue has been present since the very beginning and went unnoticed. Most of our correctness tests run on a small number of rows (100) and hence do not trigger it.

ORC: Fix non-vectorized reader incorrectly skipping rows

2bc9bc2

github-actions bot added data ORC labels Nov 2, 2020

Simplify the fix

e638b54

rdblue approved these changes Nov 2, 2020


rdblue merged commit 6c22c24 into apache:master Nov 2, 2020

chenjunjiedada mentioned this pull request Nov 5, 2020

Parquet: Add vectorized position reader #1356

Merged

rdblue added this to the Java 0.10.0 Release milestone Nov 16, 2020
