NPE in Parquet Writer Metrics when data value max bound will overflow #3760
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(1));
Assert.assertTrue("Should have a valid upper bound", dataFile.upperBounds().containsKey(1));
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(2));
It seems the truncateMin functions in UnicodeUtil and BinaryUtil never return null the way truncateMax does (they just truncate to the length and return).
I hope it is OK to have a lowerBound but no upperBound for this column?
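To make the asymmetry concrete, here is a minimal sketch of the two truncation directions over raw byte arrays. It is a simplified illustration, not the Iceberg implementation: the class name `TruncateSketch` and both method signatures are hypothetical, and bytes are treated as unsigned for ordering purposes.

```java
import java.util.Arrays;

// Hypothetical illustration; not the actual BinaryUtil code.
final class TruncateSketch {
  // Lower bound: a prefix always sorts <= the full value, so this never fails.
  static byte[] truncateMin(byte[] value, int length) {
    return value.length <= length ? value : Arrays.copyOf(value, length);
  }

  // Upper bound: the prefix must be incremented so it sorts >= the full value.
  // Returns null when every byte of the prefix is 0xFF and nothing can be incremented.
  static byte[] truncateMax(byte[] value, int length) {
    if (value.length <= length) {
      return value;
    }
    byte[] prefix = Arrays.copyOf(value, length);
    for (int i = length - 1; i >= 0; i--) {
      if (prefix[i] != (byte) 0xFF) {
        prefix[i]++; // safe: this byte is below 0xFF, so it cannot overflow
        return Arrays.copyOf(prefix, i + 1); // drop the trailing 0xFF bytes
      }
    }
    return null; // no valid truncated upper bound exists
  }
}
```

With an input made entirely of 0xFF bytes, `truncateMin` still returns a prefix while `truncateMax` returns null, which is the "lower bound but no upper bound" situation described above.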
I think we always do a null check before comparing, but let's check the metrics evaluators.
iceberg/api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Lines 231 to 237 in bf9a227
if (lowerBounds != null && lowerBounds.containsKey(id)) {
  T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
  if (NaNUtil.isNaN(lower)) {
    // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
    return ROWS_MIGHT_MATCH;
  }
iceberg/api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Lines 256 to 263 in bf9a227
if (upperBounds != null && upperBounds.containsKey(id)) {
  T upper = Conversions.fromByteBuffer(ref.type(), upperBounds.get(id));
  int cmp = lit.comparator().compare(upper, lit.value());
  if (cmp <= 0) {
    return ROWS_CANNOT_MATCH;
  }
}
Basically, this means that if the upper bound is missing (because the truncated value could not be calculated), we always fall back to ROWS_MIGHT_MATCH.
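The fallback can be sketched in isolation. The following is a simplified stand-in for the evaluator logic quoted above, assuming string bounds instead of serialized ByteBuffers and boolean constants for the two outcomes; the class `BoundsCheckSketch` and its `gt` signature are hypothetical.

```java
import java.util.Map;

// Hypothetical illustration of a "column > literal" check against file-level upper bounds.
final class BoundsCheckSketch {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  static boolean gt(Map<Integer, String> upperBounds, int id, String literal) {
    // Only prune the file when a bound is actually present for this column.
    if (upperBounds != null && upperBounds.containsKey(id)) {
      String upper = upperBounds.get(id);
      if (upper.compareTo(literal) <= 0) {
        return ROWS_CANNOT_MATCH; // every value is <= literal, so none is > literal
      }
    }
    // Missing bound: nothing can be ruled out, so the file must be read.
    return ROWS_MIGHT_MATCH;
  }
}
```

Under these assumptions, omitting the upper bound only costs pruning opportunities for that column; it does not change query results.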
@rdblue @aokolnychyi @RussellSpitzer @jackye1995 FYI, comments welcome on whether this is the right fix for the issue. Thanks!
@aokolnychyi @RussellSpitzer could you take a look when you have some time? It looks like the data is bad (I can update the test case to make it more generic), but these are possible Parquet strings, and Iceberg could be much more graceful when it gets such data.
case FIXED:
case BINARY:
  upperBounds.put(id, BinaryUtil.truncateBinaryMax((Literal<ByteBuffer>) max, truncateLength));
  Literal<ByteBuffer> truncatedMaxBinary = BinaryUtil.truncateBinaryMax((Literal<ByteBuffer>) max,
Do we need a test for this? It looks like we only have the Unicode test.
Yep, will give it a try.
RussellSpitzer
left a comment
OK, so after a quick discussion I think we listed a few things:
- Test for binary max value, truncate a binary value which can't be incremented (1111)
- Tests should check that upperBounds are set to null
- Tests can focus just on triggering the truncate, remove the sort order stuff to make it clear what we are testing
After that I think we are good to go!
@RussellSpitzer thanks a lot for the review, done. The upper bound check in the test was already there (I clarified the assert message a bit to make it clearer).
Renamed the issue to reflect other use cases that will hit this, following a deeper look during our conversation.
RussellSpitzer
left a comment
OK, this all looks good to me, let's merge it in.
Thanks @szehon-ho for the PR!
Thanks for the review!
A certain string in the input data (with a prefix of over 16 unparseable chars like high/low surrogates) triggered a NullPointerException in Parquet writer close:
The problem is that UnicodeUtil and BinaryUtil return null if they fail to calculate a truncated upper bound for the string/binary value (see UnicodeUtil.truncateStringMax() and BinaryUtil.truncateBinaryMax()). A null value in the upperBounds map then triggers an NPE in ParquetUtil.toBufferMap when it tries to call .getValue() on the null entry.
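The failure mode and one way to guard against it can be sketched as follows. This is an illustration under stated assumptions, not the code merged in the PR: `UpperBoundGuardSketch`, `putIfTruncated`, and the string-valued bounds map are hypothetical stand-ins, and `toBufferMap` here only mimics the shape of a conversion that dereferences every map value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the NPE and a null guard around the bound insertion.
final class UpperBoundGuardSketch {
  // Stores the upper bound only when truncation produced a value.
  static void putIfTruncated(Map<Integer, String> upperBounds, int id, String truncatedMax) {
    if (truncatedMax != null) {
      upperBounds.put(id, truncatedMax);
    }
  }

  // Dereferences every value, so a null entry throws NullPointerException.
  static Map<Integer, ByteBuffer> toBufferMap(Map<Integer, String> bounds) {
    Map<Integer, ByteBuffer> out = new HashMap<>();
    for (Map.Entry<Integer, String> entry : bounds.entrySet()) {
      byte[] bytes = entry.getValue().getBytes(StandardCharsets.UTF_8);
      out.put(entry.getKey(), ByteBuffer.wrap(bytes));
    }
    return out;
  }
}
```

Putting a null value straight into the map makes `toBufferMap` throw, whereas routing the insertion through `putIfTruncated` leaves the column without an upper bound and the conversion succeeds.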