
Conversation

@szehon-ho
Member

@szehon-ho szehon-ho commented Dec 16, 2021

A certain string in the input data (with a prefix of more than 16 unparseable characters, such as high/low surrogates) triggered a NullPointerException during Parquet writer close:

java.lang.NullPointerException
		at org.apache.iceberg.parquet.ParquetUtil.toBufferMap(ParquetUtil.java:267)
		at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:152)
		at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:85)
		at org.apache.iceberg.parquet.ParquetWriter.metrics(ParquetWriter.java:151)
		at org.apache.iceberg.io.DataWriter.close(DataWriter.java:78)
		at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.closeCurrent(BaseTaskWriter.java:282)
		at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.close(BaseTaskWriter.java:298)
		at org.apache.iceberg.io.PartitionedWriter.close(PartitionedWriter.java:82)
		at org.apache.iceberg.io.BaseTaskWriter.abort(BaseTaskWriter.java:72)
		at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$6(WriteToDataSourceV2Exec.scala:447)
		at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1484)

The problem is that UnicodeUtil and BinaryUtil return null when they fail to compute a truncated upper bound for the string/binary (see UnicodeUtil.truncateStringMax() and BinaryUtil.truncateBinaryMax()). A null value in the upperBounds map then triggers an NPE in the ParquetUtil.toBufferMap method as it calls .value() on the null entry.
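To see why a null can come back at all, here is a rough sketch (a hypothetical simplification, not Iceberg's actual BinaryUtil code) of what truncate-max logic has to do: after truncating, it must increment the last byte so the result is still an upper bound, and when every truncated byte is already at its maximum there is nothing to increment, so no valid truncated upper bound exists:

```java
import java.util.Arrays;

public class TruncateMaxSketch {
  // Hypothetical simplification of BinaryUtil.truncateBinaryMax-style logic:
  // truncate to `length` bytes, then increment the last non-0xFF byte so the
  // result remains an upper bound for the original value. If every truncated
  // byte is 0xFF, incrementing would overflow, so we return null -- the case
  // that leaked into the metrics map and caused the NPE.
  public static byte[] truncateBinaryMax(byte[] value, int length) {
    if (value.length <= length) {
      return value; // already short enough, no truncation needed
    }
    byte[] truncated = Arrays.copyOf(value, length);
    for (int i = length - 1; i >= 0; i--) {
      if (truncated[i] != (byte) 0xFF) {
        truncated[i]++; // bump the last incrementable byte
        return Arrays.copyOf(truncated, i + 1);
      }
    }
    return null; // all bytes 0xFF: truncated upper bound would overflow
  }

  public static void main(String[] args) {
    byte[] incrementable = {0x01, 0x02, 0x03};
    byte[] allMax = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
    System.out.println(Arrays.toString(truncateBinaryMax(incrementable, 2))); // [1, 3]
    System.out.println(truncateBinaryMax(allMax, 2) == null); // true
  }
}
```

The same overflow shape applies to the string case: truncateStringMax has to increment the last code point, which can also be impossible.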

  private static Map<Integer, ByteBuffer> toBufferMap(Schema schema, Map<Integer, Literal<?>> map) {
    Map<Integer, ByteBuffer> bufferMap = Maps.newHashMap();
    for (Map.Entry<Integer, Literal<?>> entry : map.entrySet()) {
      bufferMap.put(entry.getKey(),
          Conversions.toByteBuffer(schema.findType(entry.getKey()), entry.getValue().value()));
    }
    return bufferMap;
  }
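One defensive direction (a sketch with deliberately simplified types, not necessarily the exact merged change) is to skip entries whose bound came back null rather than dereferencing them, so the column simply gets no recorded bound:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class NullSafeBufferMap {
  // Simplified stand-in for toBufferMap: field id -> serialized bound. The
  // real method maps Literal<?> values through Conversions.toByteBuffer;
  // plain strings stand in for literals here so the sketch is self-contained.
  public static Map<Integer, ByteBuffer> toBufferMap(Map<Integer, String> bounds) {
    Map<Integer, ByteBuffer> bufferMap = new HashMap<>();
    for (Map.Entry<Integer, String> entry : bounds.entrySet()) {
      String bound = entry.getValue();
      if (bound == null) {
        continue; // truncation overflowed: record no bound for this column
      }
      bufferMap.put(entry.getKey(), ByteBuffer.wrap(bound.getBytes(StandardCharsets.UTF_8)));
    }
    return bufferMap;
  }

  public static void main(String[] args) {
    Map<Integer, String> bounds = new HashMap<>();
    bounds.put(1, "abc");
    bounds.put(2, null); // e.g. the truncated upper bound could not be computed
    Map<Integer, ByteBuffer> buffers = toBufferMap(bounds);
    System.out.println(buffers.containsKey(1)); // true
    System.out.println(buffers.containsKey(2)); // false: column omitted, no NPE
  }
}
```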


Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(1));
Assert.assertTrue("Should have a valid upper bound", dataFile.upperBounds().containsKey(1));
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(2));
Member Author

@szehon-ho szehon-ho Dec 16, 2021


It seems the UnicodeUtil and BinaryUtil truncateMin functions never return null the way truncateMax does (they just truncate to the length and return).

I hope it is ok to have a lowerBound but no upperBound for this column?

Member


I think we always do a null check before comparing, but let's check the metrics evaluators.

Member


if (lowerBounds != null && lowerBounds.containsKey(id)) {
  T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
  if (NaNUtil.isNaN(lower)) {
    // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
    return ROWS_MIGHT_MATCH;
  }

if (upperBounds != null && upperBounds.containsKey(id)) {
  T upper = Conversions.fromByteBuffer(ref.type(), upperBounds.get(id));
  int cmp = lit.comparator().compare(upper, lit.value());
  if (cmp <= 0) {
    return ROWS_CANNOT_MATCH;
  }
}
<- Yep should be safe

Member


Basically this means that if the upper bound is missing (because the truncated bound could not be calculated), we always fall back to ROWS_MIGHT_MATCH.
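In miniature, the evaluator branch quoted above behaves like this sketch (a hypothetical reduction for a `col > lit` predicate, not Iceberg's actual API):

```java
public class BoundFallbackSketch {
  // Mirrors the shape of the quoted evaluator logic: for a predicate
  // `col > lit`, a file can only be pruned when its recorded upper bound is
  // <= lit. A missing upper bound proves nothing, so we conservatively
  // report that rows might match.
  public static boolean mightMatchGreaterThan(Integer upperBound, int lit) {
    if (upperBound == null) {
      return true; // no upper bound recorded: fall back to "rows might match"
    }
    return upperBound > lit; // upper <= lit means no row can exceed lit
  }

  public static void main(String[] args) {
    System.out.println(mightMatchGreaterThan(null, 5)); // true (cannot prune)
    System.out.println(mightMatchGreaterThan(3, 5));    // false (file prunable)
    System.out.println(mightMatchGreaterThan(9, 5));    // true
  }
}
```

A missing bound therefore only costs pruning opportunities; it never produces a wrong answer.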

@szehon-ho
Member Author

szehon-ho commented Dec 16, 2021

@rdblue @aokolnychyi @RussellSpitzer @jackye1995 fyi for comments on whether this is the right fix for the issue, thanks!

@szehon-ho
Member Author

@aokolnychyi @RussellSpitzer could you guys take a look when you have some time? Looks like the data is bad (I can update the test case to make it more generic), but these are possible Parquet strings, and Iceberg could be much more graceful when getting such data.

case FIXED:
case BINARY:
  upperBounds.put(id, BinaryUtil.truncateBinaryMax((Literal<ByteBuffer>) max, truncateLength));
  Literal<ByteBuffer> truncatedMaxBinary = BinaryUtil.truncateBinaryMax((Literal<ByteBuffer>) max,
Member


Do we need a test for this? Looks like we only have the Unicode test?

Member Author


Yep, will give it a try.

Member

@RussellSpitzer RussellSpitzer left a comment


Ok, so after a quick discussion I think we listed a few things:

  1. Test for binary max value, truncate a binary value which can't be incremented (1111)
  2. Tests should check that upperBounds are set to null
  3. Tests can focus just on triggering the truncate, remove the sort order stuff to make it clear what we are testing

After that I think we are good to go!

@szehon-ho
Member Author

@RussellSpitzer thanks a lot for the review, done. The upper bound check in the test was already there (I clarified the assert message to make it clearer).

@szehon-ho szehon-ho changed the title NPE in Parquet Writer Metrics when writing un-truncatable strings NPE in Parquet Writer Metrics when data value max bound will overflow Jan 24, 2022
@szehon-ho
Member Author

Renamed the issue to reflect other use cases that will hit this, following a deeper look during our conversation.

Member

@RussellSpitzer RussellSpitzer left a comment


Ok this all looks good to me, let's merge it in

@RussellSpitzer RussellSpitzer merged commit 704cd8c into apache:master Jan 24, 2022
@RussellSpitzer
Member

Thanks @szehon-ho for the pr!

@szehon-ho
Member Author

Thanks for the review!
