
Conversation

@szehon-ho
Member

@szehon-ho szehon-ho commented Dec 16, 2021

A certain string in the input data (with a prefix of more than 16 unparseable characters, such as high/low surrogates) triggered a NullPointerException during Parquet writer close:

java.lang.NullPointerException
		at org.apache.iceberg.parquet.ParquetUtil.toBufferMap(ParquetUtil.java:267)
		at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:152)
		at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:85)
		at org.apache.iceberg.parquet.ParquetWriter.metrics(ParquetWriter.java:151)
		at org.apache.iceberg.io.DataWriter.close(DataWriter.java:78)
		at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.closeCurrent(BaseTaskWriter.java:282)
		at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.close(BaseTaskWriter.java:298)
		at org.apache.iceberg.io.PartitionedWriter.close(PartitionedWriter.java:82)
		at org.apache.iceberg.io.BaseTaskWriter.abort(BaseTaskWriter.java:72)
		at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$6(WriteToDataSourceV2Exec.scala:447)
		at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1484)

The problem is that UnicodeUtil and BinaryUtil return null when they fail to compute a truncated upper bound for the string/binary (see UnicodeUtil.truncateStringMax() and BinaryUtil.truncateBinaryMax()). A null value in the upperBounds map then triggers an NPE in the ParquetUtil.toBufferMap method as it calls .value() on the null entry.
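To see why a null can come back at all, here is a rough sketch (a hypothetical simplification, not Iceberg's actual BinaryUtil code) of what truncate-max logic has to do: after truncating, it must increment the last byte so the result is still an upper bound, and when every truncated byte is already at its maximum there is nothing to increment, so no valid truncated upper bound exists:

```java
import java.util.Arrays;

public class TruncateMaxSketch {
  // Hypothetical simplification of BinaryUtil.truncateBinaryMax-style logic:
  // truncate to `length` bytes, then increment the last non-0xFF byte so the
  // result remains an upper bound for the original value. If every truncated
  // byte is 0xFF, incrementing would overflow, so we return null -- the case
  // that leaked into the metrics map and caused the NPE.
  public static byte[] truncateBinaryMax(byte[] value, int length) {
    if (value.length <= length) {
      return value; // already short enough, no truncation needed
    }
    byte[] truncated = Arrays.copyOf(value, length);
    for (int i = length - 1; i >= 0; i--) {
      if (truncated[i] != (byte) 0xFF) {
        truncated[i]++; // bump the last incrementable byte
        return Arrays.copyOf(truncated, i + 1);
      }
    }
    return null; // all bytes 0xFF: truncated upper bound would overflow
  }

  public static void main(String[] args) {
    byte[] incrementable = {0x01, 0x02, 0x03};
    byte[] allMax = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
    System.out.println(Arrays.toString(truncateBinaryMax(incrementable, 2))); // [1, 3]
    System.out.println(truncateBinaryMax(allMax, 2) == null); // true
  }
}
```

The same overflow shape applies to the string case: truncateStringMax has to increment the last code point, which can also be impossible.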

  private static Map<Integer, ByteBuffer> toBufferMap(Schema schema, Map<Integer, Literal<?>> map) {
    Map<Integer, ByteBuffer> bufferMap = Maps.newHashMap();
    for (Map.Entry<Integer, Literal<?>> entry : map.entrySet()) {
      bufferMap.put(entry.getKey(),
          Conversions.toByteBuffer(schema.findType(entry.getKey()), entry.getValue().value()));
    }
    return bufferMap;
  }
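One defensive direction (a sketch with deliberately simplified types, not necessarily the exact merged change) is to skip entries whose bound came back null rather than dereferencing them, so the column simply gets no recorded bound:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class NullSafeBufferMap {
  // Simplified stand-in for toBufferMap: field id -> serialized bound. The
  // real method maps Literal<?> values through Conversions.toByteBuffer;
  // plain strings stand in for literals here so the sketch is self-contained.
  public static Map<Integer, ByteBuffer> toBufferMap(Map<Integer, String> bounds) {
    Map<Integer, ByteBuffer> bufferMap = new HashMap<>();
    for (Map.Entry<Integer, String> entry : bounds.entrySet()) {
      String bound = entry.getValue();
      if (bound == null) {
        continue; // truncation overflowed: record no bound for this column
      }
      bufferMap.put(entry.getKey(), ByteBuffer.wrap(bound.getBytes(StandardCharsets.UTF_8)));
    }
    return bufferMap;
  }

  public static void main(String[] args) {
    Map<Integer, String> bounds = new HashMap<>();
    bounds.put(1, "abc");
    bounds.put(2, null); // e.g. the truncated upper bound could not be computed
    Map<Integer, ByteBuffer> buffers = toBufferMap(bounds);
    System.out.println(buffers.containsKey(1)); // true
    System.out.println(buffers.containsKey(2)); // false: column omitted, no NPE
  }
}
```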


Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(1));
Assert.assertTrue("Should have a valid upper bound", dataFile.upperBounds().containsKey(1));
Assert.assertTrue("Should have a valid lower bound", dataFile.lowerBounds().containsKey(2));
Member Author

@szehon-ho szehon-ho Dec 16, 2021


It seems the UnicodeUtil and BinaryUtil truncateMin functions never return null the way truncateMax does (they just truncate to the length and return).

I hope it is ok to have a lowerBound but no upperBound for this column?

Member


I think we always do a null check before comparing, but let's check the metrics evaluators.

Member


if (lowerBounds != null && lowerBounds.containsKey(id)) {
  T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
  if (NaNUtil.isNaN(lower)) {
    // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
    return ROWS_MIGHT_MATCH;
  }

if (upperBounds != null && upperBounds.containsKey(id)) {
  T upper = Conversions.fromByteBuffer(ref.type(), upperBounds.get(id));
  int cmp = lit.comparator().compare(upper, lit.value());
  if (cmp <= 0) {
    return ROWS_CANNOT_MATCH;
  }
}
<- Yep should be safe

Member


Basically this means that if the upper bound is missing (because the truncated bound could not be calculated), we always fall back to ROWS_MIGHT_MATCH.
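In miniature, the evaluator branch quoted above behaves like this sketch (a hypothetical reduction for a `col > lit` predicate, not Iceberg's actual API):

```java
public class BoundFallbackSketch {
  // Mirrors the shape of the quoted evaluator logic: for a predicate
  // `col > lit`, a file can only be pruned when its recorded upper bound is
  // <= lit. A missing upper bound proves nothing, so we conservatively
  // report that rows might match.
  public static boolean mightMatchGreaterThan(Integer upperBound, int lit) {
    if (upperBound == null) {
      return true; // no upper bound recorded: fall back to "rows might match"
    }
    return upperBound > lit; // upper <= lit means no row can exceed lit
  }

  public static void main(String[] args) {
    System.out.println(mightMatchGreaterThan(null, 5)); // true (cannot prune)
    System.out.println(mightMatchGreaterThan(3, 5));    // false (file prunable)
    System.out.println(mightMatchGreaterThan(9, 5));    // true
  }
}
```

A missing bound therefore only costs pruning opportunities; it never produces a wrong answer.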

@szehon-ho
Member Author

szehon-ho commented Dec 16, 2021

@rdblue @aokolnychyi @RussellSpitzer @jackye1995 fyi for comments on whether this is the right fix for the issue, thanks!

@szehon-ho
Member Author

@aokolnychyi @RussellSpitzer could you guys take a look when you have some time? Looks like the data is bad (I can update the test case to make it more generic), but these are possible Parquet strings, and Iceberg could be much more graceful when getting such data.

case FIXED:
case BINARY:
  upperBounds.put(id, BinaryUtil.truncateBinaryMax((Literal<ByteBuffer>) max, truncateLength));
  Literal<ByteBuffer> truncatedMaxBinary = BinaryUtil.truncateBinaryMax((Literal<ByteBuffer>) max,
Member


Do we need a test for this? Looks like we only have the Unicode test?

Member Author


Yep, will give it a try.

Member

@RussellSpitzer RussellSpitzer left a comment


Ok, so after a quick discussion I think we listed a few things:

  1. Test for binary max value, truncate a binary value which can't be incremented (1111)
  2. Tests should check that upperBounds are set to null
  3. Tests can focus just on triggering the truncate, remove the sort order stuff to make it clear what we are testing

After that I think we are good to go!

@szehon-ho
Member Author

@RussellSpitzer thanks a lot for the review, done. The upper bound check in the test was already there (I clarified the assert message to make it clearer).

@szehon-ho szehon-ho changed the title NPE in Parquet Writer Metrics when writing un-truncatable strings NPE in Parquet Writer Metrics when data value max bound will overflow Jan 24, 2022
@szehon-ho
Member Author

Renamed the issue to reflect other use cases that will hit this, following a deeper look during our conversation.

Member

@RussellSpitzer RussellSpitzer left a comment


Ok this all looks good to me, let's merge it in

@RussellSpitzer RussellSpitzer merged commit 704cd8c into apache:master Jan 24, 2022
@RussellSpitzer
Member

Thanks @szehon-ho for the pr!

@szehon-ho
Member Author

Thanks for the review!
