
Conversation

@rdblue rdblue commented Dec 18, 2020

This updates Spark's DELETE FROM command to sort the retained rows by original file and position, so the command preserves the original data clustering.

Because Spark does not yet support metadata columns, this exposes _file and _pos by adding them automatically to all merge scans. Projecting both columns was already mostly supported; only minor changes were needed to project _file using the constants map supported by Avro, Parquet, and ORC.

This also required refactoring DynamicFileFilter. When projecting _file and _pos but only using _file, the optimizer would throw an exception that the node could not be copied, because it was trying to rewrite the node with a projection that removed the unused _pos. The fix is to update DynamicFileFilter so that the SupportsFileFilter is passed separately; the scan can then be passed as a logical plan that the planner can rewrite. This also required updating the conversion to a physical plan, because the scan plan may be more complicated than a single scan node: a new logical plan wrapper ensures the scan is converted to an extended scan, so planLater can be used during conversion as normal.
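To make the rewrite concrete, here is a minimal, illustrative Scala sketch of the plan shape described above. The helper and parameter names (rewriteDelete, matchingRowsPlan, deleteCond) are assumptions for illustration, not the actual implementation; findOutputAttr is a simplified stand-in for the helper used later in this PR. The retained rows are the ones that do not match the delete condition, and they are sorted by _file and _pos before the replacement files are written.

  import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, Expression, Not, SortOrder}
  import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Sort}

  // Simplified stand-in for the findOutputAttr helper: find a column by name in a plan's output.
  def findOutputAttr(plan: LogicalPlan, name: String): Attribute =
    plan.output.find(_.name == name)
      .getOrElse(throw new IllegalArgumentException(s"Cannot find $name in ${plan.output}"))

  // Illustrative rewrite: keep rows that do not match the delete condition, then sort the
  // remaining rows by original file and position to preserve the data clustering.
  def rewriteDelete(matchingRowsPlan: LogicalPlan, deleteCond: Expression): LogicalPlan = {
    val remainingRowsPlan = Filter(Not(deleteCond), matchingRowsPlan)
    val fileNameCol = findOutputAttr(remainingRowsPlan, "_file")
    val rowPosCol = findOutputAttr(remainingRowsPlan, "_pos")
    val order = Seq(SortOrder(fileNameCol, Ascending), SortOrder(rowPosCol, Ascending))
    // A local sort is shown here for illustration; the actual plan may differ.
    Sort(order, global = false, remainingRowsPlan)
  }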

  } else {
    Preconditions.checkArgument(
-       field.isOptional() || field.fieldId() == MetadataColumns.ROW_POSITION.fieldId(),
+       field.isOptional() || MetadataColumns.metadataFieldIds().contains(field.fieldId()),
rdblue (Contributor Author):

Needed to allow projecting _file even though it isn't in the data file.

Member:

This reminds me that we need to fix that projection/selection bug.

  // add _file
  idToConstant.put(
      MetadataColumns.FILE_PATH.fieldId(),
      convertConstant.apply(Types.StringType.get(), task.file().path()));
rdblue (Contributor Author):

This adds _file to the constants map so it is set in records like a partition value.
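As a rough illustration of what "set in records like a partition value" means (a hedged sketch; resolveField is a hypothetical helper, not Iceberg's reader code): when a projected field id appears in the constants map, its value comes from the map instead of the data file, so _file resolves to task.file().path() for every row read from that file.

  // Hypothetical helper, for illustration only: prefer the constant value when present,
  // otherwise read the value from the data file.
  def resolveField(fieldId: Int, idToConstant: Map[Int, Any], readFromFile: () => Any): Any =
    idToConstant.getOrElse(fieldId, readFromFile())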


  val scan = scanBuilder.build()
- val scanRelation = DataSourceV2ScanRelation(table, scan, output)
+ val scanRelation = DataSourceV2ScanRelation(table, scan, toOutputAttrs(scan.readSchema(), output))
rdblue (Contributor Author):

Spark's contract is that the scan's schema is the one that should be used, not the original table schema. This allows the merge scan to return the extra _file and _pos columns and matches the behavior of normal scans that are configured with PushDownUtils.pruneColumns.
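For illustration, a sketch of what toOutputAttrs is assumed to do here (not necessarily the exact implementation): derive the output attributes from the scan's read schema, reusing the expression ids of matching table attributes so existing references stay valid, while extra metadata columns like _file and _pos get fresh attributes.

  import org.apache.spark.sql.catalyst.expressions.AttributeReference
  import org.apache.spark.sql.types.StructType

  // Assumed behavior only: map the read schema to attributes, keeping expr ids for columns
  // that already exist in the relation's output.
  def toOutputAttrs(schema: StructType, output: Seq[AttributeReference]): Seq[AttributeReference] = {
    val nameToAttr = output.map(a => a.name -> a).toMap
    schema.toAttributes.map { attr =>
      nameToAttr.get(attr.name) match {
        case Some(original) => attr.withExprId(original.exprId)
        case None => attr   // new metadata columns such as _file and _pos
      }
    }
  }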

@aokolnychyi (Contributor):

It looks like vectorized Parquet is failing, @rdblue.


val fileNameCol = findOutputAttr(remainingRowsPlan, FILE_NAME_COL)
val rowPosCol = findOutputAttr(remainingRowsPlan, ROW_POS_COL)
val order = Seq(SortOrder(fileNameCol, Ascending), SortOrder(rowPosCol, Ascending))
Contributor:

+1

@aokolnychyi aokolnychyi commented Dec 18, 2020

I confirm that tests for the non-vectorized Parquet path work fine. It seems the position is left unpopulated only in the vectorized path.

@chenjunjiedada (Collaborator):

You may want this PR: #1356.

@rdblue rdblue commented Dec 18, 2020

Merged #1356. Thanks for fixing that, @chenjunjiedada! Sorry I didn't get back to review that before now.

@rdblue rdblue commented Dec 19, 2020

Rebased and fixed a couple of bugs with _file and _pos in vectorized Parquet reads.

  }
  for (int i = 0; i < numValsToRead; i += 1) {
    BitVectorHelper.setValidityBitToOne(vec.getValidityBuffer(), i);
  }
rdblue (Contributor Author):

This is needed for cases where Arrow checks the validity buffer.

-   } else {
-     idToConstant = ImmutableMap.of();
-   }
+   Map<Integer, ?> idToConstant = PartitionUtil.constantsMap(task, BatchDataReader::convertConstant);
rdblue (Contributor Author):

It isn't necessary to check whether there are projected ID columns. The code is shorter if the values are available by default, even if they aren't used. This fixes the problem where there are constants to add (like _file) but no identity partition values are projected.

@aokolnychyi (Contributor):

I cloned this PR locally and tests seem to work. Thanks, @chenjunjiedada and @rdblue.

I left a few questions but the rest looks great!

@rdblue rdblue merged commit bafda61 into apache:master Dec 21, 2020
@aokolnychyi (Contributor):

Thanks @rdblue and everyone who reviewed.
