Spark: Sort retained rows in DELETE FROM by file and position #1955
Conversation
  } else {
    Preconditions.checkArgument(
-       field.isOptional() || field.fieldId() == MetadataColumns.ROW_POSITION.fieldId(),
+       field.isOptional() || MetadataColumns.metadataFieldIds().contains(field.fieldId()),
Needed to allow projecting _file even though it isn't in the data file.
This reminds me that we still need to fix that projection/selection bug.
    // add _file
    idToConstant.put(
        MetadataColumns.FILE_PATH.fieldId(),
        convertConstant.apply(Types.StringType.get(), task.file().path()));
This adds _file to the constants map so it is set in records like a partition value.
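A minimal, self-contained sketch of the idea (the class and method names below are illustrative, not Iceberg's actual reader code): a constants map keyed by field id lets the value readers return a fixed per-task value for columns such as _file instead of decoding them from the data file, exactly like identity partition values.

```scala
// Sketch only: how an idToConstant map can short-circuit column reads for
// columns that are not stored in the data file (e.g. _file, identity partitions).
object ConstantsMapSketch {
  sealed trait ValueReader { def read(row: Long): Any }

  // Returns the same value for every row in the task.
  final case class ConstantReader(value: Any) extends ValueReader {
    def read(row: Long): Any = value
  }

  // Stand-in for a reader that actually decodes values from the file.
  final case class ColumnReader(fieldId: Int) extends ValueReader {
    def read(row: Long): Any = s"decoded(field=$fieldId, row=$row)"
  }

  def readerFor(fieldId: Int, idToConstant: Map[Int, Any]): ValueReader =
    idToConstant.get(fieldId)
      .map(value => ConstantReader(value))
      .getOrElse(ColumnReader(fieldId))

  def main(args: Array[String]): Unit = {
    val fileFieldId = 1000 // stand-in for MetadataColumns.FILE_PATH.fieldId()
    val idToConstant = Map(fileFieldId -> "s3://bucket/db/table/data/00000-0.parquet")
    println(readerFor(fileFieldId, idToConstant).read(0)) // constant _file value for every row
    println(readerFor(1, idToConstant).read(0))           // read from the data file as usual
  }
}
```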
    val scan = scanBuilder.build()
-   val scanRelation = DataSourceV2ScanRelation(table, scan, output)
+   val scanRelation = DataSourceV2ScanRelation(table, scan, toOutputAttrs(scan.readSchema(), output))
Spark's contract is that the scan's schema is the one that should be used, not the original table schema. This allows the merge scan to return the extra _file and _pos columns and matches the behavior of normal scans that are configured with PushDownUtils.pruneColumns.
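For illustration, a rough sketch of what a toOutputAttrs helper could look like (the helper name comes from the diff above; this body is an assumption, not the PR's code): reuse the original attributes where the names match so existing references in the plan still resolve, and create fresh attributes for the extra columns the scan adds, such as _file and _pos.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.StructType

// Sketch only: build the relation's output from the scan's readSchema rather than
// the original table schema. Columns that already existed keep their attributes;
// new metadata columns get fresh AttributeReferences.
def toOutputAttrs(
    readSchema: StructType,
    original: Seq[AttributeReference]): Seq[AttributeReference] = {
  val nameToAttr = original.map(a => a.name -> a).toMap
  readSchema.fields.toSeq.map { field =>
    nameToAttr.getOrElse(
      field.name,
      AttributeReference(field.name, field.dataType, field.nullable, field.metadata)())
  }
}
```

Spark's own helper may instead rebuild each attribute and copy over the original exprId; either way the point is that the scan's readSchema, not the table schema, drives the output.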
It looks like vectorized Parquet is failing, @rdblue.
    val fileNameCol = findOutputAttr(remainingRowsPlan, FILE_NAME_COL)
    val rowPosCol = findOutputAttr(remainingRowsPlan, ROW_POS_COL)
    val order = Seq(SortOrder(fileNameCol, Ascending), SortOrder(rowPosCol, Ascending))
+1
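For context, a sketch of how an ordering like the one built above is typically applied (the PR's actual rewrite rule may differ in details): wrap the remaining-rows plan in a catalyst Sort node before the rows are written back.

```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}

// Sketch only: sort the retained rows by (file, pos) so the rewritten files keep
// the original clustering. A local sort (global = false) is enough because the
// ordering only needs to hold within each write task, not across the whole job.
def sortByFileAndPos(
    remainingRowsPlan: LogicalPlan,
    fileNameCol: Attribute,
    rowPosCol: Attribute): LogicalPlan = {
  val order = Seq(SortOrder(fileNameCol, Ascending), SortOrder(rowPosCol, Ascending))
  Sort(order, global = false, remainingRowsPlan)
}
```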
Resolved review thread: spark3/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java
Resolved review thread (outdated): ...nsions/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/ExtendedScanRelation.scala
I confirm that tests for the non-vectorized Parquet path are working fine. It seems we only fail to populate the position in the vectorized path.
You may want this PR: #1356.
Merged #1356. Thanks for fixing that, @chenjunjiedada! Sorry I didn't get back to review it sooner.
Force-pushed 934a375 to 7a87a43.
Rebased and fixed a couple of bugs with
    }
    for (int i = 0; i < numValsToRead; i += 1) {
      BitVectorHelper.setValidityBitToOne(vec.getValidityBuffer(), i);
    }
This is needed for cases where Arrow checks the validity buffer.
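A hedged sketch of the failure mode, assuming an Arrow version that provides the same calls used in the diff above (ArrowBuf.setLong, BitVectorHelper.setValidityBitToOne): values written straight into a vector's data buffer are invisible to any caller that consults the validity buffer until the validity bits are set explicitly.

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{BigIntVector, BitVectorHelper}

// Sketch only: writing directly into the data buffer (as a vectorized reader
// might for _pos) does not mark slots valid, so validity-checking callers see null.
object ValidityBufferSketch {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val vec = new BigIntVector("_pos", allocator)
    vec.allocateNew(4)

    // write positions straight into the data buffer (8 bytes per BIGINT value)
    for (i <- 0 until 4) {
      vec.getDataBuffer.setLong(i * 8L, i.toLong)
    }
    vec.setValueCount(4)

    println(vec.isNull(0)) // true: the validity bit was never set, so the value looks null

    // mark every slot valid, mirroring the loop added in the diff above
    for (i <- 0 until 4) {
      BitVectorHelper.setValidityBitToOne(vec.getValidityBuffer, i)
    }
    println(vec.isNull(0)) // false: the value is now visible to validity checks

    vec.close()
    allocator.close()
  }
}
```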
Resolved review thread: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
-   } else {
-     idToConstant = ImmutableMap.of();
-   }
+   Map<Integer, ?> idToConstant = PartitionUtil.constantsMap(task, BatchDataReader::convertConstant);
It isn't necessary to check whether any identity partition columns are projected. The code is shorter if the values are available by default, even if they aren't used. This fixes the problem where there are constants to add (like _file) but no identity partition values are projected.
Resolved review thread (outdated): spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDelete.scala
Resolved review thread (outdated): ...sions/src/test/java/org/apache/iceberg/spark/extensions/SparkRowLevelOperationsTestBase.java
I cloned this PR locally and tests seem to work. Thanks, @chenjunjiedada and @rdblue. I left a few questions but the rest looks great!
Force-pushed ffee55c to 5b3b8d8.
Thanks @rdblue and everyone who reviewed.
This updates Spark's DELETE FROM command to sort the retained rows by original file and position, ensuring that the original data clustering is preserved by the command.

Because Spark does not yet support metadata columns, this exposes _file and _pos by adding them automatically to all merge scans. Projecting both columns was mostly supported already; only minor changes were needed to project _file using the constants map supported by Avro, Parquet, and ORC.

This also required refactoring DynamicFileFilter. When projecting _file and _pos but only using _file, the optimizer would throw an exception that the node could not be copied, because it was attempting to rewrite the node with a projection that removed the unused _pos. The fix is to update DynamicFileFilter so that the SupportsFileFilter is passed separately. Then the scan can be passed as a logical plan that the planner can rewrite. This also required updating conversion to physical plan, because the scan plan may be more complicated than a single scan node; a new logical plan wrapper ensures that the scan is converted to an extended scan so that planLater can be used in conversion as normal.
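At the usage level the change is transparent. A small illustration (the catalog, table, and predicate below are made up, and it assumes the Iceberg runtime and SQL extensions are configured on the session): the command simply rewrites the affected files, and with this PR the retained rows are written back in their original file/position order.

```scala
import org.apache.spark.sql.SparkSession

// Illustration only: DELETE FROM rewrites any data file that contains matching
// rows; the retained rows are now sorted by _file and _pos before being written,
// so the rewritten files keep the table's existing clustering.
object DeleteFromExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delete-from-example")
      .getOrCreate()

    // Assumes an Iceberg table registered as catalog.db.events with an event_date column.
    spark.sql("DELETE FROM catalog.db.events WHERE event_date < DATE '2020-01-01'")

    spark.stop()
  }
}
```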