Spark 3.4: Distribution and ordering enhancements #7637
Conversation
import org.apache.spark.sql.connector.expressions.SortOrder;

/** A set of requirements such as distribution and ordering reported to Spark during writes. */
public class SparkWriteRequirements {
In the future, this class would also hold the advisory partition size.
 *
 * <p>Note it is an evolving internal API that is subject to change even in minor releases.
 */
public class SparkWriteUtil {
I'd recommend checking the tests first to see what changed.
// write mode is NOT SET -> CLUSTER BY date, days(ts) + LOCALLY ORDER BY date, days(ts)
// write mode is NOT SET (fanout) -> CLUSTER BY date, days(ts) + empty ordering
// write mode is NONE -> unspecified distribution + LOCALLY ORDERED BY date, days(ts)
// write mode is NONE (fanout) -> unspecified distribution + empty ordering
I disabled the local sort by partition columns in regular writes and in CoW operations when fanout writers are enabled. If the table sort order is undefined, there is no need to sort records by partition columns when we are not using clustered writers.
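For reference, both knobs in the matrix above are controlled through table properties. A minimal sketch, assuming an already-loaded Iceberg `Table` handle:

```java
import org.apache.iceberg.Table;

class WriteModeExample {
  // Sets the write mode to NONE and enables fanout writers, which together yield
  // "unspecified distribution + empty ordering" from the matrix above.
  static void configure(Table table) {
    table
        .updateProperties()
        .set("write.distribution-mode", "none")
        .set("write.spark.fanout.enabled", "true")
        .commit();
  }
}
```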
Curious, would it be useful to have all these distribution and ordering examples accessible in our public docs?
@amogh-jahagirdar, yes. It would be good to have these documented.
Will add. I believe @RussellSpitzer also had a doc update PR; I have yet to check it out.
// UNPARTITIONED UNORDERED
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY _file, _pos
// delete mode is NOT SET -> CLUSTER BY _file + empty ordering
I disabled the local sort by _file and _pos in DELETE operations as it does not help that much. If we perform a DELETE operation and shuffle the records all over the place, we will cluster them by _file before writing. In most cases, records from multiple files will end up in a single task. If we stitch together two sorted chunks into one output file, the order of that file will be broken. So what's the point of doing the sort and potentially spilling to disk? There is a very narrow use case where the old behavior could make sense: a task gets only records from a single file, and that file was properly sorted even though the sort order is not defined in the table. I don't think it is a good idea to optimize for that use case. Keep in mind this only happens if the sort order is empty; in most cases, that really means there is no reasonable sort order to preserve.
//
// PARTITIONED BY date ORDERED BY id
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY date, id
I am ditching clustering by _file in favor of clustering by partition columns for CoW operations to reduce the number of produced files. Right now, each output task may get records from various files and partitions, so we produce more files than needed. Clustering by _file was done to avoid OOM exceptions with too-large partitions, but this is no longer a problem with AQE writes in Spark 3.4.
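To make the dependency concrete, here is a minimal sketch of the Spark 3.4 session setup this relies on; the advisory partition size value is illustrative:

```java
import org.apache.spark.sql.SparkSession;

public class AqeWriteExample {
  public static void main(String[] args) {
    // With AQE enabled (the default since Spark 3.2), oversized write tasks are
    // split at runtime, so clustering CoW output by partition cannot blow up a task.
    SparkSession spark =
        SparkSession.builder()
            .master("local[*]")
            .config("spark.sql.adaptive.enabled", "true")
            .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
            .getOrCreate();
    spark.stop();
  }
}
```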
private static final SortOrder[] POSITION_DELETE_ORDERING =
    orderBy(SPEC_ID, PARTITION, FILE_PATH, ROW_POSITION);

private SparkWriteUtil() {}
This utility would also hold the logic to compute the advisory partition size. It is similar to the old distribution and ordering utility but uses a shorter name/syntax and builds SparkWriteRequirements.
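A rough sketch of the shape (names and signatures here are illustrative, not the actual SparkWriteUtil API): static helpers map a distribution mode onto the Spark connector's distribution types, which are then bundled with an ordering into SparkWriteRequirements:

```java
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.SortOrder;

final class WriteUtilSketch {
  private WriteUtilSketch() {}

  // Maps a distribution mode name to a Spark connector Distribution.
  static Distribution distribution(String mode, Expression[] clustering, SortOrder[] ordering) {
    switch (mode) {
      case "none":
        return Distributions.unspecified();
      case "hash":
        return Distributions.clustered(clustering);
      case "range":
        return Distributions.ordered(ordering);
      default:
        throw new IllegalArgumentException("Unknown distribution mode: " + mode);
    }
  }
}
```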
}

public SparkWriteRequirements positionDeltaRequirements(Command command) {
  return SparkWriteUtil.positionDeltaRequirements(
Can't skip distribution and ordering because the spec requires position deletes to be sorted.
Let's make sure to follow up on this in the spec. I think the sorting requirement is a little odd, but if we switch to another layout, like a delete bitmap per file, this won't really matter.
I have a follow-up PR that adds a fanout position delete writer. It buffers deletes into a bitmap and then produces a sorted file when closed. In the future, we will either remove the requirement for position deletes to be sorted or add support for Puffin delete files that would persist bitmaps. I am already looking into that.
That's beyond this release, though.
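A rough sketch of the buffering idea, not the actual Iceberg writer (a sorted set stands in for the bitmap): deletes arrive in any order, are buffered per data file, and are emitted in sorted order on close, which satisfies the spec's sort requirement without an upstream sort:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class BufferedPositionDeletes {
  // data file path -> sorted set of deleted row positions
  private final Map<String, TreeSet<Long>> buffer = new TreeMap<>();

  void delete(String path, long pos) {
    buffer.computeIfAbsent(path, ignored -> new TreeSet<>()).add(pos);
  }

  void close() {
    // emit (path, pos) pairs in sorted order, as the spec requires
    buffer.forEach((path, positions) -> positions.forEach(pos -> write(path, pos)));
  }

  private void write(String path, long pos) {
    System.out.printf("DELETE %s @ %d%n", path, pos);
  }
}
```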
    Context context) {
  this.delegate =
      new BasePositionDeltaWriter<>(
          newInsertWriter(table, writerFactory, dataFileFactory, context),
Using one data writer now.
// _file, _pos
// UNPARTITIONED (ORDERED & UNORDERED)
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
Adding _file to the clustering for unpartitioned tables: _spec_id and _partition will be the same for all rows, so without _file, deletes for the same data file may end up in multiple delete files.
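Expressed with Spark's connector expression API (illustrative only; the PR builds these arrays through its own helpers), the resulting clustering looks like:

```java
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;

final class DeleteClusteringSketch {
  private DeleteClusteringSketch() {}

  // For unpartitioned tables, _file is the only component that actually varies,
  // and it keeps all deletes for one data file in a single task.
  static Distribution deleteClustering() {
    Expression[] clustering =
        new Expression[] {
          Expressions.column("_spec_id"),
          Expressions.column("_partition"),
          Expressions.column("_file")
        };
    return Distributions.clustered(clustering);
  }
}
```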
// _file, _pos
// UNPARTITIONED UNORDERED
// -------------------------------------------------------------------------
// update mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
Same as DELETE.
// LOCALLY ORDERED BY _spec_id, _partition, _file, _pos, date, days(ts)
// merge mode is HASH -> CLUSTER BY _spec_id, _partition, date, days(ts) +
// LOCALLY ORDER BY _spec_id, _partition, _file, _pos, date, days(ts)
// merge mode is RANGE -> RANGE DISTRIBUTE BY _spec_id, _partition, _file, date, days(ts) +
This may be debatable. We initially included _file to split the deletes if there were too many, but AQE should be a better option: we can be sure we won't split deletes per partition unless needed.
private final Distribution distribution;
private final SortOrder[] ordering;

public SparkWriteRequirements(Distribution distribution, SortOrder[] ordering) {
Does this need to be public or can it be package-private?
Good idea, I'll change. The class must stay public but I can make the constructor package-private.
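A simplified sketch of the agreed visibility (field names taken from the diff above; the accessors are assumed): the type stays public for Spark-facing code, while construction is reachable only from the write utility in the same package:

```java
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.expressions.SortOrder;

public class SparkWriteRequirements {
  private final Distribution distribution;
  private final SortOrder[] ordering;

  // package-private: only SparkWriteUtil (same package) builds instances
  SparkWriteRequirements(Distribution distribution, SortOrder[] ordering) {
    this.distribution = distribution;
    this.ordering = ordering;
  }

  public Distribution distribution() {
    return distribution;
  }

  public SortOrder[] ordering() {
    return ordering;
  }
}
```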
public void update(InternalRow meta, InternalRow id, InternalRow row) throws IOException {
  delete(meta, id);
  delegate.update(row, dataSpec, null);
  throw new UnsupportedOperationException("Update must be represented as delete and insert");
This makes me happy that we actually write updates as separate deletes and inserts.
Makes clustering easier!
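A toy illustration of the two representations (the interface here is made up for the example, not Iceberg's writer API): the old path translated an update inside the writer, while the new plan emits separate delete and insert rows, so the writer can reject update outright:

```java
interface DeltaSketch<T> {
  void delete(T id);

  void insert(T row);

  // Old approach: the writer split updates itself.
  default void updateAsDeleteInsert(T id, T row) {
    delete(id);
    insert(row);
  }

  // New approach: the rewritten plan never calls update.
  default void update(T id, T row) {
    throw new UnsupportedOperationException("Update must be represented as delete and insert");
  }
}
```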
} else {
  return new ClusteredDataWriter<>(writerFactory, fileFactory, table.io(), targetFileSize);
}
private PartitioningWriter<InternalRow, DataWriteResult> newDataWriter(
Can you help me understand why this no longer supports the explicit fanout writer option? I think it's because we are always adding a shuffle after the merge or update anyway, so it never matters?
We always add a local sort as it is required to write out position deletes. That's why we always add a sort by partition columns if the table is partitioned. Therefore, we should be able to always use a clustered writer.
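For contrast, a simplified version of the choice being removed (using Iceberg's writer classes; the boolean flag is illustrative): since position deltas now always arrive locally sorted by partition, the clustered branch is always safe:

```java
import org.apache.iceberg.io.ClusteredDataWriter;
import org.apache.iceberg.io.DataWriteResult;
import org.apache.iceberg.io.FanoutDataWriter;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.FileWriterFactory;
import org.apache.iceberg.io.OutputFileFactory;
import org.apache.iceberg.io.PartitioningWriter;
import org.apache.spark.sql.catalyst.InternalRow;

final class WriterChoiceSketch {
  private WriterChoiceSketch() {}

  static PartitioningWriter<InternalRow, DataWriteResult> newDataWriter(
      FileWriterFactory<InternalRow> writerFactory,
      OutputFileFactory fileFactory,
      FileIO io,
      long targetFileSize,
      boolean inputOrderedByPartition) {
    if (inputOrderedByPartition) {
      // rows for each partition arrive together, so one open file at a time suffices
      return new ClusteredDataWriter<>(writerFactory, fileFactory, io, targetFileSize);
    } else {
      // unordered input: keep a file open per seen partition
      return new FanoutDataWriter<>(writerFactory, fileFactory, io, targetFileSize);
    }
  }
}
```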
Overall this looks good to me. Thanks for taking the time to work through all the cases with AQE.
amogh-jahagirdar left a comment
Overall looks good to me, just a comment on documentation. Thanks @aokolnychyi !
.booleanConf()
.option(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING)
.defaultValue(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING_DEFAULT)
.negate()
This makes me slightly apprehensive, since I don't like flipping the sign in the builder. I do think it's cleaner than just adding a ! to the return statement, though, so +1.
I debated it, but having the negation in the main logic bothered me more :)
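A toy sketch of the builder trick under discussion (not Iceberg's actual conf classes): negate() flips the parsed value, so the call site reads a positively named option without a ! in the main logic:

```java
final class BoolConfSketch {
  private boolean defaultValue;
  private boolean negate = false;

  BoolConfSketch defaultValue(boolean value) {
    this.defaultValue = value;
    return this;
  }

  // flips the parsed result, e.g. "use distribution and ordering" -> "skip it"
  BoolConfSketch negate() {
    this.negate = true;
    return this;
  }

  boolean parse(String value) {
    boolean result = value != null ? Boolean.parseBoolean(value) : defaultValue;
    return negate ? !result : result;
  }
}
```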
private static final NamedReference ROW_POSITION = ref(MetadataColumns.ROW_POSITION);

private static final Expression[] FILE_CLUSTERING = clusterBy(FILE_PATH);
private static final Expression[] SPEC_ID_PARTITION_CLUSTERING = clusterBy(SPEC_ID, PARTITION);
Nit: could just call this PARTITION_CLUSTERING; at least in my head, a partition is unique only in combination with the spec ID.
Makes sense, will change.
Changed.
case HASH:
  if (table.spec().isUnpartitioned()) {
    Expression[] clustering = concat(SPEC_ID_PARTITION_FILE_CLUSTERING, clustering(table));
    return Distributions.clustered(clustering);
Nit: can pull the return out of this branch statement; that way, we wouldn't make two identical clustering variables in different scopes.
Looking at the later code, I realize you did this to avoid exceeding the line length.
After renaming the variable above as you suggested, it fits on one line now. I've inlined this call too.
case RANGE:
  if (table.spec().isUnpartitioned()) {
    SortOrder[] ordering = concat(SPEC_ID_PARTITION_FILE_ORDERING, ordering(table));
    return Distributions.ordered(ordering);
Same comment about the return statements.
private static final NamedReference ROW_POSITION = ref(MetadataColumns.ROW_POSITION);

private static final Expression[] FILE_CLUSTERING = clusterBy(FILE_PATH);
private static final Expression[] SPEC_ID_PARTITION_CLUSTERING = clusterBy(SPEC_ID, PARTITION);
Could use your concat function here?
I tried it locally, but it was not very obvious for the ordering below. It is a bit more explicit to include all columns one by one.
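For reference, the concat helper mentioned here is presumably something like the following (assumed shape; the PR's actual helper may differ):

```java
import java.util.Arrays;

final class ArrayConcat {
  private ArrayConcat() {}

  // Joins two typed arrays, preserving order: first's elements, then second's.
  static <T> T[] concat(T[] first, T[] second) {
    T[] result = Arrays.copyOf(first, first.length + second.length);
    System.arraycopy(second, 0, result, first.length, second.length);
    return result;
  }
}
```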
Thanks for reviewing, @amogh-jahagirdar @rdblue @RussellSpitzer!
This fixes #7633.