
Conversation

@aokolnychyi
Contributor

@aokolnychyi aokolnychyi commented May 18, 2023

This fixes #7633.

import org.apache.spark.sql.connector.expressions.SortOrder;

/** A set of requirements such as distribution and ordering reported to Spark during writes. */
public class SparkWriteRequirements {
Contributor Author

In the future, this class would also hold the advisory partition size.

*
* <p>Note it is an evolving internal API that is subject to change even in minor releases.
*/
public class SparkWriteUtil {
Contributor Author

I'd recommend checking the tests first to see what changed.

// write mode is NOT SET -> CLUSTER BY date, days(ts) + LOCALLY ORDER BY date, days(ts)
// write mode is NOT SET (fanout) -> CLUSTER BY date, days(ts) + empty ordering
// write mode is NONE -> unspecified distribution + LOCALLY ORDERED BY date, days(ts)
// write mode is NONE (fanout) -> unspecified distribution + empty ordering
Contributor Author
@aokolnychyi commented May 19, 2023

I disabled the local sort by partition columns in regular writes and in CoW operations if fanout writers are enabled. If the table sort order is undefined, there is no need to sort records by partition columns when we are not using clustered writers.
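The mode matrix quoted above can be sketched as a small decision function. This is my own illustrative Java (the class, enum, and summary strings are mine, not Iceberg's API): the write mode picks the distribution, and the fanout flag decides whether the local sort by partition columns is kept.

```java
// Illustrative sketch of the distribution/ordering matrix for a table
// partitioned by date, days(ts). Names are hypothetical, not Iceberg's.
public class WriteRequirementsSketch {

  enum Mode { NOT_SET, NONE }

  // Returns a human-readable summary of the chosen requirements.
  static String requirements(Mode mode, boolean fanout) {
    String clustering =
        (mode == Mode.NONE) ? "unspecified distribution" : "CLUSTER BY date, days(ts)";
    // With fanout writers, records need not arrive grouped by partition,
    // so the local sort by partition columns can be dropped.
    String ordering = fanout ? "empty ordering" : "LOCALLY ORDER BY date, days(ts)";
    return clustering + " + " + ordering;
  }

  public static void main(String[] args) {
    System.out.println(requirements(Mode.NOT_SET, false));
    System.out.println(requirements(Mode.NOT_SET, true));
    System.out.println(requirements(Mode.NONE, false));
    System.out.println(requirements(Mode.NONE, true));
  }
}
```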

Contributor

Curious, would it be useful to have all these distribution and ordering examples accessible in our public docs?

Contributor

@amogh-jahagirdar, yes. It would be good to have these documented.

Contributor Author

Will add. I believe @RussellSpitzer also had a doc update PR; I have yet to check it out.

// UNPARTITIONED UNORDERED
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY _file, _pos
// delete mode is NOT SET -> CLUSTER BY _file + empty ordering
Contributor Author
@aokolnychyi commented May 19, 2023

I disabled the local sort by _file and _pos in DELETE operations as it does not help much. If we perform a DELETE operation and shuffle the records all over the place, we cluster them by _file before writing. In most cases, records from multiple files will end up in a single task. If we stitch together two sorted chunks into one output file, the order of that file will be broken. So what's the point of doing the sort and potentially spilling to disk? There is a very narrow use case where the old behavior could make sense: a task gets records from only a single file, and that file was properly sorted even though no sort order is defined on the table. I don't think it is a good idea to optimize for that use case. Keep in mind it only happens if the sort order is empty; in most cases, that really means there is no reasonable sort order to preserve.
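A minimal, self-contained illustration of the stitching argument above (the names and numbers are mine, not from the PR): two individually sorted chunks concatenated into one output file are no longer sorted, so the per-chunk local sort bought nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StitchedChunks {

  // True if rows are in non-decreasing order.
  static boolean isSorted(List<Integer> rows) {
    for (int i = 1; i < rows.size(); i++) {
      if (rows.get(i - 1) > rows.get(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Integer> chunkA = Arrays.asList(1, 5, 9);   // sorted rows from file A
    List<Integer> chunkB = Arrays.asList(2, 6, 10);  // sorted rows from file B
    List<Integer> outputFile = new ArrayList<>(chunkA);
    outputFile.addAll(chunkB); // one task writes both chunks into one file
    System.out.println(isSorted(chunkA));      // true
    System.out.println(isSorted(chunkB));      // true
    System.out.println(isSorted(outputFile));  // false: 9 precedes 2
  }
}
```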

//
// PARTITIONED BY date ORDERED BY id
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _file + LOCALLY ORDER BY date, id
Contributor Author

I am ditching clustering by _file in favor of clustering by partition columns for CoW operations to reduce the number of produced files. Right now, each output task may get records from various files and partitions; hence, we produce more files than needed. Clustering by _file was originally done to avoid OOM exceptions with too large partitions, which is no longer a problem with AQE writes in Spark 3.4.
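A back-of-the-envelope sketch of the file-count argument above (illustrative numbers and a hypothetical helper of my own, not Iceberg code): each distinct (task, partition) pair produces at least one output file, and clustering by partition columns caps the count at the number of partitions.

```java
import java.util.HashSet;
import java.util.Set;

public class FileCountSketch {

  // Counts output files as distinct (task, partition) pairs for a toy
  // workload where record r belongs to partition r % partitions.
  static int outputFiles(int records, int tasks, int partitions, boolean clusterByPartition) {
    Set<String> taskPartitionPairs = new HashSet<>();
    for (int r = 0; r < records; r++) {
      int partition = r % partitions;
      // Without clustering, records land on effectively arbitrary tasks;
      // with clustering, the task is a function of the partition alone.
      int task = clusterByPartition ? partition % tasks : r % tasks;
      taskPartitionPairs.add(task + ":" + partition);
    }
    return taskPartitionPairs.size();
  }

  public static void main(String[] args) {
    System.out.println(outputFiles(1000, 8, 5, false)); // tasks * partitions = 40 files
    System.out.println(outputFiles(1000, 8, 5, true));  // partitions = 5 files
  }
}
```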

private static final SortOrder[] POSITION_DELETE_ORDERING =
    orderBy(SPEC_ID, PARTITION, FILE_PATH, ROW_POSITION);

private SparkWriteUtil() {}
Contributor Author
@aokolnychyi commented May 19, 2023

This utility would also hold the logic to compute the advisory partition size. It is similar to the old distribution and ordering utility but uses a shorter name/syntax and builds SparkWriteRequirements.

}

public SparkWriteRequirements positionDeltaRequirements(Command command) {
return SparkWriteUtil.positionDeltaRequirements(
Contributor Author

Can't skip distribution and ordering because the spec requires position deletes to be sorted.

Member

Let's make sure to follow up on this in the spec. The sorting requirement is, I think, a little odd, but if we switch to another layout, like a delete bitmap per file, this won't really matter.

Contributor Author
@aokolnychyi commented May 22, 2023

I have a follow-up PR that adds a fanout position delete writer. It buffers deletes into a bitmap and then produces a sorted file when closed. In the future, we will either remove the requirement for position deletes to be sorted or add support for Puffin delete files that would persist bitmaps. I am already looking into that.
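A hedged sketch of the buffering idea described above, using my own stand-in code (this is not the actual Iceberg writer, which would use bitmaps such as roaring bitmaps rather than a `TreeSet`): deletes arrive in any order, are buffered per data file, and come out sorted by (file, pos) on close, as the spec requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class FanoutPositionDeleteSketch {

  // file path -> sorted set of deleted row positions (bitmap stand-in)
  private final TreeMap<String, TreeSet<Long>> buffer = new TreeMap<>();

  // Accepts deletes in any order, so no upstream sort is required.
  void delete(String file, long pos) {
    buffer.computeIfAbsent(file, f -> new TreeSet<>()).add(pos);
  }

  // On close, emits delete records sorted by (file, pos).
  List<String> close() {
    List<String> sortedDeletes = new ArrayList<>();
    for (Map.Entry<String, TreeSet<Long>> entry : buffer.entrySet()) {
      for (long pos : entry.getValue()) {
        sortedDeletes.add(entry.getKey() + ":" + pos);
      }
    }
    return sortedDeletes;
  }

  public static void main(String[] args) {
    FanoutPositionDeleteSketch writer = new FanoutPositionDeleteSketch();
    writer.delete("b.parquet", 7);
    writer.delete("a.parquet", 3);
    writer.delete("a.parquet", 1);
    System.out.println(writer.close()); // [a.parquet:1, a.parquet:3, b.parquet:7]
  }
}
```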

Contributor Author

That's beyond this release, though.

Context context) {
this.delegate =
new BasePositionDeltaWriter<>(
newInsertWriter(table, writerFactory, dataFileFactory, context),
Contributor Author

Using one data writer now.

@aokolnychyi aokolnychyi force-pushed the distribution-and-ordering-impr branch from 7ff85e6 to 15a08b3 Compare May 19, 2023 21:51
// _file, _pos
// UNPARTITIONED (ORDERED & UNORDERED)
// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
Contributor Author

Adding _file to the clustering for unpartitioned tables: _spec_id and _partition will be the same for all rows, so deletes for the same data file may otherwise end up in multiple delete files.

// _file, _pos
// UNPARTITIONED UNORDERED
// -------------------------------------------------------------------------
// update mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
Contributor Author

Same as DELETE.

// LOCALLY ORDERED BY _spec_id, _partition, _file, _pos, date, days(ts)
// merge mode is HASH -> CLUSTER BY _spec_id, _partition, date, days(ts) +
// LOCALLY ORDER BY _spec_id, _partition, _file, _pos, date, days(ts)
// merge mode is RANGE -> RANGE DISTRIBUTE BY _spec_id, _partition, _file, date, days(ts) +
Contributor Author

This may be debatable. We initially included _file to split the deletes if there were too many of them, but AQE should be a better option: we can be sure we won't split deletes per partition unless needed.

@aokolnychyi aokolnychyi added this to the Iceberg 1.3.0 milestone May 19, 2023
private final Distribution distribution;
private final SortOrder[] ordering;

public SparkWriteRequirements(Distribution distribution, SortOrder[] ordering) {
Contributor

Does this need to be public or can it be package-private?

Contributor Author
@aokolnychyi commented May 22, 2023

Good idea, I'll change. The class must stay public, but I can make the constructor package-private.

public void update(InternalRow meta, InternalRow id, InternalRow row) throws IOException {
delete(meta, id);
delegate.update(row, dataSpec, null);
throw new UnsupportedOperationException("Update must be represented as delete and insert");
Contributor

This makes me happy that we actually write updates as separate deletes and inserts.

Contributor Author

Makes clustering easier!

} else {
return new ClusteredDataWriter<>(writerFactory, fileFactory, table.io(), targetFileSize);
}
private PartitioningWriter<InternalRow, DataWriteResult> newDataWriter(
Contributor

Can you help me understand why this no longer supports the explicit fanout writer option?

I think it's because we are always adding a shuffle after the merge or update anyway so it never matters?

Contributor Author
@aokolnychyi commented May 22, 2023

We always add a local sort as it is required to write out position deletes. That's why we always add a sort by partition columns if the table is partitioned. Therefore, we should be able to always use a clustered writer.

@rdblue
Contributor

rdblue commented May 21, 2023

Overall this looks good to me. Thanks for taking the time to work through all the cases with AQE.

Contributor
@amogh-jahagirdar left a comment

Overall looks good to me, just a comment on documentation. Thanks @aokolnychyi !

.booleanConf()
.option(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING)
.defaultValue(SparkWriteOptions.USE_TABLE_DISTRIBUTION_AND_ORDERING_DEFAULT)
.negate()
Member

This makes me slightly apprehensive, since I don't like flipping the sign in the builder. I do think it's cleaner than just adding a ! to the return statement, though, so +1.

Contributor Author

I debated it but having negation in the main logic bothered me more :)
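The negate() pattern under discussion can be sketched with a hypothetical minimal conf builder (my own stand-in, not Iceberg's actual SparkConfParser API): negate() flips the parsed value, so a positively-named option can back a negatively-phrased flag without a `!` at the call site.

```java
public class BooleanConfSketch {

  // Minimal fluent boolean-conf builder, hypothetical names throughout.
  static class BooleanConf {
    private Boolean optionValue;   // simulated write-option lookup result
    private boolean defaultValue;
    private boolean negate = false;

    BooleanConf option(Boolean value) { this.optionValue = value; return this; }
    BooleanConf defaultValue(boolean value) { this.defaultValue = value; return this; }
    BooleanConf negate() { this.negate = true; return this; }

    // Resolves the option (falling back to the default), then applies negation.
    boolean parse() {
      boolean value = optionValue != null ? optionValue : defaultValue;
      return negate ? !value : value;
    }
  }

  public static void main(String[] args) {
    // "use table distribution and ordering" = true
    // -> "ignore table distribution and ordering" = false
    boolean ignoreTableOrdering =
        new BooleanConf().option(true).defaultValue(true).negate().parse();
    System.out.println(ignoreTableOrdering); // false
  }
}
```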

private static final NamedReference ROW_POSITION = ref(MetadataColumns.ROW_POSITION);

private static final Expression[] FILE_CLUSTERING = clusterBy(FILE_PATH);
private static final Expression[] SPEC_ID_PARTITION_CLUSTERING = clusterBy(SPEC_ID, PARTITION);
Member

Nit: could just call this PARTITION_CLUSTERING; at least in my head, a partition is unique only in combination with the spec ID.

Contributor Author

Makes sense, will change.

Contributor Author

Changed.

case HASH:
  if (table.spec().isUnpartitioned()) {
    Expression[] clustering = concat(SPEC_ID_PARTITION_FILE_CLUSTERING, clustering(table));
    return Distributions.clustered(clustering);
Member

Nit: can pull the return out of this branch statement; that way we wouldn't make two identical clustering variables in different scopes.

Member

I realize, looking at later code, that you did this to avoid exceeding the line length.

Contributor Author
@aokolnychyi commented May 22, 2023

After I renamed the variable above like you suggested, it fits on one line now. I've inlined this call too.

case RANGE:
  if (table.spec().isUnpartitioned()) {
    SortOrder[] ordering = concat(SPEC_ID_PARTITION_FILE_ORDERING, ordering(table));
    return Distributions.ordered(ordering);
Member

same comment about return statements

private static final NamedReference ROW_POSITION = ref(MetadataColumns.ROW_POSITION);

private static final Expression[] FILE_CLUSTERING = clusterBy(FILE_PATH);
private static final Expression[] SPEC_ID_PARTITION_CLUSTERING = clusterBy(SPEC_ID, PARTITION);
Member

Could you use your concat function here?

Contributor Author

I tried it locally, but it was not very obvious for the ordering below. It is a bit more explicit to include all columns one by one.

@aokolnychyi aokolnychyi force-pushed the distribution-and-ordering-impr branch from f39a850 to e5c133e Compare May 22, 2023 22:55
@aokolnychyi aokolnychyi merged commit 2502a23 into apache:master May 23, 2023
@aokolnychyi
Contributor Author

Thanks for reviewing, @amogh-jahagirdar @rdblue @RussellSpitzer!

Successfully merging this pull request may close these issues.

Improve distribution and ordering in Spark 3.4
