Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

This PR adds support for fanout writers in SparkPositionDeltaWrite.

@github-actions github-actions bot added the spark label May 25, 2023
@aokolnychyi
Copy link
Contributor Author

}

public SparkWriteRequirements positionDeltaRequirements(Command command) {
if (ignoreTableDistributionAndOrdering()) {
Copy link
Contributor Author

@aokolnychyi aokolnychyi May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches CoW and insert requirements now (defined above).

}

/** Builds requirements for merge-on-read DELETE, UPDATE, MERGE operations. */
public static SparkWriteRequirements positionDeltaRequirements(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a note at the top that this class is for internal purposes and has no API guarantees.

long targetFileSize = context.targetDataFileSize();

if (fanoutEnabled && !inputOrdered) {
return new FanoutDataWriter<>(writers, files, io, targetFileSize);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use the fanout writer if input is ordered. The local sort always include partition columns.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me


// the spec requires position deletes to be ordered by file and position
// use a fanout writer if the input is unordered no matter whether fanout writers are enabled
// clustered writers only validate records for the same spec/paritition are co-located,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: clustered writers only validate records for the same spec/paritition are co-located which is not enough for position deletes => clustered writers assume that the position deletes are already ordered by file and position?

(the validation seems a bit orthogonal to previous sentence. I dont mind putting it in a as well somewhere, but doesnt seem its in the other comment in newDataWriter, so hence thinking its better to omit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let me update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
// LOCALLY ORDERED BY _spec_id, _partition, _file, _pos
// delete mode is NOT SET (fanout) -> CLUSTER BY _spec_id, _partition, _file + empty ordering
Copy link
Member

@szehon-ho szehon-ho May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably annoying to add another example with ORDERED BY, but is it still correct with "ORDERED" above? Is empty_ordering not correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a table sort order, we still respect it by default and fanout writer config has no effect.

Copy link
Member

@szehon-ho szehon-ho May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I meant , in this comment as it is ORDERED & UNORDERED, fanout should not be 'empty ordering', but rather whatever the order column was?

Not sure if I'm missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is DELETE. We only have positions, no data, so the order does not matter.

Copy link
Member

@szehon-ho szehon-ho May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. ok. Well i guess it matters in the end, but the fanoutWriter here will sort it

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks @aokolnychyi

@aokolnychyi aokolnychyi merged commit b430775 into apache:master May 26, 2023
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @szehon-ho @amogh-jahagirdar!

rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants