Spark 3.5: Extend action for rewriting manifests to support deletes #9020
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

public abstract class SparkContentFile<F> implements ContentFile<F> {
This is rendered as a new file but it is mostly a copy of what we had in SparkDataFile before.
  }
}

private abstract static class WriteManifests<F extends ContentFile<F>>
I follow the same pattern we have in other actions: create a class that implements a transform in Spark.
@aokolnychyi: Can we please add this to PR description? Fixes: #6375

Oh, I did not know that one existed.
Already on it.
Iterables.addAll(rewrittenManifests, deletesResult.rewrittenManifests());
Iterables.addAll(addedManifests, deletesResult.addedManifests());

if (rewrittenManifests.isEmpty() && addedManifests.isEmpty()) {
Functionally it is fine. It checks whether calling rewriteManifests for both data and deletes returned EMPTY_RESULT. If so, the method returns an empty result.
But isn't checking just rewrittenManifests.isEmpty() enough (instead of both)? How can there be a newly added manifest when rewrittenManifests is empty?
I can check if both results are EMPTY_RESULT and return based on that.
Do you think that will be more readable?
I think checking just rewrittenManifests.isEmpty() is enough here?
Changed.
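For readers following this thread, the simplified check can be sketched roughly as below. This is an illustration only, not the code in the PR: `RewriteResult` and `combine` are hypothetical stand-ins, and manifests are modeled as plain strings. It relies on the assumption raised in the review that no manifest can be added unless at least one was rewritten.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the action result; manifests are modeled as plain strings.
final class RewriteResult {
  static final RewriteResult EMPTY =
      new RewriteResult(Collections.<String>emptyList(), Collections.<String>emptyList());

  final List<String> rewrittenManifests;
  final List<String> addedManifests;

  RewriteResult(List<String> rewrittenManifests, List<String> addedManifests) {
    this.rewrittenManifests = rewrittenManifests;
    this.addedManifests = addedManifests;
  }

  // Merges the per-content results (data and deletes) into a single result.
  static RewriteResult combine(RewriteResult dataResult, RewriteResult deletesResult) {
    List<String> rewritten = new ArrayList<>(dataResult.rewrittenManifests);
    rewritten.addAll(deletesResult.rewrittenManifests);

    List<String> added = new ArrayList<>(dataResult.addedManifests);
    added.addAll(deletesResult.addedManifests);

    // Assumption from the review: nothing can be added unless something was
    // rewritten, so the rewritten list alone decides whether the result is empty.
    if (rewritten.isEmpty()) {
      return EMPTY;
    }
    return new RewriteResult(rewritten, added);
  }
}
```

If that assumption did not hold, the check would need to look at both lists, as the original condition did.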
- return currentSnapshot.dataManifests(table.io()).stream()
+ return currentSnapshot.allManifests(table.io()).stream()
+     .filter(manifest -> manifest.content() == content)
nit: can we combine this filter condition with the one below?
I kept it separate to avoid Spotless formatting the closure in a weird way. I can define a method and call it here. These filters are chained together prior to execution.
I restructured it a bit, no longer applies.
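The point about chained filters can be illustrated with a small, self-contained sketch. Java streams evaluate intermediate operations lazily, so two `filter` calls and one combined predicate select the same elements. The `Manifest` class and the spec ID condition here are hypothetical stand-ins; the actual second condition in the PR is not shown in this thread.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ManifestFilters {
  // Hypothetical stand-in for a manifest: a content type plus a partition spec ID.
  static final class Manifest {
    final String content; // "DATA" or "DELETES"
    final int specId;

    Manifest(String content, int specId) {
      this.content = content;
      this.specId = specId;
    }
  }

  // Two separate filter calls; nothing runs until collect() is invoked,
  // and each element then flows through both predicates in turn.
  static List<Manifest> separateFilters(List<Manifest> manifests, String content, int specId) {
    return manifests.stream()
        .filter(m -> m.content.equals(content))
        .filter(m -> m.specId == specId)
        .collect(Collectors.toList());
  }

  // One combined predicate; selects the same elements as separateFilters.
  static List<Manifest> combinedFilter(List<Manifest> manifests, String content, int specId) {
    return manifests.stream()
        .filter(m -> m.content.equals(content) && m.specId == specId)
        .collect(Collectors.toList());
  }
}
```

Either form gives the same result, so the choice mostly comes down to how the formatter lays out the closure.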
private WriteManifests<?> newWriteManifestsFunc(ManifestContent content, StructType sparkType) {
  ManifestWriterFactory writers = manifestWriters();

  StructType sparkFileType = (StructType) sparkType.apply("data_file").dataType();
I expected a check based on the content type that would use delete_file or data_file, instead of always using data_file. But it looks like the schema always keeps it as data_file, even for delete files?
Yeah, we always use data_file struct in the manifest entry, even for deletes.
    .execute();

// the original delete manifests must be combined
assertThat(result.rewrittenManifests()).hasSize(2);
Should we also validate that these manifests are delete manifests?
Added.
}

private void checkDataFile(DataFile expected, DataFile actual) {
  Assert.assertEquals("Content must match", expected.content(), actual.content());
nit: we can have checkContentFile and extract common assertions into that
Agreed, I should have done that.
Fixed.
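The suggested refactoring might look roughly like the sketch below: the assertions shared by data and delete files move into one `checkContentFile` helper, and the type-specific checks delegate to it. `SimpleFile` and its three fields are hypothetical stand-ins rather than the Iceberg types, and the local `assertEquals` helper replaces the JUnit call only to keep the example self-contained.

```java
import java.util.Objects;

final class ContentFileChecks {
  // Hypothetical stand-in holding a few fields shared by data and delete files.
  static final class SimpleFile {
    final String content;
    final String path;
    final long recordCount;

    SimpleFile(String content, String path, long recordCount) {
      this.content = content;
      this.path = path;
      this.recordCount = recordCount;
    }
  }

  // Assertions common to every kind of content file live in one place.
  static void checkContentFile(SimpleFile expected, SimpleFile actual) {
    assertEquals("Content must match", expected.content, actual.content);
    assertEquals("Path must match", expected.path, actual.path);
    assertEquals("Record count must match", expected.recordCount, actual.recordCount);
  }

  static void checkDataFile(SimpleFile expected, SimpleFile actual) {
    checkContentFile(expected, actual);
    // data-file-specific assertions would follow here
  }

  static void checkDeleteFile(SimpleFile expected, SimpleFile actual) {
    checkContentFile(expected, actual);
    // delete-file-specific assertions would follow here
  }

  // Minimal replacement for a test framework's assertEquals.
  private static void assertEquals(String message, Object expected, Object actual) {
    if (!Objects.equals(expected, actual)) {
      throw new AssertionError(message + ": expected " + expected + " but was " + actual);
    }
  }
}
```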
ajantha-bhat left a comment
Some nits.
Overall LGTM. Thanks for fixing it.
998c52f to b20a59e
b20a59e to fe445a1
@amogh-jahagirdar @nastra @Fokko @flyrain @RussellSpitzer, could you also check this one whenever you have a minute? It would be nice to wrap up V2 table maintenance.
    .execute();

// the original delete manifests must be combined
assertThat(result.rewrittenManifests()).hasSize(2);
nit: you can usually combine those checks into a single chained assertion:
    assertThat(result.rewrittenManifests())
        .hasSize(2)
        .allMatch(m -> m.content() == ManifestContent.DELETES);
    assertThat(result.addedManifests())
        .hasSize(1)
        .allMatch(m -> m.content() == ManifestContent.DELETES);
Thank you for reviewing, @ajantha-bhat @nastra!
This PR extends our action for rewriting manifests to support deletes.
This fixes #6375.