Conversation

@aokolnychyi
Contributor

@aokolnychyi aokolnychyi commented Nov 9, 2023

This PR extends our action for rewriting manifests to support delete manifests.

This fixes #6375.

@github-actions github-actions bot added the spark label Nov 9, 2023
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

public abstract class SparkContentFile<F> implements ContentFile<F> {
Contributor Author

This is rendered as a new file but it is mostly a copy of what we had in SparkDataFile before.

}
}

private abstract static class WriteManifests<F extends ContentFile<F>>
Contributor Author

@aokolnychyi aokolnychyi Nov 9, 2023

I follow the same pattern we have in other actions: create a class that implements a transform in Spark.

@ajantha-bhat
Member

@aokolnychyi: Can we please add this to PR description?

Fixes: #6375

@aokolnychyi
Contributor Author

Oh, I did not know that one existed.
@ajantha-bhat, would you have time to review and test to see if it works for your needs?

@ajantha-bhat
Member

> @ajantha-bhat, would you have time to review and test to see if it works for your needs?

Already on it.

Iterables.addAll(rewrittenManifests, deletesResult.rewrittenManifests());
Iterables.addAll(addedManifests, deletesResult.addedManifests());

if (rewrittenManifests.isEmpty() && addedManifests.isEmpty()) {
Member

@ajantha-bhat ajantha-bhat Nov 10, 2023

Functionally it is fine. It checks whether calling rewriteManifests for both data and deletes has returned EMPTY_RESULT. If so, the method returns an empty result.

But is checking just rewrittenManifests.isEmpty() enough (instead of both)? How can there be a newly added manifest when rewrittenManifests is empty?

Contributor Author

I can check if both results are EMPTY_RESULT and return based on that.
Do you think that will be more readable?

Member

I think checking just rewrittenManifests.isEmpty() is enough here.

Contributor Author

Changed.
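
For readers following the thread, here is a minimal sketch of the simplified check. It uses hypothetical stand-in types (`Result` with plain string manifests), not Iceberg's actual result or `ManifestFile` classes, and it assumes, as suggested in the comments above, that a manifest is only added when at least one manifest was rewritten.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class EmptyResultCheck {
  // Hypothetical stand-in for the action result; manifests are plain strings here.
  static final class Result {
    final List<String> rewrittenManifests;
    final List<String> addedManifests;

    Result(List<String> rewrittenManifests, List<String> addedManifests) {
      this.rewrittenManifests = rewrittenManifests;
      this.addedManifests = addedManifests;
    }
  }

  static final Result EMPTY_RESULT =
      new Result(Collections.<String>emptyList(), Collections.<String>emptyList());

  // Merges the data and delete results into one result.
  static Result combine(Result dataResult, Result deletesResult) {
    List<String> rewritten = new ArrayList<>(dataResult.rewrittenManifests);
    rewritten.addAll(deletesResult.rewrittenManifests);

    List<String> added = new ArrayList<>(dataResult.addedManifests);
    added.addAll(deletesResult.addedManifests);

    // Assumption: nothing can be added unless something was rewritten,
    // so checking the rewritten list alone is sufficient.
    if (rewritten.isEmpty()) {
      return EMPTY_RESULT;
    }

    return new Result(rewritten, added);
  }
}
```

If that assumption ever stopped holding, the check would need to consider the added manifests as well.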


- return currentSnapshot.dataManifests(table.io()).stream()
+ return currentSnapshot.allManifests(table.io()).stream()
.filter(manifest -> manifest.content() == content)
Member

nit: can we combine this filter condition with the one below?

Contributor Author

I kept it separate to avoid Spotless formatting the closure in a weird way. I can define a method and call it here. These filters are chained together prior to execution.

Contributor Author

I restructured it a bit, so this no longer applies.
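
On the point about chained filters: Java streams are lazy, so intermediate operations such as `filter` only run once a terminal operation is invoked, and each element passes through the whole chain in a single traversal. The sketch below illustrates that two chained `filter` calls select the same elements as one combined predicate. The `Manifest` class and the partition spec ID condition are hypothetical stand-ins for illustration, not the code from this PR.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ChainedFilters {
  // Hypothetical stand-in for a manifest: a content type and a partition spec ID.
  static final class Manifest {
    final String content;
    final int specId;

    Manifest(String content, int specId) {
      this.content = content;
      this.specId = specId;
    }
  }

  // Two separate filter calls, which keeps each lambda short for the formatter.
  static List<Manifest> chained(List<Manifest> manifests, String content, int specId) {
    return manifests.stream()
        .filter(m -> m.content.equals(content))
        .filter(m -> m.specId == specId)
        .collect(Collectors.toList());
  }

  // One filter call with both conditions; selects the same elements in the same order.
  static List<Manifest> combined(List<Manifest> manifests, String content, int specId) {
    return manifests.stream()
        .filter(m -> m.content.equals(content) && m.specId == specId)
        .collect(Collectors.toList());
  }
}
```

Either form traverses the source once, so the choice is mostly about readability.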

private WriteManifests<?> newWriteManifestsFunc(ManifestContent content, StructType sparkType) {
ManifestWriterFactory writers = manifestWriters();

StructType sparkFileType = (StructType) sparkType.apply("data_file").dataType();
Member

I expected a check based on the content type that picks delete_file or data_file, instead of always using data_file. But it looks like the schema always names it data_file, even for delete files?

Contributor Author

Yeah, we always use the data_file struct in the manifest entry, even for deletes.

.execute();

// the original delete manifests must be combined
assertThat(result.rewrittenManifests()).hasSize(2);
Member

Should we also validate that the content type of these manifests is deletes?

Contributor Author

Added.

}

private void checkDataFile(DataFile expected, DataFile actual) {
Assert.assertEquals("Content must match", expected.content(), actual.content());
Member

nit: we can have checkContentFile and extract common assertions into that

Contributor Author

@aokolnychyi aokolnychyi Nov 10, 2023

Agreed, I should have done that.

Contributor Author

Fixed.
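
A rough sketch of what the suggested `checkContentFile` refactoring could look like. The interfaces below are trimmed-down, hypothetical stand-ins for Iceberg's `ContentFile`, `DataFile`, and `DeleteFile`, the compared fields are chosen for illustration only, and the `check` helper stands in for the JUnit assertions used in the real test; the actual change in this PR may differ.

```java
import java.util.Objects;

final class ContentFileChecks {
  // Hypothetical, trimmed-down stand-ins for the Iceberg file interfaces.
  interface ContentFile {
    String content();
    String path();
    long recordCount();
  }

  interface DataFile extends ContentFile {}

  interface DeleteFile extends ContentFile {}

  // Simple value holder used only to exercise the checks.
  static final class TestFile implements DataFile, DeleteFile {
    private final String content;
    private final String path;
    private final long recordCount;

    TestFile(String content, String path, long recordCount) {
      this.content = content;
      this.path = path;
      this.recordCount = recordCount;
    }

    public String content() { return content; }
    public String path() { return path; }
    public long recordCount() { return recordCount; }
  }

  // Assertions shared by data and delete files live in one place.
  static void checkContentFile(ContentFile expected, ContentFile actual) {
    check("Content must match", expected.content(), actual.content());
    check("Path must match", expected.path(), actual.path());
    check("Record count must match", expected.recordCount(), actual.recordCount());
  }

  static void checkDataFile(DataFile expected, DataFile actual) {
    checkContentFile(expected, actual);
    // Data-file-specific assertions would go here.
  }

  static void checkDeleteFile(DeleteFile expected, DeleteFile actual) {
    checkContentFile(expected, actual);
    // Delete-file-specific assertions would go here.
  }

  // Stand-in for a test framework's assertEquals.
  private static void check(String message, Object expected, Object actual) {
    if (!Objects.equals(expected, actual)) {
      throw new AssertionError(message + ": expected " + expected + " but was " + actual);
    }
  }
}
```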

Member

@ajantha-bhat ajantha-bhat left a comment

Some nits.

Overall LGTM. Thanks for fixing it.

@aokolnychyi aokolnychyi force-pushed the rewrite-delete-manifests-action-prototype branch 3 times, most recently from 998c52f to b20a59e Compare November 16, 2023 19:45
@aokolnychyi aokolnychyi force-pushed the rewrite-delete-manifests-action-prototype branch from b20a59e to fe445a1 Compare November 16, 2023 19:49
@aokolnychyi
Contributor Author

@amogh-jahagirdar @nastra @Fokko @flyrain @RussellSpitzer, could you also check this one whenever you have a minute? It would be nice to wrap up V2 table maintenance.

.execute();

// the original delete manifests must be combined
assertThat(result.rewrittenManifests()).hasSize(2);
Contributor

nit: you can usually combine those checks into a single line:

    assertThat(result.rewrittenManifests())
        .hasSize(2)
        .allMatch(m -> m.content() == ManifestContent.DELETES);
    assertThat(result.addedManifests())
        .hasSize(1)
        .allMatch(m -> m.content() == ManifestContent.DELETES);

@aokolnychyi aokolnychyi merged commit e69418a into apache:main Nov 18, 2023
@aokolnychyi
Contributor Author

Thank you for reviewing, @ajantha-bhat @nastra!

Development

Successfully merging this pull request may close these issues.

Consider delete manifests for rewrite manifests

3 participants