Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

This PR fixes skipped file counts in ManifestReader with deleted entries as discussed here.

@github-actions github-actions bot added the core label Aug 30, 2023
}

private CloseableIterable<ManifestEntry<F>> entries(boolean onlyLive) {
if (hasRowFilter() || hasPartitionFilter() || partitionSet != null) {
Copy link
Contributor Author

@aokolnychyi aokolnychyi Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add hasRowFilter() and hasPartitionFilter() to avoid cyclomatic complexity warnings. I also think it makes it more readable.

? scanMetrics.skippedDataFiles()
: scanMetrics.skippedDeleteFiles(),
open(projection(fileSchema, fileProjection, projectColumns, caseSensitive)),
onlyLive ? filterLiveEntries(entries) : entries,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deleted entries must be discarded prior to evaluating stats primarily to have correct skipped counts. It should also help the performance.

tableDir, tableName, SCHEMA, SPEC, SortOrder.unsorted(), formatVersion, reporter);
table.newAppend().appendFile(FILE_A).appendFile(FILE_D).commit();
table.newAppend().appendFile(FILE_B).appendFile(FILE_C).commit();
table.newAppend().appendFile(FILE_A).appendFile(FILE_B).appendFile(FILE_D).commit();
Copy link
Contributor Author

@aokolnychyi aokolnychyi Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to add FILE_B here so that we have a manifest with one skipped data file as the original manifest will be rewritten with FILE_A as deleted during the overwrite below.

assertThat(result.scannedDataManifests().value()).isEqualTo(1);
assertThat(result.scannedDeleteManifests().value()).isEqualTo(0);
assertThat(result.skippedDataManifests().value()).isEqualTo(1);
assertThat(result.skippedDataManifests().value()).isEqualTo(2);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overwrite rewrites the original manifest and adds a new one for the new data file.

@aokolnychyi
Copy link
Contributor Author

cc @rdblue @nastra

}

private CloseableIterable<ManifestEntry<F>> entries(boolean onlyLive) {
if (hasRowFilter() || hasPartitionFilter() || partitionSet != null) {
Copy link
Contributor

@jerqi jerqi Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to consider the condition?

partitionSet.isEmpty()

If the partitionSet is empty, we seems to filter out all entries. Maybe we can return empty entries directly when the partitionSet is empty. I'm not sure whether we have guaranteed the partitionSet is not empty. Correct me if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe an empty partitionSet indicates there is no match and acts similar to alwaysFalse. We could add a branch for cases when the filters are always false or the matching partition set is empty, but we would have to analyze the usages first to make sure those are possible scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

return entries(true /* only live entries */);
}

private CloseableIterable<ManifestEntry<F>> filterLiveEntries(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: these last two methods could be static.

Copy link
Contributor Author

@aokolnychyi aokolnychyi Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way so that I don't have to add F extends ContentFile<F> to these static methods and reuse the bound already defined in the class instead. I could use ? in the second one but not here.

@aokolnychyi aokolnychyi merged commit 5996b3d into apache:master Sep 1, 2023
@aokolnychyi
Copy link
Contributor Author

Thanks, @rdblue @jerqi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants