@aokolnychyi
Contributor

This PR removes the outdated logic in DeleteFilter that determined whether we have to stream through position deletes or load them in memory. Instead, we should always rely on the bitmap-based approach like in vectorized reads.

@github-actions github-actions bot added the data label Nov 20, 2023

List<CloseableIterable<Record>> deletes = Lists.transform(posDeletes, this::openPosDeletes);

// if there are fewer deletes than a reasonable number to keep in memory, use a set
Contributor Author

@aokolnychyi aokolnychyi Nov 20, 2023

When this logic was added a few years ago, we added position deletes into a set (see the comment). We have been using bitmaps for a while now. In fact, vectorized reads always build bitmaps and have no threshold on the number of deletes. This has proven to work really well. Position deletes represented as bitmaps should always fit in memory.

Position deletes compress really well both on disk and in memory. We have seen this 100K threshold causing degradation in jobs without any good reason.
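To illustrate why a bitmap needs no size threshold, here is a minimal sketch of a bitmap-backed position index. It is not the actual Iceberg delete index: the class name `PositionDeleteSketch`, its methods, and the 16-bit chunking scheme are all hypothetical, and `java.util.BitSet` stands in for a compressed bitmap library. The point is only that each deleted position costs roughly one bit inside its chunk, whereas a `Set<Long>` stores a boxed object per position.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch for illustration; not Iceberg's real implementation.
class PositionDeleteSketch {
  // The low 16 bits of a position index into a chunk;
  // the remaining high bits select which chunk to use.
  private static final int CHUNK_BITS = 16;
  private static final long CHUNK_MASK = (1L << CHUNK_BITS) - 1;

  // One BitSet per chunk, created lazily when a position in it is deleted.
  private final Map<Long, BitSet> chunks = new HashMap<>();

  // Marks a (non-negative) row position as deleted.
  void delete(long pos) {
    chunks
        .computeIfAbsent(pos >>> CHUNK_BITS, key -> new BitSet())
        .set((int) (pos & CHUNK_MASK));
  }

  // Returns true if the row position was marked as deleted.
  boolean isDeleted(long pos) {
    BitSet chunk = chunks.get(pos >>> CHUNK_BITS);
    return chunk != null && chunk.get((int) (pos & CHUNK_MASK));
  }

  // Total number of deleted positions across all chunks.
  long cardinality() {
    long total = 0;
    for (BitSet chunk : chunks.values()) {
      total += chunk.cardinality();
    }
    return total;
  }
}
```

With a structure like this, a reader can load every position delete for a data file up front and call `isDeleted(pos)` per row, regardless of whether there are a hundred deletes or several hundred thousand.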

Contributor

Sweet, this makes sense, and I appreciate the simplification. I remember seeing this threshold in the code and wondering how we arrived at the 100000 value.

@aokolnychyi changed the title from "Core: Always use delete index for position deletes" to "Data: Always use delete index for position deletes" on Nov 21, 2023

Contributor

@nastra nastra left a comment

LGTM, thanks @aokolnychyi for improving this

Contributor

@singhpk234 singhpk234 left a comment

LGTM, Thanks @aokolnychyi !

@aokolnychyi aokolnychyi merged commit c61c3ca into apache:main Nov 21, 2023
@aokolnychyi
Contributor Author

Thank you for reviewing, @amogh-jahagirdar @nastra @singhpk234!
