Parquet: skip writing bloom filter for deletes #7617
Conversation
Force-pushed 53ee2c9 to b909176
@stevenzwu @hililiwei You guys may wanna take a look at this.

I would get @huaxingao's input on this one.

LGTM
Force-pushed b909176 to 737ad5a

I just rebased this since it had conflicts. @stevenzwu could you take another look?
```java
bloomFilterMaxBytes,
columnBloomFilterEnabled,
PARQUET_BLOOM_FILTER_MAX_BYTES_DEFAULT,
ImmutableMap.of(),
```
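The diff context above passes an empty map in place of the per-column bloom filter settings when building writer properties for delete files. A minimal, self-contained sketch of that idea (the class, enum, and method names here are hypothetical, not Iceberg's actual API):

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch of the change discussed in this PR: when building
// Parquet writer properties for a delete file, pass an empty bloom filter
// config instead of the table's per-column settings.
public class BloomFilterConfigSketch {

  // Mirrors the kinds of files an Iceberg writer can produce.
  enum FileContent { DATA, EQUALITY_DELETES, POSITION_DELETES }

  // columnBloomFilterEnabled would come from table properties such as
  // "write.parquet.bloom-filter-enabled.column.<col>".
  static Map<String, String> bloomFilterConfig(
      FileContent content, Map<String, String> columnBloomFilterEnabled) {
    // Skip bloom filters entirely for delete files: equality deletes are
    // expected to be compacted away, and position deletes have no column
    // worth indexing.
    if (content != FileContent.DATA) {
      return Collections.emptyMap();
    }
    return columnBloomFilterEnabled;
  }

  public static void main(String[] args) {
    Map<String, String> tableConfig = Map.of("id", "true");
    // Data files keep the configured bloom filter columns ...
    System.out.println(bloomFilterConfig(FileContent.DATA, tableConfig));
    // ... while delete files get no bloom filters at all.
    System.out.println(bloomFilterConfig(FileContent.EQUALITY_DELETES, tableConfig));
  }
}
```

The design choice is to branch on the file content type at writer-construction time, so the Parquet layer never sees bloom filter settings for delete files rather than filtering them out later.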
For most use cases, this change makes sense.

If some datasets have high update rates and generate a lot of large delete files, would the bloom filter for delete files be useful too?

If yes, we can introduce a config to enable/disable bloom filters for delete files only. If not, this change is good to me.
Or delete files are expected to be compacted/consolidated anyway, hence a bloom filter on delete files never makes sense.
> If some datasets have high update rates and generate a lot of large delete files, would the bloom filter for delete files be useful too?

First, we don't have filter logic for the bloom filter right now. Second, high-rate updates mostly produce position deletes, which don't contain a column to build a bloom filter on, unless the upstream is deleting a set of records. In that case, the records of the filtered data files should always pass the bloom filter of the equality delete.
> Delete files are expected to be compacted/consolidated anyway, hence a bloom filter on delete files never makes sense.

I think so. Deletes impact read performance, and as far as I know, all real production deployments need async compaction tasks to achieve good performance.
Right now, the Flink upsert job writes bloom filters for equality deletes when bloom filters are enabled for the equality columns. This is useless and should be avoided.