Conversation

@chenjunjiedada
Collaborator

Right now, the Flink upsert job writes bloom filters for equality delete files when bloom filters are enabled for the equality columns. This is useless and should be avoided.

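A minimal sketch of the intended behavior (the class and method below are illustrative, not the actual Iceberg code): per-column bloom filter settings are honored only when writing data files, while delete writers receive an empty config so no bloom filters are written for equality delete files.

    import java.util.Map;
    import com.google.common.collect.ImmutableMap;

    class BloomFilterConfigSketch {
      // Per-column bloom filter settings are applied only to data files; delete
      // writers always get an empty map, so no bloom filters are written for
      // equality delete files.
      static Map<String, String> bloomFilterConfigFor(boolean isDeleteFile,
          Map<String, String> columnBloomFilterEnabled) {
        return isDeleteFile ? ImmutableMap.of() : columnBloomFilterEnabled;
      }
    }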

@chenjunjiedada force-pushed the do-not-write-bf-for-deletes branch from 53ee2c9 to b909176 on May 16, 2023 08:18
@chenjunjiedada requested a review from stevenzwu on May 16, 2023 08:19
@chenjunjiedada
Collaborator Author

chenjunjiedada commented May 17, 2023

@stevenzwu @hililiwei You may want to take a look at this.

@stevenzwu
Contributor

I would get @huaxingao's input on this one.

@huaxingao
Contributor

LGTM

@chenjunjiedada force-pushed the do-not-write-bf-for-deletes branch from b909176 to 737ad5a on May 23, 2023 01:46
@chenjunjiedada
Collaborator Author

I just rebased this since it had conflicts. @stevenzwu, could you take another look?

bloomFilterMaxBytes,
columnBloomFilterEnabled,
PARQUET_BLOOM_FILTER_MAX_BYTES_DEFAULT,
ImmutableMap.of(),
Contributor

For most use cases, this change makes sense.

If some datasets have a high update rate and generate a lot of large delete files, would the bloom filter for delete files be useful too?

If yes, we can introduce a config to enable/disable bloom filters for delete files only. If not, this change is good to me.

@huaxingao @hililiwei @chenjunjiedada
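
If such a knob were wanted, a minimal sketch of a delete-only bloom filter switch could look like the following; the property name and helper here are hypothetical, not an existing Iceberg table property, and are only modeled on the existing "write.parquet.bloom-filter-enabled.column.<col>" convention.

    import java.util.Map;

    class DeleteBloomFilterConfigSketch {
      // Hypothetical property prefix for delete files, patterned after
      // "write.parquet.bloom-filter-enabled.column.<col>".
      static final String DELETE_BLOOM_FILTER_PREFIX =
          "write.delete.parquet.bloom-filter-enabled.column.";

      // Default to false: per this PR, delete writers skip bloom filters unless
      // a user explicitly opts in for a column.
      static boolean deleteBloomFilterEnabled(Map<String, String> tableProps, String column) {
        return Boolean.parseBoolean(
            tableProps.getOrDefault(DELETE_BLOOM_FILTER_PREFIX + column, "false"));
      }
    }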

Contributor

Or, delete files are expected to be compacted/consolidated anyway; hence the bloom filter on delete files never makes sense.

Collaborator Author

If some datasets have a high update rate and generate a lot of large delete files, would the bloom filter for delete files be useful too?

First, we don't have filter logic for the bloom filter on delete files right now. Second, high-rate updates mostly produce position deletes, which don't contain a column to build a bloom filter on, unless the upstream is deleting a set of records. In that case, the records of the filtered data files should always pass the bloom filter of the equality deletes.

delete files are expected to be compacted/consolidated anyway. Hence the bloom filter on delete files never makes sense.

I think so. The deletes impact read performance; as far as I know, all real production deployments need async compaction tasks to compact them to achieve good performance.
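
To illustrate the position-vs-equality point above, here is a simplified sketch (plain records, not the actual Iceberg delete file schemas): a position delete identifies a row by file path and position, so there is no user column to build a bloom filter on, while an equality delete carries the equality-field values themselves.

    // Simplified illustration only; not the Iceberg classes.
    class DeleteRecordSketch {
      // Position delete: "delete the row at this position in this data file".
      record PositionDelete(String filePath, long position) {}

      // Equality delete: "delete any row whose equality column(s) match these values".
      record EqualityDelete(long id) {}
    }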

@stevenzwu merged commit ee1040e into apache:master on May 30, 2023
rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024