Comparing changes

base repository: postgresql-cfbot/postgresql
base: cf/5482~1
head repository: postgresql-cfbot/postgresql
compare: cf/5482
  • 4 commits
  • 14 files changed
  • 3 contributors

Commits on Feb 18, 2025

  1. Reduce the impact of hashjoin batch explosion

    Until now ExecChooseHashTableSize() considered only the size of the
    in-memory hash table when picking the nbatch value, and completely
    ignored the memory needed for the batch files. That memory can be
    substantial, because each batch needs two BufFiles (each with a
    BLCKSZ buffer). The same applies when increasing the number of
    batches during execution.
    
    With enough batches, the batch files may use orders of magnitude more
    memory than the in-memory hash table. But the sizing logic is oblivious
    to this.
    
    It's also possible to trigger a "batch explosion", e.g. due to duplicate
    values or skew in general. We've seen reports of joins with hundreds of
    thousands (or even millions) of batches, consuming gigabytes of memory,
    triggering OOM errors. These cases are fairly rare, but it's clearly
    possible to hit them.
    
    We can't prevent this during planning - improving the planner's
    estimates would not help with an execution-time batch explosion
    anyway. But we can reduce the impact by using as little memory as
    possible.
    
    This patch improves memory usage by rebalancing how memory is
    divided between the hash table and the batch files. Sometimes it's
    better to use fewer batch files, even if that means the hash table
    exceeds the limit.
    
    Whenever we need to increase the capacity of the hash node, we can
    do that either by doubling the number of batches or by doubling the
    size of the in-memory hash table. The outcome is the same - either
    way the hash node can handle a relation twice the size. But the
    memory usage may be very different: for low nbatch values it's
    better to add batches, for high nbatch values it's better to allow
    a larger hash table (a rough sketch of this tradeoff follows below).
    
    This might look like relaxing the memory limit, but that's not
    really the case. The limit could be exceeded in this way all along -
    the memory used by the batch files was simply ignored, as if the
    files were free. This commit improves the situation by taking that
    memory into account when adjusting the nbatch value.
    
    Increasing the hash table memory limit may also help to prevent the
    batch explosion in the first place. Given enough hash collisions or
    duplicate hashes it's easy to get a batch that can't be split,
    resulting in a cycle of quickly doubling the number of batches.
    Allowing the hash table to get larger may stop this cycle, once the
    limit is large enough to fit the skewed data.
    tvondra authored and Commitfest Bot committed Feb 18, 2025
    1b09197
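    A minimal sketch of the tradeoff described above, assuming the simplified
    accounting from the commit message (one BLCKSZ buffer for each of the two
    BufFiles per batch). The function names and signatures are hypothetical;
    this is an illustration, not the actual patch code.

        /* Total memory: in-memory hash table plus the per-batch file buffers. */
        #include <stddef.h>

        #define BLCKSZ 8192        /* default PostgreSQL block size */

        static size_t
        hashjoin_memory(size_t hash_table_bytes, int nbatch)
        {
            /* two BufFiles (inner + outer) per batch, each with a BLCKSZ buffer */
            return hash_table_bytes + (size_t) 2 * nbatch * BLCKSZ;
        }

        /*
         * Double the capacity of the hash node.  Doubling nbatch and doubling
         * the hash table limit both let the node handle a relation twice the
         * size, so pick whichever keeps the total memory footprint lower.
         */
        static void
        double_capacity(size_t *hash_table_bytes, int *nbatch)
        {
            size_t with_more_batches = hashjoin_memory(*hash_table_bytes, *nbatch * 2);
            size_t with_larger_table = hashjoin_memory(*hash_table_bytes * 2, *nbatch);

            if (with_more_batches <= with_larger_table)
                *nbatch *= 2;               /* wins while nbatch is low */
            else
                *hash_table_bytes *= 2;     /* wins once batch buffers dominate */
        }

    For example, with the default 8kB blocks, 1024 batches already imply about
    16MB of BufFile buffers, far more than a 4MB work_mem hash table, so at
    that point growing the hash table is the cheaper way to add capacity.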
  2. Postpone hashtable growth instead of disabling it

    After increasing the number of batches and splitting the current one,
    we used to disable further growth if all tuples went into only one of
    the two new batches. It's possible to construct cases where this
    disables growth prematurely - maybe we can't split the batch now, but
    that doesn't mean we couldn't split it later.
    
    This generally requires an underestimated size of the inner relation,
    so that we need to increase the number of batches, combined with hash
    values that are non-random in some way - perhaps sharing a common
    prefix, or computed from data that is otherwise correlated.
    
    So instead of permanently disabling growth, double the memory limit
    so that we retry the split after processing more data. Doubling the
    limit is somewhat arbitrary - it's the earliest point at which we
    could split the batch in half even if all the current tuples have
    duplicate hashes. But we could pick any other value, to retry sooner
    or later (a rough sketch of this retry logic follows below).
    tvondra authored and Commitfest Bot committed Feb 18, 2025
    baba60a
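    A minimal sketch of the retry logic described above, assuming the hash
    node tracks its in-memory limit in a field such as space_allowed. The
    struct and function names are hypothetical; this illustrates the idea,
    not the actual patch code.

        #include <stddef.h>

        typedef struct HashNodeState
        {
            size_t space_allowed;   /* current in-memory limit for the hash table */
            int    nbatch;          /* number of batches after the last split */
        } HashNodeState;

        /*
         * Called after doubling nbatch and repartitioning the current batch.
         * ntuples_moved is how many tuples went to the new batch, out of
         * ntuples_total tuples that were in the old one.
         */
        static void
        after_batch_split(HashNodeState *hj, long ntuples_moved, long ntuples_total)
        {
            if (ntuples_moved == 0 || ntuples_moved == ntuples_total)
            {
                /*
                 * The split was ineffective: every tuple landed on one side,
                 * e.g. because of duplicate or correlated hash values.  Instead
                 * of permanently disabling further growth, double the limit, so
                 * the split is retried only after enough new data has arrived
                 * that splitting in half could work even if all current tuples
                 * share a hash value.
                 */
                hj->space_allowed *= 2;
            }
            /* otherwise the split worked; keep the doubled nbatch as-is */
        }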
  3. hashjoin patch tests

    tvondra authored and Commitfest Bot committed Feb 18, 2025
    396f729
  4. [CF 52/5482] v20250218 - handle batch explosion in hash joins

    This commit was automatically generated by a robot at cfbot.cputube.org.
    It is based on patches submitted to the PostgreSQL mailing lists and
    registered in the PostgreSQL Commitfest application.
    
    This branch will be overwritten each time a new patch version is posted to
    the email thread, and also periodically to check for bitrot caused by changes
    on the master branch.
    
    Commitfest entry: https://commitfest.postgresql.org/52/5482
    Patch(es): https://www.postgresql.org/message-id/[email protected]
    Author(s):
    Commitfest Bot committed Feb 18, 2025
    cb8b8db