Commit 1b09197

tvondra authored and Commitfest Bot committed
Reduce the impact of hashjoin batch explosion
Until now ExecChooseHashTableSize() considered only the size of the in-memory hash table when picking the nbatch value, and completely ignored the memory needed for the batch files. That can be a lot of memory, because each batch needs two BufFiles (each with a BLCKSZ buffer). The same applies when increasing the number of batches during execution. With enough batches, the batch files may use orders of magnitude more memory than the in-memory hash table, but the sizing logic is oblivious to this.

It's also possible to trigger a "batch explosion", e.g. due to duplicate values or skew in general. We've seen reports of joins with hundreds of thousands (or even millions) of batches, consuming gigabytes of memory and triggering OOM errors. These cases are fairly rare, but it's clearly possible to hit them.

We can't prevent this during planning - we could improve the planning, but that does nothing for the execution-time batch explosion. But we can reduce the impact by using as little memory as possible.

This patch improves memory usage by rebalancing how the memory is divided between the hash table and the batch files. Sometimes it's better to use fewer batch files, even if it means the hash table exceeds the limit.

Whenever we need to increase the capacity of the hash node, we can do that by either doubling the number of batches or doubling the size of the in-memory hash table. The outcome is the same, allowing the hash node to handle a relation twice the size. But the memory usage may be very different - for low nbatch values it's better to add batches, for high nbatch values it's better to allow a larger hash table.

It might seem like this relaxes the memory limit, but that's not really the case. It has always been like that, except that the memory used by batches was ignored, as if the files were free. This commit improves the situation by considering this memory when adjusting nbatch values.

Increasing the hashtable memory limit may also help to prevent the batch explosion in the first place. Given enough hash collisions or duplicate hashes it's easy to get a batch that can't be split, resulting in a cycle of quickly doubling the number of batches. Allowing the hashtable to get larger may stop this, once the batch gets large enough to fit the skewed data.
1 parent c623e85 commit 1b09197
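To make the trade-off in the commit message concrete, here is a minimal standalone sketch (not code from the patch): it evaluates the total memory given by the message's formula, hash table plus two BLCKSZ file buffers per batch, and halves nbatch while that total keeps shrinking. BLCKSZ, the 4GB inner relation and the starting nbatch of 1024 are assumed example values.

#include <stdio.h>

#define BLCKSZ 8192ULL			/* assumed 8kB block size */

int
main(void)
{
	unsigned long long inner_rel_bytes = 4ULL * 1024 * 1024 * 1024;	/* 4GB inner side */
	unsigned long long nbatch = 1024;	/* as if picked while ignoring batch files */

	/* keep halving nbatch while that lowers the total memory use */
	while (nbatch > 1)
	{
		unsigned long long current_space = inner_rel_bytes / nbatch + 2 * nbatch * BLCKSZ;
		unsigned long long halved_space = inner_rel_bytes / (nbatch / 2) + nbatch * BLCKSZ;

		if (current_space < halved_space)
			break;				/* bottom of the u-shaped curve */

		nbatch /= 2;
	}

	printf("nbatch = %llu, hash table = %llu bytes, file buffers = %llu bytes\n",
		   nbatch, inner_rel_bytes / nbatch, 2 * nbatch * BLCKSZ);
	return 0;
}

The loop stops where the hash table and the file buffers are roughly the same size (here nbatch = 512, 8MB each), which is exactly the point where allowing a larger hash table becomes cheaper than keeping twice as many batch files.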

File tree

1 file changed: +129, -0 lines changed


src/backend/executor/nodeHash.c

@@ -848,6 +848,90 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = pg_nextpower2_32(Max(2, minbatch));
 	}
 
+	/*
+	 * Optimize the total amount of memory consumed by the hash node.
+	 *
+	 * The nbatch calculation above focuses on the size of the in-memory hash
+	 * table, assuming no per-batch overhead. Now adjust the number of batches
+	 * and the size of the hash table to minimize total memory consumed by the
+	 * hash node.
+	 *
+	 * Each batch file has a BLCKSZ buffer, and we may need two files per
+	 * batch (inner and outer side). So with enough batches this can be
+	 * significantly more memory than the hashtable itself.
+	 *
+	 * The total memory usage may be expressed by this formula:
+	 *
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 *
+	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
+	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
+	 * buffers. But for sufficiently large values of inner_rel_bytes there
+	 * may not be an nbatch value that would make both parts fit into
+	 * hash_table_bytes.
+	 *
+	 * In this case we can't enforce the memory limit - we're going to exceed
+	 * it. We can however minimize the impact and use as little memory as
+	 * possible. (We haven't really enforced it before either, as we simply
+	 * ignored the batch files.)
+	 *
+	 * The formula for total memory usage says that given an inner relation of
+	 * size inner_rel_bytes, we may divide it into an arbitrary number of
+	 * batches. This determines both the size of the in-memory hash table and
+	 * the amount of memory needed for batch files. These two terms work in
+	 * opposite ways - when one decreases, the other increases.
+	 *
+	 * For low nbatch values, the hash table takes most of the memory, but at
+	 * some point the batch files start to dominate. If you combine these two
+	 * terms, the memory consumption (for a fixed size of the inner relation)
+	 * has a u-shape, with a minimum at some nbatch value.
+	 *
+	 * Our goal is to find this nbatch value, minimizing the memory usage. We
+	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
+	 * if it's lower than the current memory usage we know it's better to use
+	 * fewer batches. We repeat this until reducing the number of batches does
+	 * not reduce the memory usage - we found the optimum. We know the optimum
+	 * exists, thanks to the u-shape.
+	 *
+	 * We only want to do this when exceeding the memory limit, not every
+	 * time. The goal is not to minimize memory usage in every case, but to
+	 * minimize the memory usage when we can't stay within the memory limit.
+	 *
+	 * For this reason we only consider reducing the number of batches. We
+	 * could try the opposite direction too, but that would save memory only
+	 * when most of the memory is used by the hash table. And the hash table
+	 * was used for the initial sizing, so we shouldn't be exceeding the
+	 * memory limit too much. We might save memory by using more batches, but
+	 * it would result in spilling more batch files, which does not seem like
+	 * a great trade off.
+	 *
+	 * While growing the hashtable, we also adjust the number of buckets, to
+	 * not have more than one tuple per bucket (load factor 1). We can only do
+	 * this during the initial sizing - once we start building the hash,
+	 * nbucket is fixed.
+	 */
+	while (nbatch > 0)
+	{
+		/* how much memory are we using with current nbatch value */
+		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+
+		/* how much memory would we use with half the batches */
+		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+
+		/* If the memory usage would not decrease, we found the optimum. */
+		if (current_space < new_space)
+			break;
+
+		/*
+		 * It's better to use half the batches, so do that and adjust the
+		 * nbucket in the opposite direction, and double the allowance.
+		 */
+		nbatch /= 2;
+		nbuckets *= 2;
+
+		*space_allowed = (*space_allowed) * 2;
+	}
+
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);
 
@@ -890,6 +974,47 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Consider adjusting the allowed hash table size, depending on the number
+ * of batches, to minimize the overall memory usage (for both the hashtable
+ * and batch files).
+ *
+ * We're adjusting the size of the hash table, not the (optimal) number of
+ * buckets. We can't change that once we start building the hash, due to how
+ * ExecHashGetBucketAndBatch calculates batchno/bucketno from the hash. This
+ * means the load factor may not be optimal, but we're in damage control so
+ * we accept slower lookups. It's still much better than batch explosion.
+ *
+ * Returns true if we chose to increase the batch size (and thus we don't
+ * need to add batches), and false if we should increase nbatch.
+ */
+static bool
+ExecHashIncreaseBatchSize(HashJoinTable hashtable)
+{
+	/*
+	 * How much additional memory would doubling nbatch use? Each batch may
+	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
+	 */
+	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+
+	/*
+	 * Compare the new space needed for doubling nbatch and for enlarging the
+	 * in-memory hash table. If doubling the hash table needs less memory,
+	 * just do that. Otherwise, continue with doubling the nbatch.
+	 *
+	 * We're either doubling spaceAllowed or batchSpace, so comparing which of
+	 * those increases the memory usage the least is the same as comparing the
+	 * values directly.
+	 */
+	if (hashtable->spaceAllowed <= batchSpace)
+	{
+		hashtable->spaceAllowed *= 2;
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
@@ -913,6 +1038,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
 		return;
 
+	/* consider increasing size of the in-memory hash table instead */
+	if (ExecHashIncreaseBatchSize(hashtable))
+		return;
+
 	nbatch = oldnbatch * 2;
 	Assert(nbatch > 1);
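For the execution-time path, the following simplified model (standalone C, not PostgreSQL code; FakeHashTable, increase_batch_size and the starting sizes are invented for illustration) mimics the decision ExecHashIncreaseBatchSize() makes each time the hash node runs out of space: doubling nbatch would add roughly nbatch * 2 * BLCKSZ of file-buffer memory, doubling the allowance would add spaceAllowed, so the cheaper of the two doublings is taken.

#include <stdbool.h>
#include <stdio.h>

#define BLCKSZ 8192			/* assumed 8kB block size */

typedef struct
{
	int			nbatch;
	size_t		spaceAllowed;
} FakeHashTable;

/* returns true if we grew the in-memory allowance instead of adding batches */
static bool
increase_batch_size(FakeHashTable *ht)
{
	/* memory that doubling nbatch would add (two BLCKSZ buffers per new batch) */
	size_t		batchSpace = (size_t) ht->nbatch * 2 * BLCKSZ;

	if (ht->spaceAllowed <= batchSpace)
	{
		ht->spaceAllowed *= 2;
		return true;
	}
	return false;
}

int
main(void)
{
	FakeHashTable ht = {.nbatch = 1, .spaceAllowed = 4UL * 1024 * 1024};

	/* simulate repeated "hash table is full" events */
	for (int i = 0; i < 12; i++)
	{
		if (!increase_batch_size(&ht))
			ht.nbatch *= 2;
		printf("step %2d: nbatch = %5d, spaceAllowed = %zu\n",
			   i + 1, ht.nbatch, ht.spaceAllowed);
	}
	return 0;
}

Up to the crossover (nbatch = 256 with a 4MB allowance in this example) the model keeps adding batches; after that the two actions alternate, so each capacity doubling grows total memory by the smaller of the two terms, which is the behaviour the commit message argues for.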
