
Commit 293e24e

Cache hash index's metapage in rel->rd_amcache.
This avoids a very significant amount of buffer manager traffic and contention when scanning hash indexes, because it's no longer necessary to lock and pin the metapage for every scan. We do need some way of figuring out when the cache is too stale to use, so that when we lock the primary bucket page to which the cached metapage points us, we can tell whether a split has occurred since we cached the metapage data. To do that, we use the hasho_prevblkno field in the primary bucket page, which would otherwise always be set to InvalidBlockNumber.

This patch contains code so that hash indexes built before this change will continue to work (although less efficiently), but perhaps we should consider bumping the hash version and ripping out the compatibility code. That decision can be made later, though.

Mithun Cy, reviewed by Jesper Pedersen, Amit Kapila, and by me. Before committing, I made a number of cosmetic changes to the last posted version of the patch, adjusted _hash_getcachedmetap to be more careful about order of operations, and made some necessary updates to the pageinspect documentation and regression tests.
1 parent: 39c3ca5
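For readers skimming the diffs below, here is a minimal sketch of the caching scheme the commit message describes. It is an illustration only, built from the names visible in the hunks (rel->rd_amcache, HashMetaPageData, _hash_getbuf); the committed helper is _hash_getcachedmetap, which, as the message notes, is more careful about the order of operations than this sketch, and get_cached_metap_sketch is only an illustrative name.

    /*
     * Illustrative sketch, not the committed code: keep a private copy of
     * the hash metapage in the relcache entry so that ordinary scans and
     * inserts need not lock and pin the metapage each time.
     */
    static HashMetaPage
    get_cached_metap_sketch(Relation rel, Buffer *metabuf, bool force_refresh)
    {
        if (rel->rd_amcache == NULL || force_refresh)
        {
            /* Allocate the cache in the index's relcache memory context. */
            if (rel->rd_amcache == NULL)
                rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
                                                     sizeof(HashMetaPageData));

            /* Pin and share-lock the metapage just long enough to copy it. */
            if (!BufferIsValid(*metabuf))
                *metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ,
                                        LH_META_PAGE);
            else
                LockBuffer(*metabuf, BUFFER_LOCK_SHARE);

            memcpy(rel->rd_amcache, HashPageGetMeta(BufferGetPage(*metabuf)),
                   sizeof(HashMetaPageData));

            /* Release the lock but keep the pin, as the callers below do. */
            LockBuffer(*metabuf, BUFFER_LOCK_UNLOCK);
        }

        return (HashMetaPage) rel->rd_amcache;
    }

Callers that must see a current value (for example, hashbulkdelete once it notices that a bucket's hasho_prevblkno exceeds the cached bucket count) pass force_refresh = true, as the hash.c hunk below shows.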

8 files changed, +279 -181 lines


contrib/pageinspect/expected/hash.out (+4 -4)
@@ -98,7 +98,7 @@ hash_page_stats(get_raw_page('test_hash_a_idx', 1));
 live_items      | 0
 dead_items      | 0
 page_size       | 8192
-hasho_prevblkno | 4294967295
+hasho_prevblkno | 3
 hasho_nextblkno | 4294967295
 hasho_bucket    | 0
 hasho_flag      | 2
@@ -111,7 +111,7 @@ hash_page_stats(get_raw_page('test_hash_a_idx', 2));
 live_items      | 0
 dead_items      | 0
 page_size       | 8192
-hasho_prevblkno | 4294967295
+hasho_prevblkno | 3
 hasho_nextblkno | 4294967295
 hasho_bucket    | 1
 hasho_flag      | 2
@@ -124,7 +124,7 @@ hash_page_stats(get_raw_page('test_hash_a_idx', 3));
 live_items      | 1
 dead_items      | 0
 page_size       | 8192
-hasho_prevblkno | 4294967295
+hasho_prevblkno | 3
 hasho_nextblkno | 4294967295
 hasho_bucket    | 2
 hasho_flag      | 2
@@ -137,7 +137,7 @@ hash_page_stats(get_raw_page('test_hash_a_idx', 4));
 live_items      | 0
 dead_items      | 0
 page_size       | 8192
-hasho_prevblkno | 4294967295
+hasho_prevblkno | 3
 hasho_nextblkno | 4294967295
 hasho_bucket    | 3
 hasho_flag      | 2

doc/src/sgml/pageinspect.sgml (+1 -1)
@@ -539,7 +539,7 @@ live_items | 407
 dead_items      | 0
 page_size       | 8192
 free_size       | 8
-hasho_prevblkno | 4294967295
+hasho_prevblkno | 4096
 hasho_nextblkno | 8474
 hasho_bucket    | 0
 hasho_flag      | 66

src/backend/access/hash/README (+46 -22)
@@ -149,6 +149,50 @@ We choose to always lock the lower-numbered bucket first.  The metapage is
 only ever locked after all bucket locks have been taken.
 
 
+Metapage Caching
+----------------
+
+Both scanning the index and inserting tuples require locating the bucket
+where a given tuple ought to be located.  To do this, we need the bucket
+count, highmask, and lowmask from the metapage; however, it's undesirable
+for performance reasons to have to lock and pin the metapage for every
+such operation.  Instead, we retain a cached copy of the metapage in each
+backend's relcache entry.  This will produce the correct bucket mapping
+as long as the target bucket hasn't been split since the last cache
+refresh.
+
+To guard against the possibility that such a split has occurred, the
+primary page of each bucket chain stores the number of buckets that
+existed as of the time the bucket was last split, or if never split as
+of the time it was created, in the space normally used for the
+previous block number (that is, hasho_prevblkno).  This doesn't cost
+anything because the primary bucket page is always the first page in
+the chain, and the previous block number is therefore always, in
+reality, InvalidBlockNumber.
+
+After computing the ostensibly-correct bucket number based on our cached
+copy of the metapage, we lock the corresponding primary bucket page and
+check whether the bucket count stored in hasho_prevblkno is greater than
+the number of buckets stored in our cached copy of the metapage.  If so,
+the bucket has certainly been split, because the stored count must
+originally have been less than the number of buckets that existed at
+that time and can't have increased except due to a split.  If not, the
+bucket can't have been split, because a split would have created a new
+bucket with a higher bucket number than any we'd seen previously.  In
+the latter case, we've locked the correct bucket and can proceed; in the
+former case, we must release the lock on this bucket, lock the metapage,
+update our cache, unlock the metapage, and retry.
+
+Needing to retry occasionally might seem expensive, but the number of
+times any given bucket can be split is limited to a few dozen no matter
+how many times the hash index is accessed, because the total number of
+buckets is limited to less than 2^32.  On the other hand, the number of
+times we access a bucket is unbounded and will be several orders of
+magnitude larger even in unsympathetic cases.
+
+(The metapage cache is new in v10.  Older hash indexes had the primary
+bucket page's hasho_prevblkno initialized to InvalidBlockNumber.)
+
 Pseudocode Algorithms
 ---------------------

@@ -188,17 +232,7 @@ track of available overflow pages.
 
 The reader algorithm is:
 
-    pin meta page and take buffer content lock in shared mode
-    loop:
-        compute bucket number for target hash key
-        release meta page buffer content lock
-        if (correct bucket page is already locked)
-            break
-        release any existing bucket page buffer content lock (if a concurrent
-         split happened)
-        take the buffer content lock on bucket page in shared mode
-        retake meta page buffer content lock in shared mode
-    release pin on metapage
+    lock the primary bucket page of the target bucket
     if the target bucket is still being populated by a split:
         release the buffer content lock on current bucket page
         pin and acquire the buffer content lock on old bucket in shared mode
@@ -238,17 +272,7 @@ which this bucket is formed by split.
 
 The insertion algorithm is rather similar:
 
-    pin meta page and take buffer content lock in shared mode
-    loop:
-        compute bucket number for target hash key
-        release meta page buffer content lock
-        if (correct bucket page is already locked)
-            break
-        release any existing bucket page buffer content lock (if a concurrent
-         split happened)
-        take the buffer content lock on bucket page in exclusive mode
-        retake meta page buffer content lock in shared mode
-    release pin on metapage
+    lock the primary bucket page of the target bucket
 -- (so far same as reader, except for acquisition of buffer content lock in
     exclusive mode on primary bucket page)
     if the bucket-being-split flag is set for a bucket and pin count on it is
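The check-and-retry protocol described in the new README text above is compact in code. What follows is a hedged sketch using the names that appear in this commit's hunks, not the committed implementation; the real lookup loop was added as _hash_getbucketbuf_from_hashkey (called from the hashinsert.c hunk further down), and get_target_bucket_sketch is only an illustrative name.

    /*
     * Illustrative sketch of the stale-cache check described above; not the
     * committed _hash_getbucketbuf_from_hashkey.
     */
    static Buffer
    get_target_bucket_sketch(Relation rel, uint32 hashkey, int access,
                             HashMetaPage *cachedmetap, Buffer *metabuf)
    {
        for (;;)
        {
            Bucket      bucket;
            BlockNumber blkno;
            Buffer      buf;
            HashPageOpaque opaque;

            /* Compute the apparently-correct bucket from the cached metapage. */
            bucket = _hash_hashkey2bucket(hashkey,
                                          (*cachedmetap)->hashm_maxbucket,
                                          (*cachedmetap)->hashm_highmask,
                                          (*cachedmetap)->hashm_lowmask);
            blkno = BUCKET_TO_BLKNO(*cachedmetap, bucket);

            /* Lock the primary bucket page that the cache points us at. */
            buf = _hash_getbuf(rel, blkno, access, LH_BUCKET_PAGE);
            opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

            /*
             * On a primary bucket page, hasho_prevblkno now holds
             * hashm_maxbucket as of the bucket's last split (it is
             * InvalidBlockNumber in pre-v10 indexes).  If it does not exceed
             * the cached bucket count, no split has invalidated the mapping
             * and we hold the right bucket.
             */
            if (opaque->hasho_prevblkno == InvalidBlockNumber ||
                opaque->hasho_prevblkno <= (*cachedmetap)->hashm_maxbucket)
                return buf;

            /* Stale cache: release this bucket, refresh the cache, retry. */
            _hash_relbuf(rel, buf);
            *cachedmetap = _hash_getcachedmetap(rel, metabuf, true);
        }
    }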

src/backend/access/hash/hash.c (+33 -26)
@@ -507,28 +507,24 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     Bucket      orig_maxbucket;
     Bucket      cur_maxbucket;
     Bucket      cur_bucket;
-    Buffer      metabuf;
+    Buffer      metabuf = InvalidBuffer;
     HashMetaPage metap;
-    HashMetaPageData local_metapage;
+    HashMetaPage cachedmetap;
 
     tuples_removed = 0;
     num_index_tuples = 0;
 
     /*
-     * Read the metapage to fetch original bucket and tuple counts.  Also, we
-     * keep a copy of the last-seen metapage so that we can use its
-     * hashm_spares[] values to compute bucket page addresses.  This is a bit
-     * hokey but perfectly safe, since the interesting entries in the spares
-     * array cannot change under us; and it beats rereading the metapage for
-     * each bucket.
+     * We need a copy of the metapage so that we can use its hashm_spares[]
+     * values to compute bucket page addresses, but a cached copy should be
+     * good enough.  (If not, we'll detect that further down and refresh the
+     * cache as necessary.)
      */
-    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
-    metap = HashPageGetMeta(BufferGetPage(metabuf));
-    orig_maxbucket = metap->hashm_maxbucket;
-    orig_ntuples = metap->hashm_ntuples;
-    memcpy(&local_metapage, metap, sizeof(local_metapage));
-    /* release the lock, but keep pin */
-    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+    cachedmetap = _hash_getcachedmetap(rel, &metabuf, false);
+    Assert(cachedmetap != NULL);
+
+    orig_maxbucket = cachedmetap->hashm_maxbucket;
+    orig_ntuples = cachedmetap->hashm_ntuples;
 
     /* Scan the buckets that we know exist */
     cur_bucket = 0;
@@ -546,7 +542,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
         bool        split_cleanup = false;
 
         /* Get address of bucket's start page */
-        bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
+        bucket_blkno = BUCKET_TO_BLKNO(cachedmetap, cur_bucket);
 
         blkno = bucket_blkno;
 
@@ -577,20 +573,27 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
              * hashm_lowmask might be old enough to cause us to fail to remove
              * tuples left behind by the most recent split.  To prevent that,
              * now that the primary page of the target bucket has been locked
-             * (and thus can't be further split), update our cached metapage
-             * data.
+             * (and thus can't be further split), check whether we need to
+             * update our cached metapage data.
+             *
+             * NB: The check for InvalidBlockNumber is only needed for
+             * on-disk compatibility with indexes created before we started
+             * storing hashm_maxbucket in the primary page's hasho_prevblkno.
              */
-            LockBuffer(metabuf, BUFFER_LOCK_SHARE);
-            memcpy(&local_metapage, metap, sizeof(local_metapage));
-            LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+            if (bucket_opaque->hasho_prevblkno != InvalidBlockNumber &&
+                bucket_opaque->hasho_prevblkno > cachedmetap->hashm_maxbucket)
+            {
+                cachedmetap = _hash_getcachedmetap(rel, &metabuf, true);
+                Assert(cachedmetap != NULL);
+            }
         }
 
         bucket_buf = buf;
 
         hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, info->strategy,
-                          local_metapage.hashm_maxbucket,
-                          local_metapage.hashm_highmask,
-                          local_metapage.hashm_lowmask, &tuples_removed,
+                          cachedmetap->hashm_maxbucket,
+                          cachedmetap->hashm_highmask,
+                          cachedmetap->hashm_lowmask, &tuples_removed,
                           &num_index_tuples, split_cleanup,
                           callback, callback_state);
 
@@ -600,16 +603,20 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
         cur_bucket++;
     }
 
+    if (BufferIsInvalid(metabuf))
+        metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE);
+
     /* Write-lock metapage and check for split since we started */
     LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
     metap = HashPageGetMeta(BufferGetPage(metabuf));
 
     if (cur_maxbucket != metap->hashm_maxbucket)
     {
         /* There's been a split, so process the additional bucket(s) */
-        cur_maxbucket = metap->hashm_maxbucket;
-        memcpy(&local_metapage, metap, sizeof(local_metapage));
         LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+        cachedmetap = _hash_getcachedmetap(rel, &metabuf, true);
+        Assert(cachedmetap != NULL);
+        cur_maxbucket = cachedmetap->hashm_maxbucket;
         goto loop_top;
     }

src/backend/access/hash/hashinsert.c (+18 -65)
@@ -32,19 +32,14 @@ _hash_doinsert(Relation rel, IndexTuple itup)
     Buffer      bucket_buf;
     Buffer      metabuf;
     HashMetaPage metap;
-    BlockNumber blkno;
-    BlockNumber oldblkno;
-    bool        retry;
+    HashMetaPage usedmetap = NULL;
     Page        metapage;
     Page        page;
     HashPageOpaque pageopaque;
     Size        itemsz;
     bool        do_expand;
     uint32      hashkey;
     Bucket      bucket;
-    uint32      maxbucket;
-    uint32      highmask;
-    uint32      lowmask;
 
     /*
      * Get the hash key for the item (it's stored in the index tuple itself).
@@ -57,10 +52,14 @@ _hash_doinsert(Relation rel, IndexTuple itup)
                                * need to be consistent */
 
 restart_insert:
-    /* Read the metapage */
-    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+
+    /*
+     * Read the metapage.  We don't lock it yet; HashMaxItemSize() will
+     * examine pd_pagesize_version, but that can't change so we can examine
+     * it without a lock.
+     */
+    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE);
     metapage = BufferGetPage(metabuf);
-    metap = HashPageGetMeta(metapage);
 
     /*
      * Check whether the item can fit on a hash page at all. (Eventually, we
@@ -76,66 +75,17 @@ _hash_doinsert(Relation rel, IndexTuple itup)
                    itemsz, HashMaxItemSize(metapage)),
              errhint("Values larger than a buffer page cannot be indexed.")));
 
-    oldblkno = InvalidBlockNumber;
-    retry = false;
-
-    /*
-     * Loop until we get a lock on the correct target bucket.
-     */
-    for (;;)
-    {
-        /*
-         * Compute the target bucket number, and convert to block number.
-         */
-        bucket = _hash_hashkey2bucket(hashkey,
-                                      metap->hashm_maxbucket,
-                                      metap->hashm_highmask,
-                                      metap->hashm_lowmask);
-
-        blkno = BUCKET_TO_BLKNO(metap, bucket);
-
-        /*
-         * Copy bucket mapping info now; refer the comment in
-         * _hash_expandtable where we copy this information before calling
-         * _hash_splitbucket to see why this is okay.
-         */
-        maxbucket = metap->hashm_maxbucket;
-        highmask = metap->hashm_highmask;
-        lowmask = metap->hashm_lowmask;
-
-        /* Release metapage lock, but keep pin. */
-        LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-        /*
-         * If the previous iteration of this loop locked the primary page of
-         * what is still the correct target bucket, we are done.  Otherwise,
-         * drop any old lock before acquiring the new one.
-         */
-        if (retry)
-        {
-            if (oldblkno == blkno)
-                break;
-            _hash_relbuf(rel, buf);
-        }
-
-        /* Fetch and lock the primary bucket page for the target bucket */
-        buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
-
-        /*
-         * Reacquire metapage lock and check that no bucket split has taken
-         * place while we were awaiting the bucket lock.
-         */
-        LockBuffer(metabuf, BUFFER_LOCK_SHARE);
-        oldblkno = blkno;
-        retry = true;
-    }
+    /* Lock the primary bucket page for the target bucket. */
+    buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_WRITE,
+                                          &usedmetap);
+    Assert(usedmetap != NULL);
 
     /* remember the primary bucket buffer to release the pin on it at end. */
     bucket_buf = buf;
 
     page = BufferGetPage(buf);
     pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
-    Assert(pageopaque->hasho_bucket == bucket);
+    bucket = pageopaque->hasho_bucket;
 
     /*
      * If this bucket is in the process of being split, try to finish the
@@ -151,8 +101,10 @@ _hash_doinsert(Relation rel, IndexTuple itup)
         /* release the lock on bucket buffer, before completing the split. */
         LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-        _hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
-                           maxbucket, highmask, lowmask);
+        _hash_finish_split(rel, metabuf, buf, bucket,
+                           usedmetap->hashm_maxbucket,
+                           usedmetap->hashm_highmask,
+                           usedmetap->hashm_lowmask);
 
         /* release the pin on old and meta buffer. retry for insert. */
         _hash_dropbuf(rel, buf);
@@ -225,6 +177,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
      */
     LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+    metap = HashPageGetMeta(metapage);
     metap->hashm_ntuples += 1;
 
     /* Make sure this stays in sync with _hash_expandtable() */
