git.postgresql.org Git - postgresql.git/log

Replace binaryheap + index with pairingheap in reorderbuffer.c

A pairing heap can perform the same operations as the binary heap +
index, with as good or better algorithmic complexity, and that's an
existing data structure so that we don't need to invent anything new
compared to v16. This commit makes the new binaryheap functionality
that was added in commits b840508644 and bcb14f4abc unnecessary, but
they will be reverted separately.

Remove the optimization to only build and maintain the heap when the
amount of memory used is close to the limit, becuase the bookkeeping
overhead with the pairing heap seems to be small enough that it
doesn't matter in practice.

Reported-by: Jeff Davis
Author: Heikki Linnakangas
Reviewed-by: Michael Paquier, Hayato Kuroda, Masahiko Sawada
Discussion: https://postgr.es/m/12747c15811d94efcc5cda72d6b35c80d7bf3443.camel%40j-davis.com

Fix grammar.

Reported-by: Michael Paquier <[email protected]>
Discussion: https://postgr.es/m/ZhdKqj5DwoOzirFv%40paquier.xyz

Fix potential stack overflow in incremental backup.

The user can set RELSEG_SIZE to a high number at compile time, so we
can't use it to control the size of an array on the stack: it could be
many gigabytes in size. On closer inspection, we don't really need that
intermediate array anyway. Let's just write directly into the output
array, and then perform the absolute->relative adjustment in place.
This fixes new code from commit dc212340058.

Reviewed-by: Robert Haas <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com

Fix inconsistency with replay of hash squeeze record for clean buffers

aa5edbe379d6 has tweaked _hash_freeovflpage() so as the write buffer's
LSN is updated only when necessary, when REGBUF_NO_CHANGE is not used.

The replay code was not consistent with that, causing the write buffer's
LSN to be updated and its page to be marked as dirty even if the buffer
was registered in a "clean" state.  This was possible for the case of a
squeeze record when there are no tuples to add to the write buffer, for
(is_prim_bucket_same_wrt && !is_prev_bucket_same_wrt).

I have performed some validation of this commit with
wal_consistency_checking and a change in WAL that logs REGBUF_NO_CHANGE
to a new BKPIMAGE_*.  Thanks to that, it is possible to know at replay
if a buffer was clean when it was registered, then cross-checked the LSN
of the "clean" page copy coming from WAL with the LSN of the block once
the record has been replayed.  This eats one bit in bimg_info, which is
not acceptable to be integrated as-is, but it could become handy in the
future.  I didn't spot other areas than the one fixed by this commit at
the extent of what the main regression test suite covers.

As this is an oversight in aa5edbe379d6, no backpatch is required.

Reported-by: Zubeyr Eryilmaz
Author: Hayato Kuroda
Reviewed-by: Amit Kapila, Michael Paquier
Discussion: https://postgr.es/m/[email protected]

Fix plpgsql's handling of -- comments following expressions.

Up to now, read_sql_construct() has collected all the source text from
the statement or expression's initial token up to the character just
before the "until" token.  It normally tries to strip trailing
whitespace from that, largely for neatness.  If there was a "-- text"
comment after the expression, this resulted in removing the newline
that terminates the comment, which creates a hazard if we try to paste
the collected text into a larger SQL construct without inserting a
newline after it.  In particular this caused our handling of CASE
constructs to fail if there's a comment after a WHEN expression.

Commit 4adead1d2 noticed a similar problem with cursor arguments,
and worked around it through the rather crude hack of suppressing
the whitespace-trimming behavior for those.  Rather than do that
and leave the hazard open for future hackers to trip over, let's
fix it properly.  pl_scanner.c already has enough infrastructure
to report the end location of the expression's last token, so
we can copy up to that location and never collect any trailing
whitespace or comment to begin with.

Erik Wienhold and Tom Lane, per report from Michal Bartak.
Back-patch to all supported branches.

Discussion: https://postgr.es/m/CAAVzF_FjRoi8fOVuLCZhQJx6HATQ7MKm=aFOHWZODFnLmjX-xA@mail.gmail.com

Doc: Update ulinks to RFC documents to avoid redirect

The tools.ietf.org site has been decommissioned and replaced by a
number of sites serving various purposes. Links to RFCs and BCPs
are now 301 redirected to their new respective IETF sites. Since
this serves no purpose and only adds network overhead, update our
links to the new locations.

Backpatch to all supported versions.

Discussion: https://postgr.es/m/3C1CEA99-FCED-447D-9858-5A579B4C6687@yesql.se
Backpatch-through: v12

Make GIN tests using injection points concurrent-safe

f587338dec87 has introduced in the test module injection_points a SQL
function called injection_points_set_local(), that can be used to make
all the injection points linked to the process where they are attached,
discarded automatically if any remain once the process exits.

e2e3b8ae9ed7 has added a NO_INSTALLCHECK to the test module to prevent
the use of installcheck. Now that there is a way to make the test
concurrent-safe, let's use it and remove the installcheck restriction.

Concurrency issues could be easily reproduced by running in a tight
loop a command like this one, in src/test/modules/gin/ (hardcoding
pg_sleep() after attaching injection points enlarges the race window)
and a second test suite like contrib/btree_gin/:

make installcheck USE_MODULE_DB=1

Reviewed-by: Andrey Borodin
Discussion: https://postgr.es/m/[email protected]

Fix a test in failover slots regression test.

Wait for the standby to catch up before syncing the slots with
pg_sync_replication_slots(), otherwise, the logical slot could be ahead
and the sync would fail.

The other way to fix the test is to change it to use slotsync worker and
poll for the sync to get finished but the current approach is better as
this is a predictable way to write the test.

Per buildfarm

Author: Hou Zhijie
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB571665359F2F5DCD3ADABC9F94002@OS0PR01MB5716.jpnprd01.prod.outlook.com

Fix illegal attribute propagation in LLVM JIT.

Commit 72559438 started copying more attributes from AttributeTemplate
to the functions we generate on the fly. In the case of deform
functions, which return void, this meant that "noundef", from
AttributeTemplate's return value (a Datum) was copied to a void type.
Older LLVM releases were OK with that, but LLVM 18 crashes.

Update our llvm_copy_attributes() function to skip copying the attribute
for the return value, if the target function returns void.

Thanks to Dmitry Dolgov for help chasing this down.

Back-patch to all supported releases, like 72559438.

Reported-by: Pavel Stehule <[email protected]>
Reviewed-by: Dmitry Dolgov <[email protected]>
Discussion: https://postgr.es/m/CAFj8pRACpVFr7LMdVYENUkScG5FCYMZDDdSGNU-tch%2Bw98OxYg%40mail.gmail.com

Fixup various StringInfo function usages

This adjusts various appendStringInfo* function calls to use a more
appropriate and efficient function with the same behavior. For example,
use appendStringInfoChar() when appending a single character rather than
appendStringInfo() and appendStringInfoString() when no formatting is
required rather than using appendStringInfo().

All adjustments made here are in code that's new to v17, so it makes
sense to fix these now rather than wait a few years and make
backpatching harder.

Discussion: https://postgr.es/m/CAApHDvojY2UvMiO+9_55ArTj10P1LBNJyyoGB+C65BLDNT0GsQ@mail.gmail.com
Reviewed-by: Nathan Bossart, Tom Lane

revert: Transform OR clauses to ANY expression

This commit reverts 72bd38cc99 due to implementation and design issues.

Reported-by: Tom Lane
Discussion: https://postgr.es/m/3604469.1712628736%40sss.pgh.pa.us

Remove unused BumpBlockIsValid macro

The bump allocator was recently added in 29f6a959c.  Our other
allocators have a similar macro to this, but seemingly the version of
the macro for those allocators is only used in places where the chunk
header is decoded.  Since the bump allocator has no chunk header, none
of those functions exist for bump therefore macro is unused.  Remove it.

Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/5f724fb2-96e1-4f36-b65b-47b337ad432e@eisentraut.org

Checks for ALTER TABLE ... SPLIT/MERGE PARTITIONS ... commands

Check that the target partition actually belongs to the parent table.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/cd842601-cf1a-9806-f7b7-d2509b93ba61%40gmail.com
Author: Dmitry Koval

Doc: use "an SQL" instead of "a SQL"

Although which is correct depends entirely on whether you pronounce SQL
as "ess-que-ell" or "sequel", we have standardized on the former in our
user-facing documentation, so use the correct article according to that
pronunciation.

Discussion: https://postgr.es/m/CAApHDvp3osQwQam+wNTp9BdhP+QfWO6aY6ZTixQQMfM-UArKCw@mail.gmail.com

doc: Remove stray comma from list of psql options

Back in 7.2 the list of options had short options and long options
on the same line separated by comma, but since 7.3 they are listed
separate lines. The comma on -X was left behind so fix by removing
and backpatching all the way.

Reported-by: [email protected]
Discussion: https://postgr.es/m/171267154345.684.7212826057932148541@wrigleys.postgresql.org
Backpatch-through: v12

Fix incorrect format placeholders

Update config.guess and config.sub

Fix whitespace

Get rid of anonymous struct

This is a C11 feature, and we require C99. While at it, go the further
step and get rid of the surrounding union (with uintptr_t) entirely,
as there is currently no use case for this file to access the header of
BlocktableEntry as a uintptr_t, and there are no additional alignment
requirements. The least invasive way seems to be to transfer the old
union name to this struct.

Reported by Pavel Borisov and Andres Freund, per buildfarm member mylodon
Reviewed by Pavel Borisov

Discussion: https://postgr.es/m/CALT9ZEH11NYV8AOzKb1bWhCf6J0H=H31f0MgT9xX+HdqvcA1rw@mail.gmail.com

libpq error message fixes

Remove stray paren, capitalize SSL and ALPN.

Author: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/20240409.104613.1653854506705708036 [email protected]

Fix typo in docs

Author: Erik Rijkers
Discussion: https://www.postgresql.org/message-id/0167b1e1-676c-66ba-e857-3ad7cd84404f@xs4all.nl

Add missing set_pglocale_pgservice() for pg_walsummary and pg_combinebackup

These calls are required to make both tools work with NLS.

Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20240408.162702.183779935636035593 [email protected]

injection_points: Fix race condition with local injection point tests

The module relies on a shmem exit callback to clean up any injection
points linked to a specific process. One of the tests checks for the
case of an injection point name reused in a second connection where the
first connection should clean it up, but it did not count for the fact
that the shmem exit callback of the first connection may not have run
when the second connection begins its work.

The regress library includes a wait_pid() that can be used for this
purpose, instead of a custom wait logic, so let's rely on it to wait for
the first connection to exit before working with the second connection.
The module gains a REGRESS_OPTS to be able to look at the regress
library's dlpath.

This issue could be reproduced with a hardcoded sleep() in the shmem
exit callback, and the CI has been able to trigger it sporadically.

Oversight in f587338dec87.

Reported-by: Bharath Rupireddy
Reviewed-by: Andrey Borodin
Discussion: https://postgr.es/m/[email protected]

In psql, avoid leaking a PGresult after a query is cancelled.

After a query cancel, the tail end of ExecQueryAndProcessResults
took care to clear any not-yet-read PGresults; but it forgot about
the one it has already read.  There would only be such a result
when handling a multi-command string made with "\;", so that you'd
have to cancel an earlier command in such a string to reach the
bug at all.  Even then, there would only be leakage of a single
PGresult per cancel, so it's not surprising nobody noticed this.
But a leak is a leak.

Noted while re-reviewing 90f517821, but this is independent of that:
it dates to 7844c9918.  Back-patch to v15 where that came in.

Further review for re-implementation of psql's FETCH_COUNT feature.

Alexander Lakhin noted an obsolete comment, which led me to revisit
some other important comments in the patch, and that study turned up a
couple of unintended ways in which the chunked-fetch code path didn't
match the normal code path in ExecQueryAndProcessResults. The only
nontrivial problem is that it didn't call PrintQueryStatus, so that
we'd not print the final status result from DML ... RETURNING
commands. To avoid code duplication, move the filter for whether a
result is from RETURNING from PrintQueryResult to PrintQueryStatus.

Discussion: https://postgr.es/m/0023bea5-79c0-476e-96c8-dad599cc3ad8@gmail.com

Teach radix tree to embed values at runtime

Previously, the decision to store values in leaves or within the child
pointer was made at compile time, with variable length values using
leaves by necessity. This commit allows introspecting the length of
variable length values at runtime for that decision. This requires
the ability to tell whether the last-level child pointer is actually
a value, so we use a pointer tag in the lowest level bit.

Use this in TID store. This entails adding a byte to the header to
reserve space for the tag. Commit f35bd9bf3 stores up to three offsets
within the header with no bitmap, and now the header can be embedded
as above. This reduces worst-case memory usage when TIDs are sparse.

Reviewed (in an earlier version) by Masahiko Sawada

Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com
Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com

Teach TID store to skip bitmap for small numbers of offsets

The header portion of BlocktableEntry has enough padding space for
an array of 3 offsets (1 on 32-bit platforms). Use this space instead
of having a sparse bitmap array. This will take up a constant amount
of space no matter what the offsets are.

Reviewed (in an earlier version) by Masahiko Sawada

Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com
Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com

Provide a way block-level table AMs could re-use acquire_sample_rows()

While keeping API the same, this commit provides a way for block-level table
AMs to re-use existing acquire_sample_rows() by providing custom callbacks
for getting the next block and the next tuple.

Reported-by: Andres Freund
Discussion: https://postgr.es/m/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de
Reviewed-by: Pavel Borisov

Fix some grammer errors from error messages and codes comments

Discussion: https://postgr.es/m/CAHewXNkGMPU50QG7V6Q60JGFORfo8LfYO1_GCkCa0VWbmB-fEw%40mail.gmail.com
Author: Tender Wang

Fill CommonRdOptions with default values in extract_autovac_opts()

Reported-by: Thomas Munro
Reported-by: Pavel Borisov
Discussion: https://postgr.es/m/CA%2BhUKGLZzLR50RBvuqOO3MZ%3DF54ETz-rTp1PDX9uDGP_GqyYqA%40mail.gmail.com

Adjust wording of trace_connection_negotiation GUC's description

We're not very consistent about this across all the GUCs, but the
"Logs ..." phrasing is more common than "Log ...", and is used by the
neighboring "log_connections" and "log_disconnections" GUCs, so switch
to that.

Author: Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/20240408.154010.1170771365226258348 [email protected]

Custom reloptions for table AM

Let table AM define custom reloptions for its tables. This allows specifying
AM-specific parameters by the WITH clause when creating a table.

The reloptions, which could be used outside of table AM, are now extracted
into the CommonRdOptions data structure. These options could be by decision
of table AM directly specified by a user or calculated in some way.

The new test module test_tam_options evaluates the ability to set up custom
reloptions and calculate fields of CommonRdOptions on their base.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis

Fix the intermittent buildfarm failures in 040_standby_failover_slots_sync.

It is possible that even if the primary waits for the subscriber to catch
up and then disables the subscription, the XLOG_RUNNING_XACTS record gets
inserted between the two steps by bgwriter and walsender processes it.
This can move the restart_lsn of the corresponding slot in an
unpredictable way which further leads to slot sync failure.

To ensure predictable behaviour, we drop the subscription and manually
create the slot before the test. The other idea we discussed to write a
predictable test is to use injection points to control the bgwriter
logging XLOG_RUNNING_XACTS but that needs more analysis. We can add a
separate test using injection points.

Per buildfarm

Author: Hou Zhijie
Reviewed-by: Amit Kapila, Shveta Malik
Discussion: https://postgr.es/m/CAA4eK1JD8h_XLRsK_o_Xh=5MhTzm+6d4Cb4_uPgFJ2wSQDah=g@mail.gmail.com

Use bump context for TID bitmaps stored by vacuum

Vacuum does not pfree individual entries, and only frees the entire
storage space when finished with it. This allows using a bump context,
eliminating the chunk header in each leaf allocation. Most leaf
allocations will be 16 to 32 bytes, so that's a significant savings.
TidStoreCreateLocal gets a boolean parameter to indicate that the
created store is insert-only.

This requires a separate tree context for iteration, since we free
the iteration state after iteration completes.

Discussion: https://postgr.es/m/CANWCAZac%3DpBePg3rhX8nXkUuaLoiAJJLtmnCfZsPEAS4EtJ%3Dkg%40mail.gmail.com
Discussion: https://postgr.es/m/CANWCAZZQFfxvzO8yZHFWtQV+Z2gAMv1ku16Vu7KWmb5kZQyd1w@mail.gmail.com

JSON_TABLE: Add support for NESTED paths and columns

A NESTED path allows to extract data from nested levels of JSON
objects given by the parent path expression, which are projected as
columns specified using a nested COLUMNS clause, just like the parent
COLUMNS clause. Rows comprised from a NESTED columns are "joined"
to the row comprised from the parent columns. If a particular NESTED
path evaluates to 0 rows, then the nested COLUMNS will emit NULLs,
making it an OUTER join.

NESTED columns themselves may include NESTED paths to allow
extracting data from arbitrary nesting levels, which are likewise
joined against the rows at the parent level.

Multiple NESTED paths at a given level are called "sibling" paths
and their rows are combined by UNIONing them, that is, after being
joined against the parent row as described above.

Author: Nikita Glukhov <[email protected]>
Author: Teodor Sigaev <[email protected]>
Author: Oleg Bartunov <[email protected]>
Author: Alexander Korotkov <[email protected]>
Author: Andrew Dunstan <[email protected]>
Author: Amit Langote <[email protected]>
Author: Jian He <[email protected]>

Reviewers have included (in no particular order):

Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He

Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130 [email protected]
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com

Fix JsonExpr deparsing to emit QUOTES and WRAPPER correctly

Currently, get_json_expr_options() does not emit the default values
for QUOTES (KEEP QUOTES) and WRAPPER (WITHOUT WRAPPER). That causes
the deparsed JSON_TABLE() columns, such as those contained in a a
view's query, to behave differently when executed than the original
definition. That's because the rules encoded in
transformJsonTableColumns() will choose either JSON_VALUE() or
JSON_QUERY() as implementation to execute a given column's path
expression depending on the QUOTES and WRAPPER specificationd and
they have slightly different semantics.

Reported-by: Jian He <[email protected]>
Discussion: https://postgr.es/m/CACJufxEqhqsfrg_p7EMyo5zak3d767iFDL8vz_4%3DZBHpOtrghw%40mail.gmail.com

Fix restriction on specifying KEEP QUOTES in JSON_QUERY()

Currently, transformJsonFuncExpr() enforces some restrictions on
the combinations of QUOTES and WRAPPER clauses that can be specified
in JSON_QUERY(). The intent was to only prevent the useless
combination WITH WRAPPER OMIT QUOTES, but the coding prevented KEEP
QUOTES too, which is not helpful. Fix that.

Fix the wording of or_to_any_transform_limit description

Reported-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20240408.144657.1746688590065601661.horikyota.ntt%40gmail.com

Fix the value of or_to_any_transform_limit in postgresql.conf.sample

Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/ZhM8jH8gsKm5Q-9p%40pryzbyj2023

Remove references to old function name

In a97bbe1f1df I accidentally referenced heapgetpage(), both in a function
name and a comment. But since 44086b09753 the relevant function is named
heap_prepare_pagescan(). Rename the new function to page_collect_tuples().

Reported-by: Melanie Plageman <[email protected]>
Reported-by: David Rowley <[email protected]>
Discussion: https://postgr.es/m/20240407172615 [email protected]
Discussion: https://postgr.es/m/CAApHDvp4SniHopTrVeKWcEvNXFtdki0utAvO=5R7H6TNhtULRQ@mail.gmail.com

Add pg_buffercache_evict() function for testing.

When testing buffer pool logic, it is useful to be able to evict
arbitrary blocks.  This function can be used in SQL queries over the
pg_buffercache view to set up a wide range of buffer pool states.  Of
course, buffer mappings might change concurrently so you might evict a
block other than the one you had in mind, and another session might
bring it back in at any time.  That's OK for the intended purpose of
setting up developer testing scenarios, and more complicated interlocking
schemes to give stronger guararantees about that would likely be less
flexible for actual testing work anyway.  Superuser-only.

Author: Palak Chaturvedi <[email protected]>
Author: Thomas Munro <[email protected]> (docs, small tweaks)
Reviewed-by: Nitin Jadhav <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Cary Huang <[email protected]>
Reviewed-by: Cédric Villemain <[email protected]>
Reviewed-by: Jim Nasby <[email protected]>
Reviewed-by: Maxim Orlov <[email protected]>
Reviewed-by: Thomas Munro <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CALfch19pW48ZwWzUoRSpsaV9hqt0UPyaBPC4bOZ4W+c7FF566A@mail.gmail.com

Fix alignment of stack variable

Declare with union similar to PGAlignedBlock.

Report and fix by Andres Freund

Discussion: https://postgr.es/m/20240407190731.izm3mdazednrsiqk%40awork3.anarazel.de

Add more tab completion support for ALTER DEFAULT PRIVILEGES in psql.

This adds tab completion of "GRANT" and "REVOKE [GRANT OPTION FOR]"
for ALTER DEFAULT PRIVILEGES, and adds "WITH GRANT OPTION" for
ALTER DEFAULT PRIVILEGES ... GRANT ... TO role.

Author: Vignesh C, with cosmetic adjustments by me
Reviewed-by: Shubham Khanna, Masahiko Sawada
Discussion: https://postgr.es/m/CALDaNm1aEdJb-QJi%3DGWStkfj_%2BEDUK_VtDkn%2BTjQ2z7HyU0MBw%40mail.gmail.com

Remove redundant nbtree preprocessing assertions.

One of the assertions was the subject of a false positive complaint from
Coverity, but none of the assertions added much, so get rid of them.

Reported-By: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/3000247.1712537309@sss.pgh.pa.us

simplehash: Free collisions array in SH_STAT

While SH_STAT() is only used for debugging, the allocated array can be large,
and therefore should be freed.

It's unclear why coverity started warning now.

Reported-by: Tom Lane <[email protected]>
Reported-by: Coverity
Discussion: https://postgr.es/m/3005248.1712538233@sss.pgh.pa.us
Backpatch: 12-

Fix check for 'outlen' return from SSL_select_next_proto()

Fixes compiler warning reported by Andres Freund.

Discusssion: https://www.postgresql.org/message-id/20240408015055 [email protected]

Silence perlcritic warnings in new libpq tests

Per buildfarm member 'koel'.

Send ALPN in TLS handshake, require it in direct SSL connections

libpq now always tries to send ALPN. With the traditional negotiated
SSL connections, the server accepts the ALPN, and refuses the
connection if it's not what we expect, but connecting without ALPN is
still OK. With the new direct SSL connections, ALPN is mandatory.

NOTE: This uses "TBD-pgsql" as the protocol ID. We must register a
proper one with IANA before the release!

Author: Greg Stark, Heikki Linnakangas
Reviewed-by: Matthias van de Meent, Jacob Champion

Support TLS handshake directly without SSLRequest negotiation

By skipping SSLRequest, you can eliminate one round-trip when
establishing a TLS connection. It is also more friendly to generic TLS
proxies that don't understand the PostgreSQL protocol.

This is disabled by default in libpq, because the direct TLS handshake
will fail with old server versions. It can be enabled with the
sslnegotation=direct option. It will still fall back to the negotiated
TLS handshake if the server rejects the direct attempt, either because
it is an older version or the server doesn't support TLS at all, but
the fallback can be disabled with the sslnegotiation=requiredirect
option.

Author: Greg Stark, Heikki Linnakangas
Reviewed-by: Matthias van de Meent, Jacob Champion

Refactor libpq state machine for negotiating encryption

This fixes the few corner cases noted in commit 705843d294, as shown
by the changes in the test.

Author: Heikki Linnakangas, Matthias van de Meent
Reviewed-by: Jacob Champion

Use streaming I/O in ANALYZE.

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Jakub Wartak <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com

injection_points: Introduce runtime conditions

This adds a new SQL function injection_points_set_local() that can be
used to force injection points to be run only in the process where they
are attached. This is handy for SQL tests to:
- Detach automatically injection points when the process exits.
- Allow tests with injection points to run concurrently with other test
suites, so as such modules do not have to be marked with
NO_INSTALLCHECK.

Currently, the only condition that can be registered is for a PID.
This could be extended to more kinds later, if required, like database
names/OIDs, roles, or more concepts I did not consider.

Using a single function for SQL scripts is an idea from Heikki
Linnakangas.

Reviewed-by: Andrey Borodin
Discussion: https://postgr.es/m/[email protected]

Enhance libpq encryption negotiation tests with new GUC

The new "log_connection_negotiation" server option causes the server
to print messages to the log when it receives a SSLRequest or
GSSENCRequest packet from the client. Together with "log_connections",
it gives a trace of how a connection and encryption is
negotiatated. Use the option in the libpq_encryption test, to verify
in more detail how libpq negotiates encryption with different
gssencmode and sslmode options.

This revealed a couple of cases where libpq retries encryption or
authentication, when it should already know that it cannot succeed. I
marked them with XXX comments in the test tables. They only happen
when the connection was going to fail anyway, and only with rare
combinations of options, so they're not serious.

Discussion: https://www.postgresql.org/message-id/CAEze2Wja8VUoZygCepwUeiCrWa4jP316k0mvJrOW4PFmWP0Tcw@mail.gmail.com

With gssencmode='require', check credential cache before connecting

Previously, libpq would establish the TCP connection, and then
immediately disconnect if the credentials were not available. The
same thing happened if you tried to use a Unix domain socket with
gssencmode=require. Check those conditions before establishing the TCP
connection.

This is a very minor issue, but my motivation to do this now is that
I'm about to add more detail to the tests for encryption negotiation.
This makes the case of gssencmode=require but no credentials
configured fail at the same stage as with gssencmode=require and
GSSAPI support not compiled at all. That avoids having to deal with
variations in expected output depending on build options.

Discussion: https://www.postgresql.org/message-id/CAEze2Wja8VUoZygCepwUeiCrWa4jP316k0mvJrOW4PFmWP0Tcw@mail.gmail.com

Add tests for libpq gssencmode and sslmode options

Test all combinations of gssencmode, sslmode, whether the server
supports SSL and/or GSSAPI encryption, and whether they are accepted
by pg_hba.conf. This is in preparation for refactoring that code in
libpq, and for adding a new option for "direct SSL" connections, which
adds another dimension to the logic.

If we add even more options in the future, testing all combinations
will become unwieldy and we'll need to rethink this, but for now an
exhaustive test is nice.

Author: Heikki Linnakangas, Matthias van de Meent
Reviewed-by: Jacob Champion
Discussion: https://www.postgresql.org/message-id/a3af4070-3556-461d-aec8-a8d794f94894@iki.fi

Move Kerberos module

So that we can reuse it in new tests.

Discussion: https://www.postgresql.org/message-id/a3af4070-3556-461d-aec8-a8d794f94894@iki.fi
Reviewed-by: Jacob Champion, Matthias van de Meent

Make GIN test using injection points repeatable

As written, the test would fail when run repeatedly because one of the
injection points attached was not detached. This would not matter if
the test is rewritten to be concurrently safe, but let's be clean and
it is a good practice.

Oversight in 6a1ea02c491d.

Discussion: https://postgr.es/m/[email protected]

Fix incorrect KeeperBlock macro in bump.c

The macro was missing a MAXALIGN around the sizeof(BumpContext) which
would cause problems detecting the keeper block on 32-bit systems that
have a MAXALIGN value of 8.

Thank you to Andres Freund, Tomas Vondra and Tom Lane for investigating
and testing.

Reported-by: Melanie Plageman, Tomas Vondra
Discussion: https://postgr.es/m/CAAKRu_Y6dZjiJEZghgNZp0Gjar1JVq-CH7XGDqExDVHnPgDjuw@mail.gmail.com
Discussion: https://postgr.es/m/a4a10b89-6ba8-4abd-b449-019aafff04fc@enterprisedb.com

Fix usage of same ListCell transform_or_to_any()'s in nested loops

Discussion: https://postgr.es/m/CAAKRu_b4SXNW4GAM0bv3e6wcL5ODSXg1ZdRCn6uyLLjSPbveBg%40mail.gmail.com
Author: Melanie Plageman

Transform OR clauses to ANY expression

Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
on the preliminary stage of optimization when we are still working with the
expression tree.

Here Cn is a n-th constant expression, 'expr' is non-constant expression, 'op'
is an operator which returns boolean result and has a commuter (for the case
of reverse order of constant and non-constant parts of the expression,
like 'Cn op expr').

Sometimes it can lead to not optimal plan. This is why there is a
or_to_any_transform_limit GUC. It specifies a threshold value of length of
arguments in an OR expression that triggers the OR-to-ANY transformation.
Generally, more groupable OR arguments mean that transformation will be more
likely to win than to lose.

Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru
Author: Alena Rybakina <[email protected]>
Author: Andrey Lepikhov <[email protected]>
Reviewed-by: Peter Geoghegan <[email protected]>
Reviewed-by: Ranier Vilela <[email protected]>
Reviewed-by: Alexander Korotkov <[email protected]>
Reviewed-by: Robert Haas <[email protected]>
Reviewed-by: Jian He <[email protected]>

Change debug printing to log filename

When restarting the cluster fails the code introduced in 33774978c78
printed the full log contents to aid debugging. For cases when the
logfile is large this adds unnecessary overhead. Reduce to printing
the logfile path instead.

Reported-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20240406214439 [email protected]

Doc: clarify behavior of boolean options in replication protocol commands.

Same idea as ec7e053a9, but applying to the walsender commands
described in protocol.sgml.

Peter Smith

Discussion: https://postgr.es/m/CAHut+PvwjZfdGt2R8HTXgSZft=jZKymrS8KUg31pS7zqaaWKKw@mail.gmail.com

Remove useless duplicate call of defGetBoolean().

Seems to be a copy-and-paste error dating to dc2123400.
Noted while reviewing a related documentation patch.

Doc: show how to get the equivalent of LIMIT for UPDATE/DELETE.

Add examples showing use of a CTE and a self-join to perform
partial UPDATEs and DELETEs.

Corey Huinker, reviewed by Laurenz Albe

Discussion: https://postgr.es/m/CADkLM=caNEQsUwPWnfi2jR4ix99E0EJM_3jtcE-YjnEQC7Rssw@mail.gmail.com

Doc: update documentation about EXCLUDE constraint elements.

What the documentation calls an exclude_element is an index_elem
according to gram.y, and it allows all the same options that
a CREATE INDEX column specification does. The COLLATE patch
neglected to update the CREATE/ALTER TABLE docs about that,
and later the opclass-parameters patch made the same oversight.
Add those options to the syntax synopses, and polish the
associated text a bit.

Back-patch to v13 where opclass parameters came in. We could
update v12 with just the COLLATE omission, but it doesn't quite
seem worth the trouble at this point.

Shihao Zhong, reviewed by Daniel Vérité, Shubham Khanna and myself

Discussion: https://postgr.es/m/CAGRkXqShbVyB8E3gapfdtuwiWTiK=Q67Qb9qwxu=+-w0w46EBA@mail.gmail.com

Use conditional variable to wait for next MultiXact offset

In one multixact.c edge case, we need a mechanism to wait for one
multixact offset to be written before being allowed to read the next
one. We used to handle this case by sleeping for one millisecond and
retrying, but such sleeps have been reported as problematic in
production cases. We can avoid the problem by using a condition
variable: readers sleep on it and then every creator of multixacts
broadcasts into the CV when creation is sufficiently far along.

Author: Kyotaro Horiguchi <[email protected]>
Reviewed-by: Andrey Borodin <[email protected]>
Discussion: https://postgr.es/m/47A598F4-B4E7-4029-8FEC-A06A6C3CB4B5@yandex-team.ru
Discussion: https://postgr.es/m/20200515.090333.24867479329066911.horikyota.ntt

Avoid extra lookups with nbtree array inequalities.

nbtree index scans with SAOP inequalities (but no SAOP equalities)
performed extra ORDER proc lookups for any remaining equality strategy
scan keys. This could waste cycles, and caused assertion failures.
Keeping around a separate ORDER proc is only necessary for a scan's
non-array/non-SAOP equality scan keys when the scan has at least one
other SAOP equality strategy key (a SAOP inequality shouldn't count).

To fix, replace _bt_preprocess_array_keys_final's assertion with a test
that makes the function return early when the scan has no SAOP equality
scan keys.

Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp
execution.

Reported-By: Alexander Lakhin <[email protected]>
Discussion: https://postgr.es/m/0539d3d3-a402-0a49-ed5e-26429dffc4bd@gmail.com

Don't clobber test exit code at cleanup in LDAP/Kerberors tests

If the test script die()d before running the first test, the whole test
was interpreted as SKIPped rather than failed. The PostgreSQL::Cluster
module got this right.

Backpatch to all supported versions.

Discussion: https://www.postgresql.org/message-id/fb898a70-3a88-4629-88e9-f2375020061d@iki.fi

Improve check in LDAP test to find the OpenLDAP installation

If the OpenLDAP installation directory is not found, set $setup to 0
so that the LDAP tests are skipped. The macOS checks were already
doing that, but the checks on other OS's were not. While we're at it,
improve the error message when the tests are skipped, to specify
whether the OS is supported at all, or if we just didn't find the
installation directory.

This was accidentally "working" without this, i.e. we were skipping
the tests if the OpenLDAP installation was not found, because of a bug
in the LdapServer test module: the END block clobbered the exit code
so if the script die()s before running the first subtest, the whole
test script was marked as SKIPped. The next commit will fix that bug,
but we need to fix the setup code first.

These checks should probably go into configure/meson, but this is
better than nothing and allows fixing the bug in the END block.

Backpatch to all supported versions.

Discussion: https://www.postgresql.org/message-id/fb898a70-3a88-4629-88e9-f2375020061d@iki.fi

Use streaming I/O in sequential scans.

Instead of calling ReadBuffer() for each block, heap sequential scans
and TID range scans now use the streaming API introduced in b5a9b18cd0.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com

Use bump memory context for tuplesorts

29f6a959c added a bump allocator type for efficient compact allocations.
Here we make use of this for non-bounded tuplesorts to store tuples.
This is very space efficient when storing narrow tuples due to bump.c
not having chunk headers. This means we can fit more tuples in work_mem
before spilling to disk, or perform an in-memory sort touching fewer
cacheline.

Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com

Add XLogCtl->logInsertResult

This tracks the position of WAL that's been fully copied into WAL
buffers by all processes emitting WAL.  (For some reason we call that
"WAL insertion").  This is updated using atomic monotonic advance during
WaitXLogInsertionsToFinish, which is not when the insertions actually
occur, but it's the only place where we know where have all the
insertions have completed.

This value is useful in WALReadFromBuffers, which can verify that
callers don't try to read past what has been inserted.  (However, more
infrastructure is needed in order to actually use WAL after the flush
point, since it could be lost.)

The value is also useful in WaitXLogInsertionsToFinish() itself, since
we can now exit quickly when all WAL has been already inserted, without
even having to take any locks.

Introduce a bump memory allocator

This introduces a bump MemoryContext type.  The bump context is best
suited for short-lived memory contexts which require only allocations
of memory and never a pfree or repalloc, which are unsupported.

Memory palloc'd into a bump context has no chunk header.  This makes
bump a useful context type when lots of small allocations need to be
done without any need to pfree those allocations.  Allocation sizes are
rounded up to the next MAXALIGN boundary, so with this and no chunk
header, allocations are very compact indeed.

Allocations are also very fast as bump does not check any freelists to
try and make use of previously free'd chunks.  It just checks if there
is enough room on the current block, and if so it bumps the freeptr
beyond this chunk and returns the value that the freeptr was previously
pointing to.  Simple and fast.  A new block is malloc'd when there's not
enough space in the current block.

Code using the bump allocator must take care never to call any functions
which could try to call realloc() (or variants), pfree(),
GetMemoryChunkContext() or GetMemoryChunkSpace() on a bump allocated
chunk.  Due to lack of chunk headers, these operations are unsupported.
To increase the chances of catching such issues, when compiled with
MEMORY_CONTEXT_CHECKING, bump allocated chunks are given a header and
any attempt to perform an unsupported operation will result in an ERROR.
Without MEMORY_CONTEXT_CHECKING, code attempting an unsupported
operation could result in a segfault.

A follow-on commit will implement the first user of bump.

Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com

Enlarge bit-space for MemoryContextMethodID

Reserve 4 bits for MemoryContextMethodID rather than 3. 3 bits did
technically allow a maximum of 8 memory context types, however, we've
opted to reserve some bit patterns which left us with only 4 slots, all
of which were used.

Here we add another bit which frees up 8 slots for future memory context
types.

In passing, adjust the enum names in MemoryContextMethodID to make it
more clear which ones can be used and which ones are reserved.

Author: Matthias van de Meent, David Rowley
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com

Avoid needless large memcpys in libpq socket writing

Until now, when calling pq_putmessage to write new data to a libpq
socket, all writes are copied into a buffer and that buffer gets flushed
when full to avoid having to perform small writes to the socket.

There are cases where we must write large amounts of data to the socket,
sometimes larger than the size of the buffer. In this case, it's
wasteful to memcpy this data into the buffer and flush it out, instead,
we can send it directly from the memory location that the data is already
stored in.

Here we adjust internal_putbytes() so that after having just flushed the
buffer to the socket, if the remaining bytes to send is as big or bigger
than the buffer size, we just send directly rather than needlessly
copying into the PqSendBuffer buffer first.

Examples of operations that write large amounts of data in one message
are; outputting large tuples with SELECT or COPY TO STDOUT and
pg_basebackup.

Author: Melih Mutlu
Reviewed-by: Heikki Linnakangas
Reviewed-by: Jelte Fennema-Nio
Reviewed-by: David Rowley
Reviewed-by: Ranier Vilela
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CAGPVpCR15nosj0f6xe-c2h477zFR88q12e6WjEoEZc8ZYkTh3Q@mail.gmail.com

Reduce branches in heapgetpage()'s per-tuple loop

Until now, heapgetpage()'s loop over all tuples performed some conditional
checks for each tuple, even though condition did not change across the loop.

This commit fixes that by moving the loop into an inline function. By calling
it with different constant arguments, the compiler can generate an optimized
loop for the different conditions, at the price of two per-page checks.

For cases of all-visible tables and an isolation level other than
serializable, speedups of up to 25% have been measured.

Reviewed-by: John Naylor <[email protected]>
Reviewed-by: Zhang Mingli <[email protected]>
Tested-by: Quan Zongliang <[email protected]>
Discussion: https://postgr.es/m/20230716015656 [email protected]
Discussion: https://postgr.es/m/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce@yeah.net

Optimize visibilitymap_count() with AVX-512 instructions.

Commit 792752af4e added infrastructure for using AVX-512 intrinsic
functions, and this commit uses that infrastructure to optimize
visibilitymap_count(). Specificially, a new pg_popcount_masked()
function is introduced that applies a bitmask to every byte in the
buffer prior to calculating the population count, which is used to
filter out the all-visible or all-frozen bits as needed. Platforms
without AVX-512 support should also see a nice speedup due to the
reduced number of calls to a function pointer.

Co-authored-by: Ants Aasma
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com

Optimize pg_popcount() with AVX-512 instructions.

Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible.  Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers.  This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions.  If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().

Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16.  The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.

Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com

Fix if/while thinko in read_stream.c edge case.

When we determine that a wanted block can't be combined with the current
pending read, it's time to start that read to get it out of the way.  An
"if" in that code path should have been a "while", because it might take
more than one go in case of partial reads.  This was only broken for
smaller ranges, as the more common case of io_combine_limit-sized ranges
is handled earlier in the code and knows how to loop, hiding the bug for
a while.

Discovered while testing large parallel sequential scans of partially
cached tables.  The ramp-up-and-down block allocator for parallel scans
could hit the problem case and skip some blocks near the end that should
have been streamed.

Defect in commit b5a9b18c.

Discussion: https://postgr.es/m/CA%2BhUKG%2Bh8Whpv0YsJqjMVkjYX%2B80fTVc6oi-V%2BzxJvykLpLHYQ%40mail.gmail.com

Disable parallel query in psql error-with-FETCH_COUNT test.

The buildfarm members using debug_parallel_query = regress are mostly
unhappy with this test.  I guess what is happening is that rows
generated by a parallel worker are buffered, and might or might not
get to the leader before the expected error occurs.  We did not see
any variability in the old version of this test because each FETCH
would succeed or fail atomically, leading to a predictable number of
rows emitted before failure.  I don't find this to be a bug, just
unspecified behavior, so let's disable parallel query for this one
test case to make the results stable.

Re-implement psql's FETCH_COUNT feature atop libpq's chunked mode.

Formerly this was done with a cursor, which is problematic since
not all result-set-returning query types can be put into a cursor.
The new implementation is better integrated into other psql
features, too.

Daniel Vérité, reviewed by Laurenz Albe and myself (and whacked
around a bit by me, so any remaining bugs are my fault)

Discussion: https://postgr.es/m/CAKZiRmxsVTkO928CM+-ADvsMyePmU3L9DQCa9NwqjvLPcEe5QA@mail.gmail.com

Support retrieval of results in chunks with libpq.

This patch generalizes libpq's existing single-row mode to allow
individual partial-result PGresults to contain up to N rows, rather
than always one row.  This reduces malloc overhead compared to plain
single-row mode, and it is very useful for psql's FETCH_COUNT feature,
since otherwise we'd have to add code (and cycles) to either merge
single-row PGresults into a bigger one or teach psql's
results-printing logic to accept arrays of PGresults.

To avoid API breakage, PQsetSingleRowMode() remains the same, and we
add a new function PQsetChunkedRowsMode() to invoke the more general
case.  Also, PGresults obtained the old way continue to carry the
PGRES_SINGLE_TUPLE status code, while if PQsetChunkedRowsMode() is
used then their status code is PGRES_TUPLES_CHUNK.  The underlying
logic is the same either way, though.

Daniel Vérité, reviewed by Laurenz Albe and myself (and whacked
around a bit by me, so any remaining bugs are my fault)

Discussion: https://postgr.es/m/CAKZiRmxsVTkO928CM+-ADvsMyePmU3L9DQCa9NwqjvLPcEe5QA@mail.gmail.com

Change BitmapAdjustPrefetchIterator to accept BlockNumber

BitmapAdjustPrefetchIterator() only used the blockno member of the
passed in TBMIterateResult to ensure that the prefetch iterator and
regular iterator stay in sync. Pass it the BlockNumber only, so that we
can move away from using the TBMIterateResult outside of table AM
specific code.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com

BitmapHeapScan: Use correct recheck flag for skip_fetch

As of 7c70996ebf0949b142a9, BitmapPrefetch() used the recheck flag for
the current block to determine whether or not it should skip prefetching
the proposed prefetch block. As explained in the comment, this assumed
the index AM will report the same recheck value for the future page as
it did for the current page - but there's no guarantee.

This only affects prefetching - if the recheck flag changes, we may
prefetch blocks unecessarily and not prefetch blocks that will be
needed. But we don't need to rely on that assumption - we know the
recheck flag for the block we're considering prefetching, so we can
use that.

The impact is very limited in practice - the opclass would need to
assign different recheck flags to different blocks, but none of the
built-in opclasses seems to do that.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Tom Lane
Discussion: https://postgr.es/m/1939305.1712415547%40sss.pgh.pa.us

BitmapHeapScan: Push skip_fetch optimization into table AM

Commit 7c70996ebf0949b142 introduced an optimization to allow bitmap
scans to operate like index-only scans by not fetching a block from the
heap if none of the underlying data is needed and the block is marked
all visible in the visibility map.

With the introduction of table AMs, a FIXME was added to this code
indicating that the skip_fetch logic should be pushed into the table
AM-specific code, as not all table AMs may use a visibility map in the
same way.

This commit resolves this FIXME for the current block. The layering
violation is still present in BitmapHeapScans's prefetching code, which
uses the visibility map to decide whether or not to prefetch a block.
However, this can be addressed independently.

Author: Melanie Plageman
Reviewed-by: Andres Freund, Heikki Linnakangas, Tomas Vondra, Mark Dilger
Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com

Implement ALTER TABLE ... SPLIT PARTITION ... command

This new DDL command splits a single partition into several parititions.
Just like ALTER TABLE ... MERGE PARTITIONS ... command, new patitions are
created using createPartitionTable() function with parent partition as the
template.

This commit comprises quite naive implementation which works in single process
and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the
operations including the tuple routing. This is why this new DDL command
can't be recommended for large partitioned tables under a high load. However,
this implementation come in handy in certain cases even as is.
Also, it could be used as a foundation for future implementations with lesser
locking and possibly parallel.

Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires

Implement ALTER TABLE ... MERGE PARTITIONS ... command

This new DDL command merges several partitions into the one partition of the
target table.  The target partition is created using new
createPartitionTable() function with parent partition as the template.

This commit comprises quite naive implementation which works in single process
and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the
operations including the tuple routing.  This is why this new DDL command
can't be recommended for large partitioned tables under a high load.  However,
this implementation come in handy in certain cases even as is.
Also, it could be used as a foundation for future implementations with lesser
locking and possibly parallel.

Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires

BitmapHeapScan: postpone setting can_skip_fetch

Set BitmapHeapScanState->can_skip_fetch in BitmapHeapNext() instead of
in ExecInitBitmapHeapScan(). This is a preliminary step to pushing the
skip fetch optimization into heap AM code.

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com

Call WaitLSNCleanup() in AbortTransaction()

Even though waiting for replay LSN happens without explicit transaction,
AbortTransaction() is responsible for the cleanup of the shared memory if
the error is thrown in a stored procedure. So, we need to do WaitLSNCleanup()
there to clean up after some unexpected error happened while waiting for
replay LSN.

Discussion: https://postgr.es/m/202404051815.eri4u5q6oj26%40alvherre.pgsql
Author: Alvaro Herrera

Clarify what is protected by WaitLSNLock

Not just WaitLSNState.waitersHeap, but also WaitLSNState.procInfos and
updating of WaitLSNState.minWaitedLSN is protected by WaitLSNLock. There
is one now documented exclusion on fast-path checking of
WaitLSNProcInfo.inHeap flag.

Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql

Use an LWLock instead of a spinlock in waitlsn.c

This should prevent busy-waiting when number of waiting processes is high.

Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
Author: Alvaro Herrera

BitmapHeapScan: begin scan after bitmap creation

Change the order so that the table scan is initialized only after
initializing the index scan and building the bitmap.

This is mostly a cosmetic change for now, but later commits will need
to pass parameters to table_beginscan_bm() that are unavailable in
ExecInitBitmapHeapScan().

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com

Backport IPC::Run optimization to src/test/perl.

This one-liner makes the TAP portion of "make check-world" 7% faster on
a non-Windows machine.

Discussion: https://postgr.es/m/20240331050310 [email protected]

Enhance nbtree ScalarArrayOp execution.

Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively.  This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).

Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient.  All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.

The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value.  Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan.  Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.

Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.

Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute.  These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).

Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes.  They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem).  They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).

There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions.  Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.

Author: Peter Geoghegan <[email protected]>
Author: Matthias van de Meent <[email protected]>
Reviewed-By: Heikki Linnakangas <[email protected]>
Reviewed-By: Matthias van de Meent <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com

Remove obsolete comment in CopyReadLineText().

When this bit of commentary was written, it was alluding to the
fact that we looked for newlines and EOD markers in the raw
(not yet encoding-converted) input data. We don't do that anymore,
preferring to batch the conversion of larger chunks of input and
split it into lines later. Hence there's no longer any need for
assumptions about the relevant characters being encoding-invariant,
and we should remove this comment saying we assume that.

Discussion: https://postgr.es/m/1461688.1712347668@sss.pgh.pa.us

Speed up tail processing when hashing aligned C strings, take two

After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated
the terminator's position and re-loaded the remaining bytes from
the input string. This was slower than the unaligned case for very
short strings. We already have all the data we need in a register,
so let's just mask off the bytes we need and hash them immediately.

In addition to endianness issues, the previous attempt upset valgrind
in the way it computed the mask. Whether by accident or by wisdom,
the author's proposed method passes locally with valgrind 3.22.

Ants Aasma, with cosmetic adjustments by me

Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com

Teach fasthash_accum to use platform endianness for bytewise loads

This function previously used a mix of word-wise loads and bytewise
loads. The bytewise loads happened to be little-endian regardless of
platform. This in itself is not a problem. However, a future commit
will require the same result whether A) the input is loaded as a
word with the relevent bytes masked-off, or B) the input is loaded
one byte at a time.

While at it, improve debuggability of the internal hash state.

Discussion: https://postgr.es/m/CANWCAZZpuV1mES1mtSpAq8tWJewbrv4gEz6R_k4gzNG8GZ5gag%40mail.gmail.com

Increase default vacuum_buffer_usage_limit to 2MB.

The BAS_VACUUM ring size has been 256kB since commit d526575f introduced
the mechanism 17 years ago.  Commit 1cbbee03 recently made it
configurable but retained the traditional default.  The correct default
size has been debated for years, but 256kB is certainly very small.
VACUUM soon needs to write back data it dirtied only 32 blocks ago,
which usually requires flushing the WAL.  New experiments in prefetching
pages for VACUUM exacerbated the problem by crashing into dirty data
even sooner.  Let's make the default 2MB.  That's 1.6% of the default
toy buffer pool size, and 0.2% of 1GB, which would be a considered a
small shared_buffers setting for a real system these days.  Users are
still free to set the GUC to a different value.

Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de
Discussion: https://postgr.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com

Allow BufferAccessStrategy to limit pin count.

While pinning extra buffers to look ahead, users of strategies are in
danger of using too many buffers. For some strategies, that means
"escaping" from the ring, and in others it means forcing dirty data to
disk very frequently with associated WAL flushing. Since external code
has no insight into any of that, allow individual strategy types to
expose a clamp that should be applied when deciding how many buffers to
pin at once.

Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com

Convert uses of hash_string_pointer to fasthash equivalent

Remove duplicate hash_string_pointer() function definitions by creating
a new inline function hash_string() for this purpose.

This has the added advantage of avoiding strlen() calls when doing hash
lookup. It's not clear how many of these are perfomance-sensitive
enough to benefit from that, but the simplification is worth it on
its own.

Reviewed by Jeff Davis

Discussion: https://postgr.es/m/CANWCAZbg_XeSeY0a_PqWmWqeRATvzTzUNYRLeT%2Bbzs%2BYQdC92g%40mail.gmail.com