Strange assertion in procarray.c

Lists: pgsql-hackers
From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Strange assertion in procarray.c
Date: 2024-11-25 19:38:00
Message-ID: CANtu0oiTgFW47QgpTwrMOVm3Bq4N0Y5bjvTy5sP0gYWLQuVgjw@mail.gmail.com

Hello, everyone!

While working on stabilizing the tests for [0], I noticed a strange
assertion failure in procarray.c [1].

It looks like this:

[562801][client backend] [isolation/two-ids/s2][50/131:536136] ERROR: could not serialize access due to read/write dependencies among transactions
[562801][client backend] [isolation/two-ids/s2][50/131:536136] DETAIL: Reason code: Canceled on identification as a pivot, during commit attempt.
[562801][client backend] [isolation/two-ids/s2][50/131:536136] HINT: The transaction might succeed if retried.
[562801][client backend] [isolation/two-ids/s2][50/131:536136] STATEMENT: COMMIT;
[562801][client backend] [isolation/two-ids/s2][50/0:536136] ERROR: ResourceOwnerEnlarge called after release started
[562801][client backend] [isolation/two-ids/s2][50/0:536136] WARNING: AbortTransaction while in ABORT state
TRAP: failed Assert("TransactionIdIsValid(proc->xid)"), File: "../src/backend/storage/ipc/procarray.c", Line: 677, PID: 562801
[562819][client backend] [pg_regress/test_parser][:0] LOG: disconnection: session time: 0:00:00.011 user=someone database=regression_test_parser host=[local]
postgres: someone isolation_regression [local] COMMIT(ExceptionalCondition+0xbe)[0x55f2a101f185]
postgres: someone isolation_regression [local] COMMIT(ProcArrayEndTransaction+0x46)[0x55f2a0ddf7b3]
postgres: someone isolation_regression [local] COMMIT(+0x1e29b1)[0x55f2a09e59b1]
postgres: someone isolation_regression [local] COMMIT(+0x1e347b)[0x55f2a09e647b]
postgres: someone isolation_regression [local] COMMIT(AbortCurrentTransaction+0xe)[0x55f2a09e63a3]
postgres: someone isolation_regression [local] COMMIT(PostgresMain+0x538)[0x55f2a0e20ff1]
postgres: someone isolation_regression [local] COMMIT(+0x61457b)[0x55f2a0e1757b]
postgres: someone isolation_regression [local] COMMIT(postmaster_child_launch+0x137)[0x55f2a0d295bf]
postgres: someone isolation_regression [local] COMMIT(+0x52cff5)[0x55f2a0d2fff5]
postgres: someone isolation_regression [local] COMMIT(+0x52a6cd)[0x55f2a0d2d6cd]
postgres: someone isolation_regression [local] COMMIT(PostmasterMain+0x1629)[0x55f2a0d2cfae]
postgres: someone isolation_regression [local] COMMIT(+0x404ba2)[0x55f2a0c07ba2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7f6afbe7e1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7f6afbe7e28b]
postgres: someone isolation_regression [local] COMMIT(_start+0x25)[0x55f2a08e3ab5]

I made a reproducer for that. Ignore index_concurrently_upsert; it is
expected to fail. Some source files are also changed, but only to add
injection points. In several runs of "meson test --print-errorlogs
--num-processes=8 --setup running", the backend crashes. I was unable to
reproduce it during "non-running" tests.

A full backend log for the crash run is attached.

Here are some commands that help to reproduce it locally:

cd build
ninja && meson test --suite setup
cd ../
export LD_LIBRARY_PATH="$(pwd)/build/tmp_install/usr/local/pgsql/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH"
build/tmp_install/usr/local/pgsql/bin/initdb -N build/runningcheck --no-instructions -A trust
echo "include '$(pwd)/src/tools/ci/pg_ci_base.conf'" >> build/runningcheck/postgresql.conf
build/tmp_install/usr/local/pgsql/bin/pg_ctl -c -o '-c fsync=off' -D build/runningcheck -l build/testrun/runningcheck.log start
cd build
meson test --print-errorlogs --num-processes=8 --setup running

Best regards,
Mikhail.

[0]: https://commitfest.postgresql.org/50/5160/
[1]: https://github.com/postgres/postgres/blob/478846e7688c9ab73d2695a66822e9ae0574b551/src/backend/storage/ipc/procarray.c#L677

Attachment Content-Type Size
v1-0001-meson-test-print-errorlogs-num-processes-8-setup-.patch text/plain 12.5 KB
runningcheck.log application/octet-stream 1.7 MB

From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, nathan(at)postgresql(dot)org, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: Strange assertion in procarray.c
Date: 2024-11-27 02:27:00
Message-ID: CANtu0ojbx6=esP8euQgzD1CN6tigTQvDmupwEmLTHZT=6_yx_A@mail.gmail.com

Hello, Nathan and Michael!

I believe I’ve identified the root cause of the issue. It appears to be
related to GetNamedDSMSegment and the injection_points module.

Here’s how it happens:

* A backend attaches (even locally!) to an injection point, which might get
triggered during resource release (ResourceOwnerRelease).
* Another backend attempts to release the same resource (e.g., by aborting
a transaction) and triggers the injection point.
* This leads to a call to GetNamedDSMSegment, as it’s the first time this
backend interacts with injection points.
* Consequently, an assertion failure occurs in ResourceOwnerEnlarge because
the backend is in the process of releasing all resources.
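
To make the sequence above concrete, here is a minimal toy model in C. It is not PostgreSQL code: every name (ToyResourceOwner, toy_owner_remember, and so on) is hypothetical, and the only behavior taken from this thread is that remembering a resource is refused once release has started, and that the first use of injection points in a backend has to attach to shared memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a resource owner: only the fields this sketch needs. */
typedef struct ToyResourceOwner
{
    bool releasing;     /* set once release has started */
    int  nremembered;   /* number of resources currently tracked */
} ToyResourceOwner;

/* Toy per-backend state (hypothetical, for illustration only). */
typedef struct ToyBackend
{
    ToyResourceOwner owner;
    bool             shmem_attached;  /* attached to the shared segment? */
} ToyBackend;

/*
 * Remembering a new resource is refused once release has started.
 * Returns false where the real server would raise an ERROR.
 */
bool
toy_owner_remember(ToyResourceOwner *owner)
{
    assert(owner != NULL);
    if (owner->releasing)
        return false;
    owner->nremembered++;
    return true;
}

/*
 * First use of injection points in a backend: attaching to the shared
 * segment needs to remember a resource in the owner.
 */
bool
toy_injection_callback(ToyBackend *backend)
{
    assert(backend != NULL);
    if (!backend->shmem_attached)
    {
        if (!toy_owner_remember(&backend->owner))
            return false;
        backend->shmem_attached = true;
    }
    return true;
}

/* An abort: release starts, then an injection point fires. */
bool
toy_abort_with_injection_point(ToyBackend *backend)
{
    assert(backend != NULL);
    backend->owner.releasing = true;
    return toy_injection_callback(backend);
}
```

In this model, toy_abort_with_injection_point() fails for a backend that never touched injection points before, and succeeds for one that already ran toy_injection_callback() before the abort started.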

I’ve attached a reproducer; this version is much simpler to debug.

So, it looks like we need to provide some option to guarantee that
GetNamedDSMSegment is called early for the injection_points module.

Any other ideas?

Best regards,
Mikhail.

[0]: https://github.com/postgres/postgres/commit/8b2bcf3f287c79eaebf724cba57e5ff664b01e06

Attachment Content-Type Size
v2-0001-Test-to-reproduce-issue-with-crash-caused-by-pass.patch text/plain 4.3 KB

From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, nathan(at)postgresql(dot)org, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: Strange assertion in procarray.c
Date: 2024-11-27 09:50:15
Message-ID: CANtu0oiXq548OM_RqBbDQ0dgqtKLAg=x_Fu4eOGRNW0tBsNj_Q@mail.gmail.com

Hello, again.

> Another backend attempts to release the same resource (e.g., by aborting
> a transaction) and triggers the injection point.

Oh, the output of all those GPT-like correctors needs to be checked carefully :)

Correct version: Another backend attempts to release some resource (e.g.,
by aborting a transaction) and triggers the injection point.

Best regards,
Mikhail.


From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, nathan(at)postgresql(dot)org, Michael Paquier <michael(at)paquier(dot)xyz>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Strange assertion in procarray.c
Date: 2024-11-28 20:04:36
Message-ID: CANtu0oiBAcVnzYYETbWY+2gFXUeAx8BKArjnFco4LeAHfH38Sw@mail.gmail.com

Hello, Heikki, Nathan and Michael!

Oh, please excuse my impudence in bringing you all here, but I finally
found that almost the same issue was already fixed by Heikki [0].

I discovered that a similar issue was previously addressed by Heikki in
commit [0], where installcheck was disabled for injection point tests.
However, in the meson build configuration, this was only applied to
regression tests - the isolation and TAP tests are still running during
installcheck.

As demonstrated in the previously shared reproducer [1], even *local*
injection points can cause backend crashes through unexpected side effects.
Therefore, I propose extending the installcheck disable to cover both TAP
and isolation tests as well.

I've attached a patch implementing these changes.

Best regards,
Mikhail.

[0]: https://github.com/postgres/postgres/commit/e2e3b8ae9ed73fcd3096c5ca93971891a7767388
[1]: https://www.postgresql.org/message-id/flat/CANtu0ojbx6%3DesP8euQgzD1CN6tigTQvDmupwEmLTHZT%3D6_yx_A%40mail.gmail.com#18544d553544da67b4fc1ef764df3c3d

Attachment Content-Type Size
v2-0001-Test-to-reproduce-issue-with-crash-caused-by-pass.patch text/plain 4.3 KB

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, nathan(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Strange assertion in procarray.c
Date: 2024-11-29 05:06:36
Message-ID: [email protected]

On Thu, Nov 28, 2024 at 09:04:36PM +0100, Michail Nikolaev wrote:
> I discovered that a similar issue was previously addressed by Heikki in
> commit [0], where installcheck was disabled for injection point tests.
> However, in the meson build configuration, this was only applied to
> regression tests - the isolation and TAP tests are still running during
> installcheck.

I fail to see how this is related? The original issue was that this
was impossible to run safely concurrently, but now we have the
facilities able to do so. There are a few cases where using a wait
point has limits, for example outside a transaction context for some
of the processes, but that has not really been an issue up to now.

> As demonstrated in the previously shared reproducer [1], even *local*
> injection points can cause backend crashes through unexpected side effects.
> Therefore, I propose extending the installcheck disable to cover both TAP
> and isolation tests as well.
>
> I've attached a patch implementing these changes.

@@ -426,6 +427,7 @@ InvalidateCatalogSnapshot(void)
         pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
         CatalogSnapshot = NULL;
         SnapshotResetXmin();
+        INJECTION_POINT("invalidate_catalog_snapshot_end");
[...]
+step s4_attach_locally { SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait'); }
[...]
#4  0x0000563f09d22f39 in ExceptionalCondition (conditionName=0x563f0a072d00 "TransactionIdIsValid(proc->xid)", fileName=0x563f0a072a4a "procarray.c", lineNumber=677) at assert.c:66
#5  0x0000563f096f0684 in ProcArrayEndTransaction (proc=0x7fbd4d083ac0, latestXid=750) at procarray.c:677
#6  0x0000563f088c54f3 in AbortTransaction () at xact.c:2946
#7  0x0000563f088c758d in AbortCurrentTransactionInternal () at xact.c:3531
#8  0x0000563f088c72a6 in AbortCurrentTransaction () at xact.c:3449
#9  0x0000563f0979c0f7 in PostgresMain (dbname=0x563f0f128100 "isolation_regression", username=0x563f0f1280e8 "ioltas") at postgres.c:4524
#10 0x0000563f0978a5e5 in BackendMain (startup_data=0x7ffdcf50cfa8 "", startup_data_len=4) at backend_startup.c:107
#11 0x0000563f094d8613 in postmaster_child_launch (child_type=B_BACKEND, child_slot=1, startup_data=0x7ffdcf50cfa8 "", startup_data_len=4, client_sock=0x7ffdcf50cfe0)

Isn't that pointing to an actual bug with serializable transactions?

What you are telling here is that there is a race condition where it
is possible to trigger an assertion failure when finishing a session
while another one is waiting on an invalidation, if there's in the mix
a read/write dependency error. Disabling the test hides the problem,
it does not fix it. And we should do the latter, not the former.
--
Michael


From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, nathan(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Strange assertion in procarray.c
Date: 2024-11-29 11:55:00
Message-ID: CANtu0ogaPNFbj=8qQnSW9u2-voVwSfM65WUc7xXxFwr9feGmLw@mail.gmail.com

Hello, Michael!

> I fail to see how this is related? The original issue was that this
> was impossible to run safely concurrently, but now we have the
> facilities able to do so. There are a few cases where using a wait
> point has limits, for example outside a transaction context for some
> of the processes, but that has not really been an issue up to now.

I encountered this issue while trying to stabilize tests for [0].

Tests were crashing during installcheck with the assertion from this thread.
I spent some time identifying the root cause: injection points, and the way
they may affect other tests even when they are set to be executed locally.

In the spec, the backend performs the following:

SELECT injection_points_set_local();
SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');

That's all - we don't even execute any command that would trigger the wait
condition.

Meanwhile, three different backends are attempting to test SERIALIZABLE
isolation without any injection points. Initially, this was a separate
'two-ids' test executed in parallel during installcheck, but I incorporated
it into the reproducer spec for simplicity.

> Isn't that pointing to an actual bug with serializable transactions?

No, let me explain.

> What you are telling here is that there is a race condition where it
> is possible to trigger an assertion failure when finishing a session
> while another one is waiting on an invalidation, if there's in the mix
> a read/write dependency error.

Actually, no backend is waiting for invalidation in this case.

Here's the sequence of events:

* The s4 backend creates a local injection point but performs no further
actions. The injection point is marked as local for that pid.
* Three other backends proceed with their serializable snapshot operations.
* s2 determines it cannot commit and correctly decides to abort the
transaction.
* s2 begins releasing resources:

ResourceOwnerReleaseInternal resowner.c:694 <--- NOTE: After starting the release process, by calling this function, no new resources can be remembered in the resource owner.
ResourceOwnerRelease resowner.c:654
AbortTransaction xact.c:2960
AbortCurrentTransactionInternal xact.c:3531
AbortCurrentTransaction xact.c:3449
PostgresMain postgres.c:4513
BackendMain backend_startup.c:107
postmaster_child_launch launch_backend.c:274
BackendStartup postmaster.c:3377
ServerLoop postmaster.c:1663
PostmasterMain postmaster.c:1361
main main.c:196

* During transaction abort, s2 invalidates its catalog snapshot with this
stack trace:

InvalidateCatalogSnapshot snapmgr.c:430
AtEOXact_Snapshot snapmgr.c:1050
CleanupTransaction xact.c:3016
AbortCurrentTransactionInternal xact.c:3532
AbortCurrentTransaction xact.c:3449
PostgresMain postgres.c:4513
BackendMain backend_startup.c:107
postmaster_child_launch launch_backend.c:274
BackendStartup postmaster.c:3377
ServerLoop postmaster.c:1663
PostmasterMain postmaster.c:1361
main main.c:196

* Consequently, s2 encounters
INJECTION_POINT("invalidate_catalog_snapshot_end");
* Although invalidate_catalog_snapshot_end is set to 'wait' only for s4, s2
enters the injection_wait handler
* Since this is s2's first interaction with injection points (as
injection_points isn't in shared_preload_libraries), it calls
injection_init_shmem
* Here, GetNamedDSMSegment is called - this is new infrastructure for
initializing shared memory for extensions without shared_preload_libraries,
committed by Nathan [1]
* GetNamedDSMSegment attempts to attach to the segment and hits this check
in ResourceOwnerEnlarge, which raises an ERROR:

    if (owner->releasing)
        elog(ERROR, "ResourceOwnerEnlarge called after release started");

ResourceOwnerEnlarge resowner.c:449
dsm_create_descriptor dsm.c:1206
dsm_attach dsm.c:696
dsa_attach dsa.c:519
init_dsm_registry dsm_registry.c:115
GetNamedDSMSegment dsm_registry.c:156
injection_init_shmem injection_points.c:185
injection_wait injection_points.c:277
InjectionPointRun injection_point.c:551
InvalidateCatalogSnapshot snapmgr.c:430
AtEOXact_Snapshot snapmgr.c:1050
CleanupTransaction xact.c:3016
AbortCurrentTransactionInternal xact.c:3532
AbortCurrentTransaction xact.c:3449
PostgresMain postgres.c:4513
BackendMain backend_startup.c:107
postmaster_child_launch launch_backend.c:274
BackendStartup postmaster.c:3377
ServerLoop postmaster.c:1663
PostmasterMain postmaster.c:1361
main main.c:196
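
The point that a "local" injection point still has a side effect in an unrelated backend can be sketched with a toy model in C. All names here are hypothetical. The ordering shown (lazy shared-memory initialization first, the "is this point meant for me?" check second) is my reading of the sequence described above, not a copy of injection_points.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Condition stored with a toy injection point (hypothetical layout). */
typedef struct ToyCondition
{
    bool local;   /* true if the point is restricted to one backend */
    int  pid;     /* the backend that attached it, when local */
} ToyCondition;

/* Per-backend bookkeeping for this sketch. */
typedef struct ToyBackendState
{
    int  pid;
    bool shmem_ready;   /* shared state initialized in this backend? */
    int  init_calls;    /* how many times lazy initialization ran */
    int  waits;         /* how many times this backend would wait */
} ToyBackendState;

/* Stand-in for the lazy shared-memory setup done on first use. */
void
toy_init_shmem(ToyBackendState *backend)
{
    assert(backend != NULL);
    backend->shmem_ready = true;
    backend->init_calls++;
}

/* A local point applies only to the backend that attached it. */
bool
toy_point_allowed(const ToyCondition *cond, const ToyBackendState *backend)
{
    return !cond->local || cond->pid == backend->pid;
}

/*
 * Toy "wait" handler. It runs in whichever backend reaches the injection
 * point. The initialization happens before the locality check, so even a
 * backend that the point is not meant for pays the side effect.
 */
void
toy_injection_wait(const ToyCondition *cond, ToyBackendState *backend)
{
    assert(cond != NULL && backend != NULL);
    if (!backend->shmem_ready)
        toy_init_shmem(backend);          /* side effect in any backend */
    if (!toy_point_allowed(cond, backend))
        return;                           /* not our point: no wait */
    backend->waits++;
}
```

With a point that is local to pid 4, a call from a backend with pid 2 never waits, but it still runs the initialization once. In the real server that initialization is what ends up in ResourceOwnerEnlarge.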

* This error, raised during transaction abort, triggers another abort call
(this could be improved):

ProcArrayEndTransaction procarray.c:677
AbortTransaction xact.c:2946
AbortCurrentTransactionInternal xact.c:3531
AbortCurrentTransaction xact.c:3449
PostgresMain postgres.c:4513 <--------------- exception handler here
BackendMain backend_startup.c:107
postmaster_child_launch launch_backend.c:274
BackendStartup postmaster.c:3377
ServerLoop postmaster.c:1663
PostmasterMain postmaster.c:1361
main main.c:196

This isn't the same abort attempt - it's the second one, which triggers
Assert(TransactionIdIsValid(proc->xid));
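
The double abort can also be sketched as a toy model in C. All names are hypothetical. The model assumes that the first abort already cleared the xid advertised in the proc array while the transaction state still remembers the top-level xid, which matches the backtrace in this thread where ProcArrayEndTransaction is called with latestXid=750 and proc->xid is invalid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_INVALID_XID 0u

/* Toy transaction bookkeeping (hypothetical, for illustration only). */
typedef struct ToyXact
{
    unsigned top_xid;      /* xid remembered by the transaction state */
    unsigned proc_xid;     /* xid advertised in the proc array */
    int      abort_calls;  /* how many times abort was entered */
} ToyXact;

/*
 * Models the asserted precondition: if the caller passes a valid latest
 * xid, the advertised xid must still be valid. Returns false where the
 * real server would trap on the Assert.
 */
bool
toy_proc_end_transaction(ToyXact *xact, unsigned latest_xid)
{
    assert(xact != NULL);
    if (latest_xid != TOY_INVALID_XID)
    {
        if (xact->proc_xid == TOY_INVALID_XID)
            return false;
        xact->proc_xid = TOY_INVALID_XID;
    }
    return true;
}

/* First half of an abort: stop advertising the xid. */
bool
toy_abort_transaction(ToyXact *xact)
{
    assert(xact != NULL);
    xact->abort_calls++;
    return toy_proc_end_transaction(xact, xact->top_xid);
}

/* Second half: reset transaction state, unless an error interrupts it. */
bool
toy_cleanup_transaction(ToyXact *xact, bool cleanup_fails)
{
    assert(xact != NULL);
    if (cleanup_fails)
        return false;
    xact->top_xid = TOY_INVALID_XID;
    return true;
}

/*
 * Abort, then clean up. If cleanup raises an error, the error handler
 * aborts again, and the second abort sees an already-cleared proc xid.
 */
bool
toy_abort_current(ToyXact *xact, bool cleanup_fails)
{
    if (!toy_abort_transaction(xact))
        return false;
    if (toy_cleanup_transaction(xact, cleanup_fails))
        return true;
    return toy_abort_transaction(xact);   /* second abort attempt */
}
```

In this model a clean abort enters toy_abort_transaction() once and succeeds, while an error during cleanup leads to a second entry that fails the precondition.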

In summary:

* Each code component functions correctly in isolation.
* However, when an injection point registered as local by one backend causes
another backend to register resources (and potentially other operations),
it can lead to difficult-to-diagnose issues.

I see several potential solutions:

* Add injection_points to shared_preload_libraries for all tests
* Implement a mechanism to call prev_shmem_startup_hook for libraries
outside shared_preload_libraries
* Modify GetNamedDSMSegment's behavior
* Run all injection_points tests in an isolated environment

In my opinion, the last option seems most appropriate.

Best regards,
Mikhail.

[0]: https://commitfest.postgresql.org/50/5160/
[1]: https://github.com/postgres/postgres/commit/8b2bcf3f287c79eaebf724cba57e5ff664b01e06


From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, nathan(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Strange assertion in procarray.c
Date: 2025-01-12 16:40:09
Message-ID: CANtu0og18wNZ=BsoqUh3Gpaxndn45s6hdgi+nBpiTd5mjaKxeA@mail.gmail.com

Hello, everyone!

Just to clarify: the patch is failing on CFbot [0], but that is by design.
It contains a reproducer which shows how unrelated backends may affect each
other even in the case of **local** injection points, causing the crash.

Best regards,
Michail.

[0]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5160


From: Mihail Nikalayeu <mihailnikalayeu(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, nathan(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Strange assertion in procarray.c
Date: 2025-02-20 16:28:42
Message-ID: CADzfLwUcTjgYy1KW_GDhtv76ZV-ERpV_HYAXN=CwYeqpq=VHvQ@mail.gmail.com

Hello!

Noah, I noticed that you have already disabled runningcheck for the
isolation tests in injection_points [0].
The whole point of the patch here was to make that the default for all
types of injection_points tests.

Seems like we may close this entry.

The attached patch is just a rebased version, posted here for the record.

[0]: https://github.com/postgres/postgres/blob/b3ac4aa83458b1e3cc8299508a8c3e0e1490cb23/src/test/modules/injection_points/meson.build#L52

Attachment Content-Type Size
v3-0001-Test-to-reproduce-issue-with-crash-caused-by-pass.patch text/x-patch 4.4 KB