pg_wait_sampling process blocks "select * FROM pg_wait_sampling_profile;" when a database was dropped from the cluster · Issue #29 · postgrespro/pg_wait_sampling
banlex73 opened this issue:
What: any session that tries to select from pg_wait_sampling_profile is blocked by the pg_wait_sampling collector.
Example:
  pid  | usename  | blocked_by | blocking_duration |               blocked_query
-------+----------+------------+-------------------+-------------------------------------------
 30120 | postgres | {8966}     |          7.279804 | select * FROM pg_wait_sampling_profile ;
select d.datname , l.locktype, l."database", l.transactionid , l."mode", l.pid from pg_catalog.pg_locks l
left join pg_catalog.pg_database d on d."oid" = l."database"
where l."database" is not null and l."database" <>0
and pid=8966;
 datname | locktype |  database  | transactionid |     mode      | pid
---------+----------+------------+---------------+---------------+------
         | userlock | 3398742279 |               | ExclusiveLock | 8966
(1 row)
As you can see, there is no database with oid=3398742279.
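That value is not a database OID at all: 3398742279 is 0xCA94B107, which looks like a magic constant the extension packs into the first field of a userlock lock tag, and pg_locks reports that field in the "database" column for userlocks, which is why the join against pg_database finds nothing. A minimal C sketch of how such a tag is typically built (the constant and function names below are assumptions for illustration, not the extension's actual source):
#include "postgres.h"
#include "storage/lock.h"

/*
 * Illustration only: an extension-defined userlock carries an arbitrary
 * 32-bit value in locktag_field1. pg_locks has no better interpretation
 * for that field, so it shows up in the "database" column -- here as
 * 0xCA94B107 = 3398742279, which matches no real database OID.
 */
#define PGWS_EXAMPLE_MAGIC 0xCA94B107

void
example_init_lock_tag(LOCKTAG *tag)
{
	tag->locktag_field1 = PGWS_EXAMPLE_MAGIC;	/* shown as "database" */
	tag->locktag_field2 = 0;
	tag->locktag_field3 = 0;
	tag->locktag_field4 = 0;
	tag->locktag_type = LOCKTAG_USERLOCK;
	tag->locktag_lockmethodid = USER_LOCKMETHOD;
}
If that reading is correct, the ExclusiveLock above is the collector's internal coordination lock rather than a lock on any database object.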
Environment:
pg_wait_sampling version 1.1
PostgreSQL 12.3
CentOS 7
Please let me know what else I need to share with you to help confirm and fix this (if it is a bug).
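For reference, output shaped like the blocked_by listing at the top of this report is usually produced by a monitoring query along the lines of the following sketch (built on pg_blocking_pids(); the column names and exact shape are assumptions, not necessarily the query the reporter used):
SELECT a.pid,
       a.usename,
       pg_blocking_pids(a.pid)                   AS blocked_by,
       extract(epoch FROM now() - a.query_start) AS blocking_duration,
       a.query                                   AS blocked_query
FROM   pg_catalog.pg_stat_activity a
WHERE  cardinality(pg_blocking_pids(a.pid)) > 0;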
PavelSorokin commented on Nov 6, 2020
Good day, I have run into the same problem with PostgresPro Standard 11.9 on RH7 (on two different servers), pg_wait_sampling version 1.1: all queries from mamonsu get blocked.
ololobus commented on Nov 17, 2020
Hi, I am not sure whether it is a bug or a problem with the initial architecture.
@banlex73, the non-existent database Oid looks weird, of course. Is it blocked forever by the collector, or is it still possible to select from pg_wait_sampling_profile from time to time?
@PavelSorokin, if you have an active PostgresPro support contract, you can contact them as well.
Anyway, we will try to look at this issue closer.
banlex73 commented on Nov 18, 2020
Hi Alexey @ololobus, I've just checked one of my clusters and can confirm that the LOCK is PERMANENT. I tried select * from pg_wait_sampling_profile and it was blocked by the collector.
Hope it helps.
PS: Please feel free to contact me if you need anything else to investigate this issue.
ololobus commented on Nov 30, 2020
I have tried today to reproduce this issue with pg_wait_sampling at its stable branch state on REL_12_STABLE and cannot reproduce it.
What I have tried so far is to repeat the following steps (~a couple of dozen times):
1. Put pg_wait_sampling into shared_preload_libraries and create the extension.
2. Do select * from pg_wait_sampling_profile;.
3. Create an extra test database and the pg_wait_sampling extension there.
4. Do select * from pg_wait_sampling_profile; again, with or without additional payload (checkpoints, create table, etc.).
5. Drop the test database.
6. Do select * from pg_wait_sampling_profile; a couple of times again. IIUC, it should get stuck, but it works for me.
Am I missing something? @banlex73, maybe you (or @PavelSorokin) can provide more specific steps so I can reproduce this issue?
P.S. I accidentally closed this issue while typing this comment with some weird keyboard shortcut, but I reopened it immediately, do not worry.
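For concreteness, one pass over the steps above in psql might look like the following sketch (database and table names are illustrative; shared_preload_libraries = 'pg_wait_sampling' is assumed to be configured and the server restarted beforehand):
-- baseline profile read in the original database
CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;
SELECT * FROM pg_wait_sampling_profile;
-- create an extra test database and the extension there
CREATE DATABASE ws_test;
\c ws_test
CREATE EXTENSION pg_wait_sampling;
-- some additional payload, then read the profile again
CHECKPOINT;
CREATE TABLE t (i int);
SELECT * FROM pg_wait_sampling_profile;
-- drop the test database and read the profile a couple more times
\c postgres
DROP DATABASE ws_test;
SELECT * FROM pg_wait_sampling_profile;
SELECT * FROM pg_wait_sampling_profile;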
banlex73 commented on Dec 1, 2020
Alexey, thank you for trying.
What I can do is set up the environment and try to reproduce this issue. ETA: the next couple of days.
banlex73 commented on Dec 9, 2020
Quick update: I have been running a script in a loop for 5 days already to reproduce this issue, with no luck.
What I am doing, in a loop:
create a database
generate load (bulk insert into a table, several heavy selects)
drop the database
Meanwhile, another session periodically tries to select data from pg_wait_sampling_profile.
Everything works fine so far, nothing gets blocked.
I will run my test for 2 more days and then give up.
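A single iteration of such a loop could look like the following psql script (a sketch with assumed object names and sizes, not the reporter's actual script); a shell loop would run it repeatedly while a second session polls the profile view:
-- one iteration: create a database, generate load, drop it again
CREATE DATABASE ws_load_test;
\c ws_load_test
CREATE TABLE bulk_data AS
    SELECT g AS id, md5(g::text) AS payload
    FROM generate_series(1, 1000000) g;                        -- bulk insert
SELECT count(*) FROM bulk_data a JOIN bulk_data b USING (id);  -- heavy select
SELECT payload, count(*) FROM bulk_data GROUP BY payload ORDER BY 2 DESC LIMIT 10;
\c postgres
DROP DATABASE ws_load_test;
-- meanwhile, in a separate session:
--   SELECT * FROM pg_wait_sampling_profile; \watch 5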
marco44 commented on Dec 10, 2020
Hi, I think we're having the exact same problem: we're waiting on a lock and cannot use the view pg_wait_sampling_profile anymore. We cannot use pg_wait_sampling_history either; pg_wait_sampling_current is still accessible. Here is the lock we are stuck on:
There is no link with drop database though, as we don't drop databases. Tell me if you'd rather have me open another issue. For now, we have a stack backtrace from when this occurs:
(gdb) bt
#0 0x00007fbe92b9d7b7 in epoll_wait (epfd=3, events=0x55f4a34f8d00, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x000055f4a2cda169 in WaitEventSetWait ()
#2 0x000055f4a2cda609 in WaitLatchOrSocket ()
#3 0x000055f4a2cdf8bf in ?? ()
#4 0x000055f4a2cdfc10 in shm_mq_sendv ()
#5 0x000055f4a2cdfdb0 in shm_mq_send ()
#6 0x00007fbe8ff1c51f in send_profile (mqh=0x55f4a34fae70, profile_hash=0x55f4a3554f70) at collector.c:266
#7 collector_main (main_arg=<optimized out>) at collector.c:459
#8 0x000055f4a2c74fbe in StartBackgroundWorker ()
#9 0x000055f4a2c82328 in ?? ()
#10 0x000055f4a2c8261f in ?? ()
#11 <signal handler called>
#12 0x00007fbe92b94ff7 in __GI___select (nfds=6, readfds=0x7ffc19870b00, writefds=0x0, exceptfds=0x0, timeout=0x7ffc19870a60) at ../sysdeps/unix/sysv/linux/select.c:41
#13 0x000055f4a2c83426 in ?? ()
#14 0x000055f4a2c84faf in PostmasterMain ()
#15 0x000055f4a29eeeed in main ()
If you'd rather have me open another issue, please tell me.
anayrat commented on Dec 11, 2020
FYI, I'm pretty sure we already hit this bug in the past (~2 years ago) on the Powa demo.
ololobus commented on Dec 11, 2020
OK, I have spent some time today digging into the code and I can now reproduce this issue reliably (I hope).
To reproduce the same collector hang with 100% certainty, one has to:
1. Reduce the shm_mq size COLLECTOR_QUEUE_SIZE to e.g. 64 bytes.
2. Put a sleep just before send_profile(); I used pg_usleep(1000L*1000L*60);.
3. Recompile and install the pg_wait_sampling extension.
Then just open a psql session, do select * from pg_wait_sampling_get_profile();, wait a couple of seconds and cancel this query. That is all: the collector will hang forever with the same stack trace as provided by @marco44:
* thread #1, name = 'postgres', stop reason = signal SIGSTOP
  * frame #0: 0x00007f50e83b87b7 libc.so.6`epoll_wait(epfd=3, events=0x000055d2eca369b8, maxevents=1, timeout=-1) at epoll_wait.c:30
    frame #1: 0x000055d2eb296483 postgres`WaitEventSetWaitBlock(set=0x000055d2eca36940, cur_timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1) at latch.c:1080
    frame #2: 0x000055d2eb29635c postgres`WaitEventSetWait(set=0x000055d2eca36940, timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1, wait_event_info=134217755) at latch.c:1032
    frame #3: 0x000055d2eb295a72 postgres`WaitLatchOrSocket(latch=0x00007f50e7d0e254, wakeEvents=33, sock=-1, timeout=-1, wait_event_info=134217755) at latch.c:407
    frame #4: 0x000055d2eb2958d9 postgres`WaitLatch(latch=0x00007f50e7d0e254, wakeEvents=33, timeout=0, wait_event_info=134217755) at latch.c:347
    frame #5: 0x000055d2eb29eb06 postgres`shm_mq_send_bytes(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false, bytes_written=0x00007fffc6f05b88) at shm_mq.c:976
    frame #6: 0x000055d2eb29dfb5 postgres`shm_mq_sendv(mqh=0x000055d2eca37bc0, iov=0x00007fffc6f05c00, iovcnt=1, nowait=false) at shm_mq.c:478
    frame #7: 0x000055d2eb29db40 postgres`shm_mq_send(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false) at shm_mq.c:328
    frame #8: 0x00007f50e8649bec pg_wait_sampling.so`send_profile(profile_hash=0x000055d2eca7ac58, mqh=0x000055d2eca37bc0) at collector.c:258
    frame #9: 0x00007f50e864a215 pg_wait_sampling.so`collector_main(main_arg=0) at collector.c:464
and holding the same lock as reported by @banlex73 and @PavelSorokin:
=# select d.datname, l.locktype, l."database", l.transactionid, l."mode", l.pid from pg_catalog.pg_locks l left join pg_catalog.pg_database d on d."oid" = l."database";
 datname |  locktype  |  database  | transactionid |      mode       |  pid
---------+------------+------------+---------------+-----------------+-------
 alexk   | relation   | 16384      | (null)        | AccessShareLock | 18054
 (null)  | virtualxid | (null)     | (null)        | ExclusiveLock   | 18054
 (null)  | relation   | 0          | (null)        | AccessShareLock | 18054
 (null)  | userlock   | 3398742279 | (null)        | ExclusiveLock   | 17890
Without cancelling the query it works well; you just have to wait out the extra sleep time. So these hacks do not break the collector logic.
These repro steps may look purely synthetic, but I think the same state can be reached in the wild under the following conditions:
1. The profile size is really large, so it does not fit into the shm_mq size at once (~16 KB).
2. The query requesting the profile is cancelled almost immediately after being issued.
I am not sure that 1) is absolutely required, but without it the race window is much tighter. Essentially, the collector wants to put more data into the queue, but nobody is listening on the other end anymore.
Anyway, this is the best hypothesis I have right now.
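If this hypothesis is right, the collector is stuck inside a blocking shm_mq_send() on a queue whose reader will never drain it. One conceivable collector-side mitigation (purely a sketch under that assumption, not the extension's actual code and not the fix that was eventually made) is to send with nowait = true and give up after a bounded wait when no progress is possible:
#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "storage/shm_mq.h"

/*
 * Hypothetical helper: try to send one message without ever blocking
 * forever. Returns true on success; false if the receiver detached or no
 * progress was made within timeout_ms (in that case a partial message may
 * remain in the queue, so the caller should detach and rebuild the queue).
 */
static bool
send_with_timeout(shm_mq_handle *mqh, Size nbytes, void *data, long timeout_ms)
{
	for (;;)
	{
		shm_mq_result res = shm_mq_send(mqh, nbytes, data, true /* nowait */);

		if (res == SHM_MQ_SUCCESS)
			return true;
		if (res == SHM_MQ_DETACHED)
			return false;		/* reader is gone */

		/* SHM_MQ_WOULD_BLOCK: wait for the reader, but only for so long */
		if (WaitLatch(MyLatch,
					  WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					  timeout_ms,
					  PG_WAIT_EXTENSION) & WL_TIMEOUT)
			return false;		/* no progress within the deadline */

		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}
The fix discussed below took a different route (cleanup on the backend side), but this illustrates where the collector loses control: shm_mq_send() with nowait = false has no way to time out once the requesting backend stops reading.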
banlex73 commented on Dec 11, 2020
Thank you, Alexey.
I have this on all my affected clusters:
pg_stat_statements.max = '10000'
pg_stat_statements.track = 'all'
Probably I need to reduce them to make the pg_wait_sampling extension work.
marco44 commented on Dec 12, 2020
That's probably what happened to us too. I was trying to debug another issue in powa and must have Ctrl+C'ed a query (our view is really big); I have a tendency to realize I did something stupid the exact moment I've done it, and to cancel immediately what I've just asked for :)
ololobus commented on Dec 15, 2020
I think I have managed to fix this issue. At least I have added a proper cleanup (shm_mq detach) on backend ERROR, FATAL, and Ctrl+C, so the collector can continue its operation in these cases.
It is still not clear to me whether this was the original issue, so testing of the branch issue#29 is very welcome from everyone in this thread.
UPD: I have tested it with PG11-13 on my own.
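Presumably the cleanup described above amounts to guaranteeing that the requesting backend detaches from the shared memory queue on every abnormal exit path, so the collector's pending shm_mq_send() fails fast instead of waiting forever. A sketch of that pattern using PostgreSQL's error-cleanup hooks (the function names and surrounding structure are assumptions, not the actual patch in the issue#29 branch):
#include "postgres.h"

#include "storage/ipc.h"
#include "storage/shm_mq.h"

/*
 * Cleanup callback: detach from the queue so the collector is not left
 * blocked in shm_mq_send() if this backend errors out or is cancelled
 * while the profile is being transferred.
 */
static void
pgws_cleanup_callback(int code, Datum arg)
{
	shm_mq_handle *mqh = (shm_mq_handle *) DatumGetPointer(arg);

	if (mqh != NULL)
		shm_mq_detach(mqh);
}

/* Hypothetical receive path on the backend side */
static void
pgws_receive_profile(shm_mq_handle *mqh)
{
	PG_ENSURE_ERROR_CLEANUP(pgws_cleanup_callback, PointerGetDatum(mqh));
	{
		/* ... send the request and read the profile from the queue ... */
	}
	PG_END_ENSURE_ERROR_CLEANUP(pgws_cleanup_callback, PointerGetDatum(mqh));
}
This matches the hypothesis above: the hang comes from the receiver disappearing without detaching, so making the detach unconditional on the backend's exit paths removes the collector's indefinite wait.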