
pg_wait_sampling collector blocks select * FROM pg_wait_sampling_profile when a database was dropped from the cluster #29

@banlex73

Description

What: any session that tries to select from pg_wait_sampling_profile is blocked by the pg_wait_sampling collector.
Example:
pid | usename | blocked_by | blocking_duration | blocked_query
-------+----------+------------+-------------------+-----------------------------------------------------
30120 | postgres | {8966} | 7.279804 | select * FROM pg_wait_sampling_profile ;

ps -ef|grep 8966
postgres 8966 4118 5 Jul14 ? 6-05:23:43 postgres: nanp2: pg_wait_sampling collector
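
For reference, output like the table above can be produced with a query along these lines (a sketch using pg_blocking_pids(), available since PostgreSQL 9.6; the column aliases are illustrative):

select a.pid, a.usename,
       pg_blocking_pids(a.pid) as blocked_by,
       extract(epoch from clock_timestamp() - a.query_start) as blocking_duration,
       a.query as blocked_query
from pg_catalog.pg_stat_activity a
where cardinality(pg_blocking_pids(a.pid)) > 0;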

List all locks that pid 8966 holds:

select d.datname , l.locktype, l."database", l.transactionid , l."mode", l.pid  from pg_catalog.pg_locks l 
left join pg_catalog.pg_database d on d."oid" = l."database" 
where l."database" is not null and l."database" <>0
and pid=8966;

datname | locktype | database | transactionid | mode | pid
---------+----------+------------+---------------+---------------+------
| userlock | 3398742279 | | ExclusiveLock | 8966
(1 row)
As you see, there's no database with oid=3398742279

Environment:
pg_wait_sampling version 1.1
Postgres 12.3
CentOS 7

Please let me know what else I need to share with you to help confirm and fix this (if it's a bug).

Thank you in advance.

Activity

PavelSorokin commented on Nov 6, 2020

Good afternoon. I have run into the same problem on PostgresPro Standard 11.9, RH7 (on two different servers), pg_wait_sampling version 1.1: all queries from mamonsu are blocked.
[screenshot attached]

ololobus (Contributor) commented on Nov 17, 2020

Hi, I am not sure whether this is a bug or a problem with the initial architecture.

@banlex73, the non-existent database OID looks weird, of course. Is it blocked forever by the collector, or is it still possible to select from pg_wait_sampling_profile from time to time?

@PavelSorokin, if you have an active PostgresPro support contract, you can contact them as well.

Anyway, we will try to look into this issue more closely.

banlex73 (Author) commented on Nov 18, 2020

Hi Alexey @ololobus, I've just checked one of my clusters and can confirm that the LOCK is PERMANENT. I tried select * from pg_wait_sampling_profile and it was blocked by the collector.
Hope it helps.
PS: Please feel free to contact me if you need anything else to investigate this issue.

ololobus (Contributor) commented on Nov 30, 2020

Today I tried to reproduce this issue with pg_wait_sampling at the current stable branch state on REL_12_STABLE, and I cannot reproduce it.

What I have tried so far is to repeat the following steps (a couple of dozen times; a rough psql sketch follows the list):

  1. Put pg_wait_sampling into shared_preload_libraries and create the extension.
  2. Do select * from pg_wait_sampling_profile;.
  3. Create an extra test database and the pg_wait_sampling extension there.
  4. Do select * from pg_wait_sampling_profile; again, with or without additional payload (checkpoints, create table, etc.).
  5. Drop the test database.
  6. Do select * from pg_wait_sampling_profile; a couple of times again. IIUC, it should get stuck, but it works for me.
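
For reference, the steps above correspond roughly to the following psql sequence (a sketch; the database name testdb is arbitrary, and shared_preload_libraries = 'pg_wait_sampling' is assumed to be already set):

CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;
SELECT * FROM pg_wait_sampling_profile;
CREATE DATABASE testdb;
\c testdb
CREATE EXTENSION pg_wait_sampling;
CREATE TABLE payload AS SELECT g FROM generate_series(1, 100000) g;  -- some extra load
CHECKPOINT;
SELECT * FROM pg_wait_sampling_profile;
\c postgres
DROP DATABASE testdb;
SELECT * FROM pg_wait_sampling_profile;  -- per the report this should hang, but it works for me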

Am I missing something? @banlex73, maybe you (or @PavelSorokin) can provide more specific steps, so I can reproduce this issue?

P.S. I accidentally closed this issue with some weird keyboard shortcut while typing this comment, but I reopened it immediately, do not worry.

banlex73 (Author) commented on Dec 1, 2020

banlex73 (Author) commented on Dec 9, 2020

Quick update:
I've been running a script in a loop for 5 days already to reproduce this issue, with no luck.
What I am doing, in a loop:

  • create a database
  • generate load (bulk insert into a table, several heavy selects)
  • drop the database

Another session periodically tries to select data from pg_wait_sampling_profile.
Everything works fine so far, nothing gets blocked.
I will run my test for 2 more days and then give up.
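
Roughly, one iteration of the loop looks like this (a sketch; database and table names are arbitrary), while a second session runs select * from pg_wait_sampling_profile; every few seconds:

CREATE DATABASE loadtest;
\c loadtest
CREATE TABLE t AS
    SELECT g AS id, md5(g::text) AS payload
    FROM generate_series(1, 1000000) g;            -- bulk insert
SELECT count(*) FROM t a JOIN t b USING (id);      -- a heavy select
\c postgres
DROP DATABASE loadtest;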

marco44 commented on Dec 10, 2020

Hi, I think we're having the exact same problem: we're waiting on a lock and cannot use the pg_wait_sampling_profile view anymore. We cannot use pg_wait_sampling_history either; pg_wait_sampling_current is still accessible. Here is the lock we are stuck on:

 locktype |  database  | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |     mode      | granted | fastpath 
----------+------------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------+---------+----------
 userlock | 3398742279 |          |      |       |            |               |       1 |     0 |        0 | 2/0                | 30754 | ExclusiveLock | t       | f
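
For anyone checking for the same symptom, the row above can be listed with a simple query along these lines:

select * from pg_locks where locktype = 'userlock';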

No link with DROP DATABASE though, as we don't drop databases. For now, here is a stack backtrace from when this occurs.

(gdb) bt
#0  0x00007fbe92b9d7b7 in epoll_wait (epfd=3, events=0x55f4a34f8d00, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055f4a2cda169 in WaitEventSetWait ()
#2  0x000055f4a2cda609 in WaitLatchOrSocket ()
#3  0x000055f4a2cdf8bf in ?? ()
#4  0x000055f4a2cdfc10 in shm_mq_sendv ()
#5  0x000055f4a2cdfdb0 in shm_mq_send ()
#6  0x00007fbe8ff1c51f in send_profile (mqh=0x55f4a34fae70, profile_hash=0x55f4a3554f70) at collector.c:266
#7  collector_main (main_arg=<optimized out>) at collector.c:459
#8  0x000055f4a2c74fbe in StartBackgroundWorker ()
#9  0x000055f4a2c82328 in ?? ()
#10 0x000055f4a2c8261f in ?? ()
#11 <signal handler called>
#12 0x00007fbe92b94ff7 in __GI___select (nfds=6, readfds=0x7ffc19870b00, writefds=0x0, exceptfds=0x0, timeout=0x7ffc19870a60) at ../sysdeps/unix/sysv/linux/select.c:41
#13 0x000055f4a2c83426 in ?? ()
#14 0x000055f4a2c84faf in PostmasterMain ()
#15 0x000055f4a29eeeed in main ()

If you'd rather have me open another issue, please tell me.

anayrat (Collaborator) commented on Dec 11, 2020

FYI, I'm pretty sure we already hit this bug in the past (~2 years ago) on the Powa demo.

ololobus (Contributor) commented on Dec 11, 2020

OK, I have spent some time today digging into the code, and I can now reliably reproduce this issue (I hope).

To reproduce the same collector hang with 100% certainty, one has to:

  1. Reduce the shm_mq size COLLECTOR_QUEUE_SIZE to e.g. 64 bytes.
  2. Put a sleep just before send_profile(); I used pg_usleep(1000L*1000L*60);.
  3. Recompile and install the pg_wait_sampling extension.

Then just open a psql session, do select * from pg_wait_sampling_get_profile();, wait a couple of seconds, and cancel the query. That is all: the collector will hang forever with the same stack trace as provided by @marco44:

* thread #1, name = 'postgres', stop reason = signal SIGSTOP
  * frame #0: 0x00007f50e83b87b7 libc.so.6`epoll_wait(epfd=3, events=0x000055d2eca369b8, maxevents=1, timeout=-1) at epoll_wait.c:30
    frame #1: 0x000055d2eb296483 postgres`WaitEventSetWaitBlock(set=0x000055d2eca36940, cur_timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1) at latch.c:1080
    frame #2: 0x000055d2eb29635c postgres`WaitEventSetWait(set=0x000055d2eca36940, timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1, wait_event_info=134217755) at latch.c:1032
    frame #3: 0x000055d2eb295a72 postgres`WaitLatchOrSocket(latch=0x00007f50e7d0e254, wakeEvents=33, sock=-1, timeout=-1, wait_event_info=134217755) at latch.c:407
    frame #4: 0x000055d2eb2958d9 postgres`WaitLatch(latch=0x00007f50e7d0e254, wakeEvents=33, timeout=0, wait_event_info=134217755) at latch.c:347
    frame #5: 0x000055d2eb29eb06 postgres`shm_mq_send_bytes(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false, bytes_written=0x00007fffc6f05b88) at shm_mq.c:976
    frame #6: 0x000055d2eb29dfb5 postgres`shm_mq_sendv(mqh=0x000055d2eca37bc0, iov=0x00007fffc6f05c00, iovcnt=1, nowait=false) at shm_mq.c:478
    frame #7: 0x000055d2eb29db40 postgres`shm_mq_send(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false) at shm_mq.c:328
    frame #8: 0x00007f50e8649bec pg_wait_sampling.so`send_profile(profile_hash=0x000055d2eca7ac58, mqh=0x000055d2eca37bc0) at collector.c:258
    frame #9: 0x00007f50e864a215 pg_wait_sampling.so`collector_main(main_arg=0) at collector.c:464

and it holds the same lock as reported by @banlex73 and @PavelSorokin:

=# select d.datname , l.locktype, l."database", l.transactionid , l."mode", l.pid  from pg_catalog.pg_locks l 
left join pg_catalog.pg_database d on d."oid" = l."database";
 datname |  locktype  |  database  | transactionid |      mode       |  pid  
---------+------------+------------+---------------+-----------------+-------
 alexk   | relation   |      16384 |        (null) | AccessShareLock | 18054
 (null)  | virtualxid |     (null) |        (null) | ExclusiveLock   | 18054
 (null)  | relation   |          0 |        (null) | AccessShareLock | 18054
 (null)  | userlock   | 3398742279 |        (null) | ExclusiveLock   | 17890

Without cancelling the query it works well; you just have to wait out the extra sleep time, so these hacks do not break the collector logic.

These repro steps may look purely synthetic, but I think the same state can be reached in the wild under the following conditions:

  1. The profile is really large, so it does not fit into the shm_mq size at once (~16 KB).
  2. The query requesting the profile was canceled almost immediately after being issued.

I am not sure that 1) is absolutely required, but without it the race window is much tighter. Either way, the collector wants to put more data into the queue, but nobody is listening on the other end anymore.
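
A rough way to gauge condition 1) on a system where the collector is still responding (a sketch; the ~32 bytes per item figure is only an assumption about the wire format, not an exact number):

select count(*) as profile_items,
       count(*) * 32 as approx_wire_bytes   -- compare against the ~16 KB queue size
from pg_wait_sampling_profile;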

Anyway, this is the best hypothesis I have right now.

banlex73 (Author) commented on Dec 11, 2020

marco44 commented on Dec 12, 2020

That's probably what happened to us too. I was trying to debug another issue in powa and must have Ctrl-C'ed a query (our view is really big); I have a tendency to realize I've done something stupid the exact moment I've done it, and to immediately cancel what I've just asked for :)

ololobus (Contributor) commented on Dec 15, 2020

I think I have managed to fix this issue. At least, I have added proper cleanup (shm_mq detach) on backend ERROR, FATAL, or Ctrl+C, so the collector can continue operating in this case.

It is still not clear to me whether this was the original issue, so testing of the issue#29 branch by everyone in this thread is very welcome.

UPD: I have tested it with PG11-13 on my own.
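
A quick way to exercise the fix is the same cancel-and-retry scenario as in the repro above (a sketch):

SELECT * FROM pg_wait_sampling_get_profile();   -- cancel with Ctrl+C while it is running
SELECT count(*) FROM pg_wait_sampling_profile;  -- with the fix this should return instead of hanging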

20 remaining items
