
pg_wait_sampling collector blocks select * FROM pg_wait_sampling_profile when a database was dropped from the cluster #29

@banlex73

Description

What: any session that tries to select from pg_wait_sampling_profile is blocked by the pg_wait_sampling collector.
Example:
pid | usename | blocked_by | blocking_duration | blocked_query
-------+----------+------------+-------------------+-----------------------------------------------------
30120 | postgres | {8966} | 7.279804 | select * FROM pg_wait_sampling_profile ;

ps -ef|grep 8966
postgres 8966 4118 5 Jul14 ? 6-05:23:43 postgres: nanp2: pg_wait_sampling collector
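
For reference, output like the table above can be produced with a query along these lines (a sketch using pg_blocking_pids(), available since PostgreSQL 9.6; the column aliases are illustrative):

select a.pid, a.usename,
       pg_blocking_pids(a.pid) as blocked_by,
       extract(epoch from clock_timestamp() - a.query_start) as blocking_duration,
       a.query as blocked_query
from pg_catalog.pg_stat_activity a
where cardinality(pg_blocking_pids(a.pid)) > 0;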

List all locks that pid 8966 holds:

select d.datname , l.locktype, l."database", l.transactionid , l."mode", l.pid  from pg_catalog.pg_locks l 
left join pg_catalog.pg_database d on d."oid" = l."database" 
where l."database" is not null and l."database" <>0
and pid=8966;

datname | locktype | database | transactionid | mode | pid
---------+----------+------------+---------------+---------------+------
| userlock | 3398742279 | | ExclusiveLock | 8966
(1 row)
As you see, there's no database with oid=3398742279

Environment:
pg_wait_sampling version 1.1
Postgres 12.3
CentOS 7

Please let me know what else I need to share with you to help confirm and fix this (if it's a bug).

Thank you in advance.

Activity

PavelSorokin commented on Nov 6, 2020

Good afternoon. I have run into the same problem on PostgresPro Standard 11.9, RH7 (on two different servers), pg_wait_sampling version 1.1: all queries from mamonsu are blocked.
[screenshot attached]

ololobus (Contributor) commented on Nov 17, 2020

Hi, I am not sure whether this is a bug or a problem with the initial architecture.

@banlex73, the non-existent database OID looks weird, of course. Is it blocked forever by the collector, or is it still possible to select from pg_wait_sampling_profile from time to time?

@PavelSorokin, if you have an active PostgresPro support contract, you can contact them as well.

Anyway, we will try to look into this issue more closely.

banlex73 (Author) commented on Nov 18, 2020

Hi Alexey @ololobus, I've just checked one of my clusters and can confirm that the LOCK is PERMANENT. I tried select * from pg_wait_sampling_profile and it was blocked by the collector.
Hope it helps.
PS: Please feel free to contact me if you need anything else to investigate this issue.

ololobus (Contributor) commented on Nov 30, 2020

Today I tried to reproduce this issue with pg_wait_sampling at the current stable branch state on REL_12_STABLE, and I cannot reproduce it.

What I have tried so far is to repeat the following steps (a couple of dozen times; a rough psql sketch follows the list):

  1. Put pg_wait_sampling into shared_preload_libraries and create the extension.
  2. Do select * from pg_wait_sampling_profile;.
  3. Create an extra test database and the pg_wait_sampling extension there.
  4. Do select * from pg_wait_sampling_profile; again, with or without additional payload (checkpoints, create table, etc.).
  5. Drop the test database.
  6. Do select * from pg_wait_sampling_profile; a couple of times again. IIUC, it should get stuck, but it works for me.
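
For reference, the steps above correspond roughly to the following psql sequence (a sketch; the database name testdb is arbitrary, and shared_preload_libraries = 'pg_wait_sampling' is assumed to be already set):

CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;
SELECT * FROM pg_wait_sampling_profile;
CREATE DATABASE testdb;
\c testdb
CREATE EXTENSION pg_wait_sampling;
CREATE TABLE payload AS SELECT g FROM generate_series(1, 100000) g;  -- some extra load
CHECKPOINT;
SELECT * FROM pg_wait_sampling_profile;
\c postgres
DROP DATABASE testdb;
SELECT * FROM pg_wait_sampling_profile;  -- per the report this should hang, but it works for me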

Am I missing something? @banlex73, maybe you (or @PavelSorokin) can provide more specific steps, so I can reproduce this issue?

P.S. I accidentally closed this issue with some weird keyboard shortcut while typing this comment, but I reopened it immediately, do not worry.

banlex73 (Author) commented on Dec 1, 2020

banlex73 (Author) commented on Dec 9, 2020

Quick update:
I've been running a script in a loop for 5 days already to reproduce this issue, with no luck.
What I am doing, in a loop:

  • create a database
  • generate load (bulk insert into a table, several heavy selects)
  • drop the database

Another session periodically tries to select data from pg_wait_sampling_profile.
Everything works fine so far, nothing gets blocked.
I will run my test for 2 more days and then give up.
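
Roughly, one iteration of the loop looks like this (a sketch; database and table names are arbitrary), while a second session runs select * from pg_wait_sampling_profile; every few seconds:

CREATE DATABASE loadtest;
\c loadtest
CREATE TABLE t AS
    SELECT g AS id, md5(g::text) AS payload
    FROM generate_series(1, 1000000) g;            -- bulk insert
SELECT count(*) FROM t a JOIN t b USING (id);      -- a heavy select
\c postgres
DROP DATABASE loadtest;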

marco44 commented on Dec 10, 2020

Hi, I think we're having the exact same problem: we're waiting on a lock and cannot use the pg_wait_sampling_profile view anymore. We cannot use pg_wait_sampling_history either; pg_wait_sampling_current is still accessible. Here is the lock we are stuck on:

 locktype |  database  | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |     mode      | granted | fastpath 
----------+------------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------+---------+----------
 userlock | 3398742279 |          |      |       |            |               |       1 |     0 |        0 | 2/0                | 30754 | ExclusiveLock | t       | f
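
For anyone checking for the same symptom, the row above can be listed with a simple query along these lines:

select * from pg_locks where locktype = 'userlock';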

No link with DROP DATABASE though, as we don't drop databases. For now, here is a stack backtrace from when this occurs.

(gdb) bt
#0  0x00007fbe92b9d7b7 in epoll_wait (epfd=3, events=0x55f4a34f8d00, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055f4a2cda169 in WaitEventSetWait ()
#2  0x000055f4a2cda609 in WaitLatchOrSocket ()
#3  0x000055f4a2cdf8bf in ?? ()
#4  0x000055f4a2cdfc10 in shm_mq_sendv ()
#5  0x000055f4a2cdfdb0 in shm_mq_send ()
#6  0x00007fbe8ff1c51f in send_profile (mqh=0x55f4a34fae70, profile_hash=0x55f4a3554f70) at collector.c:266
#7  collector_main (main_arg=<optimized out>) at collector.c:459
#8  0x000055f4a2c74fbe in StartBackgroundWorker ()
#9  0x000055f4a2c82328 in ?? ()
#10 0x000055f4a2c8261f in ?? ()
#11 <signal handler called>
#12 0x00007fbe92b94ff7 in __GI___select (nfds=6, readfds=0x7ffc19870b00, writefds=0x0, exceptfds=0x0, timeout=0x7ffc19870a60) at ../sysdeps/unix/sysv/linux/select.c:41
#13 0x000055f4a2c83426 in ?? ()
#14 0x000055f4a2c84faf in PostmasterMain ()
#15 0x000055f4a29eeeed in main ()

If you'd rather have me open another issue, please tell me.

anayrat (Collaborator) commented on Dec 11, 2020

FYI, I'm pretty sure we already hit this bug in the past (~2 years ago) on the Powa demo.

ololobus (Contributor) commented on Dec 11, 2020

OK, I have spent some time today digging into the code, and I can now reliably reproduce this issue (I hope).

To reproduce the same collector hang with 100% certainty, one has to:

  1. Reduce the shm_mq size COLLECTOR_QUEUE_SIZE to e.g. 64 bytes.
  2. Put a sleep just before send_profile(); I used pg_usleep(1000L*1000L*60);.
  3. Recompile and install the pg_wait_sampling extension.

Then just open a psql session, do select * from pg_wait_sampling_get_profile();, wait a couple of seconds, and cancel the query. That is all: the collector will hang forever with the same stack trace as provided by @marco44:

* thread #1, name = 'postgres', stop reason = signal SIGSTOP
  * frame #0: 0x00007f50e83b87b7 libc.so.6`epoll_wait(epfd=3, events=0x000055d2eca369b8, maxevents=1, timeout=-1) at epoll_wait.c:30
    frame #1: 0x000055d2eb296483 postgres`WaitEventSetWaitBlock(set=0x000055d2eca36940, cur_timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1) at latch.c:1080
    frame #2: 0x000055d2eb29635c postgres`WaitEventSetWait(set=0x000055d2eca36940, timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1, wait_event_info=134217755) at latch.c:1032
    frame #3: 0x000055d2eb295a72 postgres`WaitLatchOrSocket(latch=0x00007f50e7d0e254, wakeEvents=33, sock=-1, timeout=-1, wait_event_info=134217755) at latch.c:407
    frame #4: 0x000055d2eb2958d9 postgres`WaitLatch(latch=0x00007f50e7d0e254, wakeEvents=33, timeout=0, wait_event_info=134217755) at latch.c:347
    frame #5: 0x000055d2eb29eb06 postgres`shm_mq_send_bytes(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false, bytes_written=0x00007fffc6f05b88) at shm_mq.c:976
    frame #6: 0x000055d2eb29dfb5 postgres`shm_mq_sendv(mqh=0x000055d2eca37bc0, iov=0x00007fffc6f05c00, iovcnt=1, nowait=false) at shm_mq.c:478
    frame #7: 0x000055d2eb29db40 postgres`shm_mq_send(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false) at shm_mq.c:328
    frame #8: 0x00007f50e8649bec pg_wait_sampling.so`send_profile(profile_hash=0x000055d2eca7ac58, mqh=0x000055d2eca37bc0) at collector.c:258
    frame #9: 0x00007f50e864a215 pg_wait_sampling.so`collector_main(main_arg=0) at collector.c:464

and it holds the same lock as reported by @banlex73 and @PavelSorokin:

=# select d.datname , l.locktype, l."database", l.transactionid , l."mode", l.pid  from pg_catalog.pg_locks l 
left join pg_catalog.pg_database d on d."oid" = l."database";
 datname |  locktype  |  database  | transactionid |      mode       |  pid  
---------+------------+------------+---------------+-----------------+-------
 alexk   | relation   |      16384 |        (null) | AccessShareLock | 18054
 (null)  | virtualxid |     (null) |        (null) | ExclusiveLock   | 18054
 (null)  | relation   |          0 |        (null) | AccessShareLock | 18054
 (null)  | userlock   | 3398742279 |        (null) | ExclusiveLock   | 17890

Without cancelling the query it works well; you just have to wait out the extra sleep time, so these hacks do not break the collector logic.

These repro steps may look purely synthetic, but I think the same state can be reached in the wild under the following conditions:

  1. The profile is really large, so it does not fit into the shm_mq size at once (~16 KB).
  2. The query requesting the profile was canceled almost immediately after being issued.

I am not sure that 1) is absolutely required, but without it the race window is much tighter. Either way, the collector wants to put more data into the queue, but nobody is listening on the other end anymore.
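
A rough way to gauge condition 1) on a system where the collector is still responding (a sketch; the ~32 bytes per item figure is only an assumption about the wire format, not an exact number):

select count(*) as profile_items,
       count(*) * 32 as approx_wire_bytes   -- compare against the ~16 KB queue size
from pg_wait_sampling_profile;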

Anyway, this is the best hypothesis I have right now.

banlex73 (Author) commented on Dec 11, 2020

marco44 commented on Dec 12, 2020

That's probably what happened to us too. I was trying to debug another issue in powa and must have Ctrl-C'ed a query (our view is really big); I have a tendency to realize I've done something stupid the exact moment I've done it, and to immediately cancel what I've just asked for :)

ololobus (Contributor) commented on Dec 15, 2020

I think I have managed to fix this issue. At least, I have added proper cleanup (shm_mq detach) on backend ERROR, FATAL, or Ctrl+C, so the collector can continue operating in this case.

It is still not clear to me whether this was the original issue, so testing of the issue#29 branch by everyone in this thread is very welcome.

UPD: I have tested it with PG11-13 on my own.
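
A quick way to exercise the fix is the same cancel-and-retry scenario as in the repro above (a sketch):

SELECT * FROM pg_wait_sampling_get_profile();   -- cancel with Ctrl+C while it is running
SELECT count(*) FROM pg_wait_sampling_profile;  -- with the fix this should return instead of hanging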

20 remaining items
