Skip to content

Файлы архивов WAL *.gz.part прерывают архивацию #377

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
triwada opened this issue Apr 29, 2021 · 5 comments
Assignees

Comments

@triwada
Copy link

triwada commented Apr 29, 2021

Столкнулся с такой проблемой:
заполнился полностью раздел для бэкапов (NFS), очистил часть места, но процесс архивации WAL продолжал висеть в прежнем состоянии (на одном и том же WAL)

$ pg_probackup -V
pg_probackup 2.4.15 (Postgres Pro 9.6.18.1 enterprise)

archive_command = '/usr/bin/pg_probackup archive-push -B /pgdump/backup --instance=dummy --compress-level=1 -j 20 --batch-size=200 --wal-file-path %p --wal-file-name %f --overwrite'

В списке процессов было

postgres 123710  0.0  0.0 816616  4160 ?        Sl   Apr28   0:15 /usr/bin/pg_probackup archive-push -B /pgdump/backup --instance=dummy --compress-level=1 -j 10 --batch-size=100 --wal-file-path pg_xlog/000000A1000287D300000087 --wal-file-name 000000A1000287D300000087 --overwrite

[postgres@somehost ~]$ ps auxwww | grep archiver
postgres  38529  0.7  0.0 112848  4588 ?        Ss   Apr18 116:05 postgres: archiver process   archiving 000000A1000287D300000087

В логе pg_probackup новых записей не было, лог остановился на моменте заполнения раздела.

Прибил процесс зависшей команды из archive_command
[postgres@somehost ~]$ kill -9 123710

Запустилась новая команда pid=101930

[postgres@somehost ~]$ ps auxwww | grep pg_probackup
postgres 101930  237  0.0 1567064 6044 ?        Sl   02:40   2:03 /usr/bin/pg_probackup archive-push -B /pgdump/backup --instance=dummy --compress-level=1 -j 20 --batch-size=200 --wal-file-path pg_xlog/000000A1000287D300000087 --wal-file-name 000000A1000287D300000087 --overwrite
postgres 103193  0.0  0.0   8244   796 pts/3    S+   02:40   0:00 grep --color=auto pg_probackup

В лог pg_probackup начали добавляться новые записи



2021-04-29 02:40:00 MSK [101930]: INFO: pg_probackup archive-push WAL file: 000000A1000287D300000087, threads: 20/20, batch: 200/200, compression: zlib
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000E0"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000DF"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000DE"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D300000087"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000E1"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000E5"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000E7"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000E8"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000DF.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E0.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000DE.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000E9"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000EA"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000EB"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000EC"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000EE"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000ED"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E1.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E5.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E7.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E8.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E9.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000EA.gz.part"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000F0"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000EF"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000F1"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000F2"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000F3"
2021-04-29 02:40:00 MSK [101930]: LOG: pushing file "000000A1000287D3000000F4"
2021-04-29 02:40:01 MSK [101930]: LOG: WAL file already exists in archive with the same checksum, skip pushing: "/pgdata/9.6/data/pg_xlog/000000A1000287D300000087"
2021-04-29 02:40:01 MSK [101930]: LOG: pushing file "000000A1000287D3000000F5"
2021-04-29 02:40:02 MSK [101930]: LOG: pushing file "000000A1000287D3000000F6"


Но процесс архивации снова висел на том же WAL, что и до kill -9 старого процесса.


Cнова прибиваю

[postgres@somehost ~]$ kill -9 101930


В логе новые записи, но по сути - тоже самое


2021-04-29 02:40:20 MSK [101930]: LOG: pushing file "000000A1000287D4000000A5"
2021-04-29 02:40:20 MSK [101930]: LOG: pushing file "000000A1000287D4000000A6"
2021-04-29 02:40:20 MSK [101930]: LOG: pushing file "000000A1000287D4000000A7"
2021-04-29 02:40:20 MSK [101930]: LOG: pushing file "000000A1000287D4000000A8"
2021-04-29 02:49:05 MSK [113985]: INFO: pg_probackup archive-push WAL file: 000000A1000287D300000087, threads: 20/20, batch: 200/200, compression: zlib
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000DE"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D300000087"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000DF"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000E1"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000E0"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000E5"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000E7"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000E8"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000DE.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000DF.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E1.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E0.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E5.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E7.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E8.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000EA"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D3000000E9"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000A9"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000AA"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000AB"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000AC"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000AD"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000AE"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000AF"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000B0"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000B1"
2021-04-29 02:49:05 MSK [113985]: LOG: pushing file "000000A1000287D4000000B2"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000E9.gz.part"
2021-04-29 02:49:05 MSK [113985]: LOG: Temp WAL file already exists, waiting on it 24000 seconds: "/pgdump/backup/wal/dummy/000000A1000287D3000000EA.gz.part"
2021-04-29 02:49:06 MSK [113985]: LOG: pushing file "000000A1000287D4000000B3"
2021-04-29 02:49:06 MSK [113985]: LOG: pushing file "000000A1000287D4000000B4"


[postgres@somehost ~]$ ls -lh /pgdump/backup/wal/dummy/000000A1000287D3*.gz.part
-rw------- 1 postgres postgres 3.1M Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000DE.gz.part
-rw------- 1 postgres postgres 3.1M Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000DF.gz.part
-rw------- 1 postgres postgres 3.2M Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000E0.gz.part
-rw------- 1 postgres postgres 1.2M Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000E1.gz.part
-rw------- 1 postgres postgres    0 Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000E5.gz.part
-rw------- 1 postgres postgres    0 Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000E7.gz.part
-rw------- 1 postgres postgres    0 Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000E8.gz.part
-rw------- 1 postgres postgres    0 Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000E9.gz.part
-rw------- 1 postgres postgres    0 Apr 28 21:51 /pgdump/backup/wal/dummy/000000A1000287D3000000EA.gz.part

Удалил эти файлы .gz.part
[postgres@somehost ~]$ rm -f /pgdump/backup/wal/dummy/000000A1000287D3*.gz.part

После этого архивация заработала.

@gsmolk
Copy link
Contributor

gsmolk commented Apr 29, 2021

Добрый день!
Это stale check, push видит уже существующий .part файл и ждет archive_timeout времени. Если за это время файл не пропадет, то он признается stale и выпиливается.

waiting on it 24000 seconds

Вот эта чиселка берется от archive-timeout, и видимо это было не очень умно с нашей стороны так делать. Надо добавить отдельный параметр

@gsmolk gsmolk added this to the 2.4.16 milestone Apr 29, 2021
@gsmolk gsmolk self-assigned this Apr 29, 2021
@gsmolk
Copy link
Contributor

gsmolk commented Apr 29, 2021

Вы можете защититься от этого, просто задав небольшое значение для параметра --archive-timeout в команде archive-push

@triwada
Copy link
Author

triwada commented Apr 29, 2021

к сожалению, это параметр пришлось увеличивать (#209)

@gsmolk
Copy link
Contributor

gsmolk commented Apr 29, 2021

к сожалению, это параметр пришлось увеличивать (#209)

Ну так ничто не мешает переопределить его для archive-push на командной строке.

@gsmolk gsmolk modified the milestones: 2.4.17, 2.5.3 Oct 15, 2021
@CHERTS
Copy link

CHERTS commented Nov 11, 2021

Всем привет. Мы тоже на NFS архивируем WAL и ситуация с заполнением иногда бывает, но как-то проносило и все само рассасывалось при увеличении раздела на стороне NFS сервера через некоторое время WAL'ы начинали архивироваться.
Но мне тут вот что непонятно - почему не удалять .part файл который помечен как stale скажем через 2 x --archive-timeout времени? Конечно --archive-timeout может быть разный на primary и standby (ситуация когда мы архивируем валы и на мастере и на репликах), но даже введение отдельного параметра не сможет достоверно сказать по какой причине .part файл не переименовывается, поэтому логично ждать некоторое время и тупо удалять эти файлы.

@kulaginm kulaginm modified the milestones: 2.5.4, 2.5.5 Dec 22, 2021
@kulaginm kulaginm modified the milestones: 2.5.5, 2.5.6 Feb 16, 2022
@kulaginm kulaginm modified the milestones: 2.5.6, 2.5.7 May 26, 2022
@MetalDream666 MetalDream666 modified the milestones: 2.5.7, 2.5.8 Jun 1, 2022
@Burus Burus removed this from the 2.5.8 milestone Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants