About to add WAL write/fsync statistics to pg_stat_wal view

Lists:	pgsql-hackers

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2020-12-08 05:06:52
Message-ID:	[email protected]

Hi,

I propose adding WAL write/fsync statistics to the pg_stat_wal view.
They would be useful not only for developing and improving WAL-related
source code but also for users who want to detect workload changes,
hardware failures, and so on.

I introduce a "track_wal_io_timing" parameter and expose the following
information in the pg_stat_wal view.
I made the parameter separate from "track_io_timing" because, IIUC,
WAL I/O activity may have a greater impact on query performance than
database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write      | 650   # Total number of times WAL data was written to disk
wal_write_time | 43    # Total amount of time spent writing WAL data to disk
                       # (if track_wal_io_timing is enabled, otherwise zero)
wal_sync       | 78    # Total number of times WAL data was synced to disk
wal_sync_time  | 104   # Total amount of time spent syncing WAL data to disk
                       # (if track_wal_io_timing is enabled, otherwise zero)
```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
0001_add_wal_io_activity_to_the_pg_stat_wal.patch	text/x-diff	14.4 KB

From:	Li Japin <japinli(at)hotmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2020-12-08 07:45:52
Message-ID:	[email protected]
Lists:	pgsql-hackers

Hi,

> On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
> Hi,
>
> I propose to add wal write/fsync statistics to pg_stat_wal view.
> It's useful not only for developing/improving source code related to WAL
> but also for users to detect workload changes, HW failure, and so on.
>
> I introduce "track_wal_io_timing" parameter and provide the following information on pg_stat_wal view.
> I separate the parameter from "track_io_timing" to "track_wal_io_timing"
> because IIUC, WAL I/O activity may have a greater impact on query performance than database I/O activity.
>
> ```
> postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time FROM pg_stat_wal;
> -[ RECORD 1 ]--+----
> wal_write | 650 # Total number of times WAL data was written to the disk
>
> wal_write_time | 43 # Total amount of time that has been spent in the portion of WAL data was written to disk
> # if track-wal-io-timing is enabled, otherwise zero
>
> wal_sync | 78 # Total number of times WAL data was synced to the disk
>
> wal_sync_time | 104 # Total amount of time that has been spent in the portion of WAL data was synced to disk
> # if track-wal-io-timing is enabled, otherwise zero
> ```
>
> What do you think?
> Please let me know your comments.
>
> Regards
> --
> Masahiro Ikeda
> NTT DATA CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>

There is a "no previous prototype" warning for ‘fsyncMethodCalled’, and it is currently used only in xlog.c;
should we declare it static? Also, this function returns a boolean; should we use
true/false rather than 0/1?

+/*
+ * Check if fsync mothod is called.
+ */
+bool
+fsyncMethodCalled()
+{
+    if (!enableFsync)
+        return 0;
+
+    switch (sync_method)
+    {
+        case SYNC_METHOD_FSYNC:
+        case SYNC_METHOD_FSYNC_WRITETHROUGH:
+        case SYNC_METHOD_FDATASYNC:
+            return 1;
+        default:
+            /* others don't have a specific fsync method */
+            return 0;
+    }
+}
+
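
For reference, the two suggestions above could look roughly like the sketch below. It is a standalone illustration, not the patch itself: the enum values, `enableFsync`, and `sync_method` are stand-ins declared locally so the snippet compiles outside the PostgreSQL tree, and the `(void)` parameter list is ordinary C practice added here on top of the two points raised.

```c
#include <stdbool.h>

/* Stand-ins for the PostgreSQL definitions, declared here only so the
 * sketch is self-contained (assumption: the real ones live in the tree). */
enum
{
    SYNC_METHOD_FSYNC,
    SYNC_METHOD_FDATASYNC,
    SYNC_METHOD_OPEN,
    SYNC_METHOD_FSYNC_WRITETHROUGH,
    SYNC_METHOD_OPEN_DSYNC
};

static bool enableFsync = true;
static int  sync_method = SYNC_METHOD_FDATASYNC;

/*
 * Declared static because it is only used in this file, which also avoids
 * the "no previous prototype" warning. Returns true/false instead of 1/0.
 */
static bool
fsyncMethodCalled(void)
{
    if (!enableFsync)
        return false;

    switch (sync_method)
    {
        case SYNC_METHOD_FSYNC:
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
        case SYNC_METHOD_FDATASYNC:
            return true;
        default:
            /* the other methods have no separate fsync call */
            return false;
    }
}
```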

--
Best regards
ChengDu WenWu Information Technology Co.,Ltd.
Japin Li

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Li Japin <japinli(at)hotmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2020-12-08 11:39:47
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2020-12-08 16:45, Li Japin wrote:
> Hi,
>
>> On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
>> wrote:
>>
>> Hi,
>>
>> I propose to add wal write/fsync statistics to pg_stat_wal view.
>> It's useful not only for developing/improving source code related to
>> WAL
>> but also for users to detect workload changes, HW failure, and so on.
>>
>> I introduce "track_wal_io_timing" parameter and provide the following
>> information on pg_stat_wal view.
>> I separate the parameter from "track_io_timing" to
>> "track_wal_io_timing"
>> because IIUC, WAL I/O activity may have a greater impact on query
>> performance than database I/O activity.
>>
>> ```
>> postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time
>> FROM pg_stat_wal;
>> -[ RECORD 1 ]--+----
>> wal_write | 650 # Total number of times WAL data was written to
>> the disk
>>
>> wal_write_time | 43 # Total amount of time that has been spent in
>> the portion of WAL data was written to disk
>> # if track-wal-io-timing is enabled, otherwise
>> zero
>>
>> wal_sync | 78 # Total number of times WAL data was synced to
>> the disk
>>
>> wal_sync_time | 104 # Total amount of time that has been spent in
>> the portion of WAL data was synced to disk
>> # if track-wal-io-timing is enabled, otherwise
>> zero
>> ```
>>
>> What do you think?
>> Please let me know your comments.
>>
>> Regards
>> --
>> Masahiro Ikeda
>> NTT DATA
>> CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>
>
> There is a no previous prototype warning for ‘fsyncMethodCalled’, and
> it now only used in xlog.c,
> should we declare with static? And this function wants a boolean as a
> return, should we use
> true/false other than 0/1?
>
> +/*
> + * Check if fsync mothod is called.
> + */
> +bool
> +fsyncMethodCalled()
> +{
> + if (!enableFsync)
> + return 0;
> +
> + switch (sync_method)
> + {
> + case SYNC_METHOD_FSYNC:
> + case SYNC_METHOD_FSYNC_WRITETHROUGH:
> + case SYNC_METHOD_FDATASYNC:
> + return 1;
> + default:
> + /* others don't have a specific fsync method */
> + return 0;
> + }
> +}
> +

Thanks for your review.
I agree with your comments. I fixed them.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
0002_add_wal_io_activity_to_the_pg_stat_wal.patch	text/x-diff	14.7 KB

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Li Japin <japinli(at)hotmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2020-12-25 09:45:59
Message-ID:	[email protected]
Lists:	pgsql-hackers

Hi,

I rebased the patch to the master branch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
0003_add_wal_io_activity_to_the_pg_stat_wal.patch	text/x-diff	14.8 KB

From:	"kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
To:	'Masahiro Ikeda' <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-22 02:54:17
Message-ID:	TYAPR01MB3168FD6040FFA8CA1D9E3E8BF5A00@TYAPR01MB3168.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Dear Ikeda-san,

This patch no longer applies to HEAD, but I have a comment anyway.

```
+ /*
+ * Measure i/o timing to fsync WAL data.
+ *
+ * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+ * If sync_method doesn't have its fsync method, to skip too.
+ */
+ if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+ INSTR_TIME_SET_CURRENT(start);
```

I think m_wal_sync_time should be collected even when the process is the WalReceiver,
because every WAL fsync should be recorded, and
some performance cost is already incurred whenever track_wal_io_timing is turned on.
It seems strange to special-case only the WalReceiver.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-22 05:50:20
Message-ID:	CAD21AoB=RCYME3+iroY+8TrC9tsHTPo7kRKCpU7C_kZ8kgGW2A@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
> Hi,
>
> I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are comments:

---
+ if (track_wal_io_timing)
+ {
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ WalStats.m_wal_write_time += INSTR_TIME_GET_MILLISEC(duration);
+ }

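* I think it should add the time in microseconds.

After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following in my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete in less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.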
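
To illustrate the point, here is a small standalone sketch of how per-call truncation loses sub-millisecond durations. It assumes the accumulated counter is an integer, which is what the zero wal_write_time suggests; the struct and function names are hypothetical and do not come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulators, for illustration only. */
typedef struct TimingTotals
{
    int64_t total_ms;   /* each sample truncated to whole milliseconds */
    int64_t total_us;   /* each sample truncated to whole microseconds */
} TimingTotals;

/* Accumulate n durations (given in nanoseconds) in both units. */
static TimingTotals
accumulate(const int64_t *durations_ns, size_t n)
{
    TimingTotals t = {0, 0};

    for (size_t i = 0; i < n; i++)
    {
        t.total_ms += durations_ns[i] / 1000000;    /* < 1 ms becomes 0 */
        t.total_us += durations_ns[i] / 1000;
    }
    return t;
}
```

With four writes taking 250, 400, 900, and 120 microseconds, the millisecond total stays at 0 while the microsecond total reaches 1670.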

---
+ /*
+ * Measure i/o timing to fsync WAL data.
+ *
+ * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+ * If sync_method doesn't have its fsync method, to skip too.
+ */
+ if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+ INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timing
is true? I think the performance degradation also applies to backend
processes. If there is another reason for that, I think it's better to
mention it in both the doc and the comment.

* How about checking track_wal_io_timing first?

* s/standy/standby/

---
+    /* increment the i/o timing and the number of times to fsync WAL data */
+    if (fsyncMethodCalled())
+    {
+        if (!AmWalReceiverProcess() && track_wal_io_timing)
+        {
+            INSTR_TIME_SET_CURRENT(duration);
+            INSTR_TIME_SUBTRACT(duration, start);
+            WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+        }
+
+        WalStats.m_wal_sync++;
+    }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+    if (!enableFsync)
+        return false;
+
+    switch (sync_method)
+    {
+        case SYNC_METHOD_FSYNC:
+        case SYNC_METHOD_FSYNC_WRITETHROUGH:
+        case SYNC_METHOD_FDATASYNC:
+            return true;
+        default:
+            /* others don't have a specific fsync method */
+            return false;
+    }
+}

* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, the call to fsyncMethodCalled() in that path can be
eliminated. That way, this function is called only once. So do we
really need this function?

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

* void is missing as argument?

* s/mothod/method/

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	kuroda(dot)hayato(at)fujitsu(dot)com
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-22 12:14:28
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-22 11:54, kuroda(dot)hayato(at)fujitsu(dot)com wrote:
> Dear Ikeda-san,
>
> This patch cannot be applied to the HEAD, but anyway I put a comment.
>
> ```
> + /*
> + * Measure i/o timing to fsync WAL data.
> + *
> + * The wal receiver skip to collect it to avoid performance
> degradation of standy servers.
> + * If sync_method doesn't have its fsync method, to skip too.
> + */
> + if (!AmWalReceiverProcess() && track_wal_io_timing &&
> fsyncMethodCalled())
> + INSTR_TIME_SET_CURRENT(start);
> ```
>
> I think m_wal_sync_time should be collected even if the process is
> WalRecevier.
> Because all wal_fsync should be recorded, and
> some performance issues have been aleady occurred if
> track_wal_io_timing is turned on.
> I think it's strange only to take care of the walrecevier case.

Kuroda-san, thanks for your comments.

I had thought that the performance impact might be bigger on standby
servers because the WAL receiver doesn't use WAL buffers, but that
doesn't need to be considered.
I agree that if track_wal_io_timing is turned on, performance
degradation occurs on the primary server too.

I will rebase and modify the patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-22 13:05:24
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-22 14:50, Masahiko Sawada wrote:
> On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>
>> Hi,
>>
>> I rebased the patch to the master branch.
>
> Thank you for working on this. I've read the latest patch. Here are
> comments:
>
> ---
> + if (track_wal_io_timing)
> + {
> + INSTR_TIME_SET_CURRENT(duration);
> + INSTR_TIME_SUBTRACT(duration, start);
> + WalStats.m_wal_write_time +=
> INSTR_TIME_GET_MILLISEC(duration);
> + }
>
> * I think it should add the time in micro sec.
> After running pgbench with track_wal_io_timing = on for 30 sec,
> pg_stat_wal showed the following on my environment:
>
> postgres(1:61569)=# select * from pg_stat_wal;
> -[ RECORD 1 ]----+-----------------------------
> wal_records | 285947
> wal_fpi | 53285
> wal_bytes | 442008213
> wal_buffers_full | 0
> wal_write | 25516
> wal_write_time | 0
> wal_sync | 25437
> wal_sync_time | 14490
> stats_reset | 2021-01-22 10:56:13.29464+09
>
> Since writes can complete less than a millisecond, wal_write_time
> didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

> ---
> + /*
> + * Measure i/o timing to fsync WAL data.
> + *
> + * The wal receiver skip to collect it to avoid performance
> degradation of standy servers.
> + * If sync_method doesn't have its fsync method, to skip too.
> + */
> + if (!AmWalReceiverProcess() && track_wal_io_timing &&
> fsyncMethodCalled())
> + INSTR_TIME_SET_CURRENT(start);
>
> * Why does only the wal receiver skip it even if track_wal_io_timinig
> is true? I think the performance degradation is also true for backend
> processes. If there is another reason for that, I think it's better to
> mention in both the doc and comment.
> * How about checking track_wal_io_timing first?
> * s/standy/standby/

I fixed it.
As Kuroda-san also mentioned, there is no need for the WAL receiver to skip it.

> ---
> + /* increment the i/o timing and the number of times to fsync WAL
> data */
> + if (fsyncMethodCalled())
> + {
> + if (!AmWalReceiverProcess() && track_wal_io_timing)
> + {
> + INSTR_TIME_SET_CURRENT(duration);
> + INSTR_TIME_SUBTRACT(duration, start);
> + WalStats.m_wal_sync_time +=
> INSTR_TIME_GET_MILLISEC(duration);
> + }
> +
> + WalStats.m_wal_sync++;
> + }
>
> * I'd avoid always calling fsyncMethodCalled() in this path. How about
> incrementing m_wal_sync after each sync operation?

I think m_wal_sync should not be incremented if no sync to disk
actually occurs.
That depends on enableFsync and sync_method.

enableFsync is checked inside each fsync function, such as
pg_fsync_no_writethrough(),
so incrementing m_wal_sync after each sync operation would have to be
implemented in each of those functions. That leads to a lot of
duplicated code.

So, how about replacing the function with a flag in issue_xlog_fsync()
that records whether data will actually be synced to disk?
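
A rough sketch of the flag idea follows. It is illustrative only: `issue_sync_sketch()`, `WalSyncStats`, and `fake_fsync()` are hypothetical names, the enum is a local stand-in, and the enableFsync check is hoisted into the switch here for brevity even though in the real code it is made inside the individual fsync functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles on its own. */
enum
{
    SYNC_METHOD_FSYNC,
    SYNC_METHOD_FDATASYNC,
    SYNC_METHOD_OPEN,
    SYNC_METHOD_FSYNC_WRITETHROUGH,
    SYNC_METHOD_OPEN_DSYNC
};

typedef struct WalSyncStats
{
    int64_t wal_sync;       /* number of syncs actually issued */
} WalSyncStats;

/* Hypothetical stub standing in for the real fsync call. */
static int
fake_fsync(int fd)
{
    (void) fd;
    return 0;
}

static void
issue_sync_sketch(int fd, int sync_method, bool enable_fsync,
                  WalSyncStats *stats)
{
    bool sync_called = false;   /* set only when a sync is really issued */

    switch (sync_method)
    {
        case SYNC_METHOD_FSYNC:
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
        case SYNC_METHOD_FDATASYNC:
            if (enable_fsync)
            {
                (void) fake_fsync(fd);
                sync_called = true;
            }
            break;
        case SYNC_METHOD_OPEN:
        case SYNC_METHOD_OPEN_DSYNC:
            /* the write already synced the data */
            break;
        default:
            break;
    }

    /* count the sync once, in one place, instead of in every fsync function */
    if (sync_called)
        stats->wal_sync++;
}
```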

> ---
> +/*
> + * Check if fsync mothod is called.
> + */
> +static bool
> +fsyncMethodCalled()
> +{
> + if (!enableFsync)
> + return false;
> +
> + switch (sync_method)
> + {
> + case SYNC_METHOD_FSYNC:
> + case SYNC_METHOD_FSYNC_WRITETHROUGH:
> + case SYNC_METHOD_FDATASYNC:
> + return true;
> + default:
> + /* others don't have a specific fsync method */
> + return false;
> + }
> +}
>
> * I'm concerned that the function name could confuse the reader
> because it's called even before the fsync method is called. As I
> commented above, calling to fsyncMethodCalled() can be eliminated.
> That way, this function is called at only once. So do we really need
> this function?

Thanks to your comments, I removed them.

> * As far as I read the code, issue_xlog_fsync() seems to do fsync even
> if enableFsync is false. Why does the function return false in that
> case? I might be missing something.

IIUC, the reason is that each fsync function, such as
pg_fsync_no_writethrough(), checks enableFsync.

If this code didn't check it, m_wal_sync_time could be incremented
even though some sync methods, like SYNC_METHOD_OPEN, don't sync
any data to disk at that point.

> * void is missing as argument?
>
> * s/mothod/method/

I removed them.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v4-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	15.5 KB

From:	japin <japinli(at)hotmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-22 16:46:47
Message-ID:	MEYP282MB1669B25B065F504ACB6C06E0B6A00@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
Lists:	pgsql-hackers

Hi, Masahiro

Thanks for updating the v4 patch. Here are some comments:

(1)
+ char *msg = NULL;
+ bool sync_called; /* whether to sync data to the disk. */
+ instr_time start;
+ instr_time duration;
+
+ /* check whether to sync data to the disk is really occurred. */
+ sync_called = false;

Maybe we can initialize the "sync_called" variable when declaring it.

(2)
+ if (sync_called)
+ {
+ /* increment the i/o timing and the number of times to fsync WAL data */
+ if (track_wal_io_timing)
+ {
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
+ }
+
+ WalStats.m_wal_sync++;
+ }

There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

In issue_xlog_fsync(), the comment says that if sync_method is
SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, the write has already synced the data.
Does that mean the data is synced when the WAL data is written? And in those cases, we
cannot get accurate write/sync timings or write/sync counts, right?

case SYNC_METHOD_OPEN:
case SYNC_METHOD_OPEN_DSYNC:
/* write synced it already */
break;
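
As background for the question, the sketch below shows how a write can carry its own sync when the file is opened with a synchronous flag, so no separate fsync call follows. It is a generic POSIX illustration, not code from xlog.c: `write_synchronously()` is a hypothetical helper, and the fallback from `O_DSYNC` to `O_SYNC` is an assumption for platforms that lack `O_DSYNC`.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: fall back to O_SYNC where O_DSYNC is not defined. */
#ifndef O_DSYNC
#define O_DSYNC O_SYNC
#endif

/*
 * Write len bytes to path through a descriptor opened with O_DSYNC.
 * Each write() returns only after the data has reached stable storage,
 * so the time spent syncing is folded into the write itself.
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t
write_synchronously(const char *path, const void *buf, size_t len)
{
    int     fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0600);
    ssize_t written;

    if (fd < 0)
        return -1;

    written = write(fd, buf, len);  /* may be a short write; not retried here */

    if (close(fd) != 0)
        return -1;
    return written;
}
```

In such a setup, timing the write would include the sync cost, and a separate sync counter would have nothing to count.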

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	japin <japinli(at)hotmail(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-24 23:33:49
Message-ID:	[email protected]
Lists:	pgsql-hackers

Hi, Japin

Thanks for your comments.

On 2021-01-23 01:46, japin wrote:
> Hi, Masahiro
>
> Thanks for you update the v4 patch. Here are some comments:
>
> (1)
> + char *msg = NULL;
> + bool sync_called; /* whether to sync
> data to the disk. */
> + instr_time start;
> + instr_time duration;
> +
> + /* check whether to sync data to the disk is really occurred.
> */
> + sync_called = false;
>
> Maybe we can initialize the "sync_called" variable when declare it.

Yes, I fixed it.

> (2)
> + if (sync_called)
> + {
> + /* increment the i/o timing and the number of times to
> fsync WAL data */
> + if (track_wal_io_timing)
> + {
> + INSTR_TIME_SET_CURRENT(duration);
> + INSTR_TIME_SUBTRACT(duration, start);
> + WalStats.m_wal_sync_time =
> INSTR_TIME_GET_MICROSEC(duration);
> + }
> +
> + WalStats.m_wal_sync++;
> + }
>
> There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

Yes, I removed it.

> In the issue_xlog_fsync(), the comment says that if sync_method is
> SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced.
> Does that mean it synced when write the WAL data? And for those cases,
> we
> cannot get accurate write/sync timing and number of write/sync times,
> right?
>
> case SYNC_METHOD_OPEN:
> case SYNC_METHOD_OPEN_DSYNC:
> /* write synced it already */
> break;

Yes. I added the following notes to the documentation.

@@ -3515,6 +3515,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</para>
<para>
Total number of times WAL data was synced to disk
+ (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or
+ <literal>open_sync</literal>, this value is zero because WAL data is synced
+ when to write it).
</para></entry>
</row>

@@ -3525,7 +3528,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<para>
Total amount of time that has been spent in the portion of
WAL data was synced to disk, in milliseconds
- (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero)
+ (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+ if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or
+ <literal>open_sync</literal>, this value is zero too because WAL data is synced
+ when to write it).
</para></entry>
</row>

I attached a modified patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v5-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	16.2 KB

From:	"kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
To:	'Masahiro Ikeda' <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	japin <japinli(at)hotmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 01:34:53
Message-ID:	OSBPR01MB315762ADE2AE7B871C356235F5BD0@OSBPR01MB3157.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Dear Ikeda-san,

Thank you for updating the patch. It applies to master and
works on my RHEL7 machine.
wal_write_time and wal_sync_time increase normally :-).

I have one further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<structfield>wal_buffers_full</structfield> <type>bigint</type>
</para>
<para>
- Number of times WAL data was written to disk because WAL buffers became full
+ Total number of times WAL data was written to disk because WAL buffers became full
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_write</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of times WAL data was written to disk
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_write_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total amount of time that has been spent in the portion of
+ WAL data was written to disk, in milliseconds
+ (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_sync</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Total number of times WAL data was synced to disk
+ (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or
+ <literal>open_sync</literal>, this value is zero because WAL data is synced
+ when to write it).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_sync_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total amount of time that has been spent in the portion of
+ WAL data was synced to disk, in milliseconds
+ (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+ if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or
+ <literal>open_sync</literal>, this value is zero too because WAL data is synced
+ when to write it).
</para></entry>
</row>
```

Maybe "Total amount of time" should be used, not "Total number of time."
Other views use "amount."

I have no further comments.

Hayato Kuroda
FUJITSU LIMITED

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 01:36:31
Message-ID:	CAD21AoC+Z7-OAwr0RES+WgAvMmi3Uv-bh6u4W1cfS_=Z0hddqg@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
> On 2021-01-22 14:50, Masahiko Sawada wrote:
> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
> > <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> >>
> >> Hi,
> >>
> >> I rebased the patch to the master branch.
> >
> > Thank you for working on this. I've read the latest patch. Here are
> > comments:
> >
> > ---
> > + if (track_wal_io_timing)
> > + {
> > + INSTR_TIME_SET_CURRENT(duration);
> > + INSTR_TIME_SUBTRACT(duration, start);
> > + WalStats.m_wal_write_time +=
> > INSTR_TIME_GET_MILLISEC(duration);
> > + }
> >
> > * I think it should add the time in micro sec.
> > After running pgbench with track_wal_io_timing = on for 30 sec,
> > pg_stat_wal showed the following on my environment:
> >
> > postgres(1:61569)=# select * from pg_stat_wal;
> > -[ RECORD 1 ]----+-----------------------------
> > wal_records | 285947
> > wal_fpi | 53285
> > wal_bytes | 442008213
> > wal_buffers_full | 0
> > wal_write | 25516
> > wal_write_time | 0
> > wal_sync | 25437
> > wal_sync_time | 14490
> > stats_reset | 2021-01-22 10:56:13.29464+09
> >
> > Since writes can complete less than a millisecond, wal_write_time
> > didn't increase. I think sync_time could also have the same problem.
>
> Thanks for your comments. I didn't notice that.
> I changed the unit from milliseconds to microseconds.
>
> > ---
> > + /*
> > + * Measure i/o timing to fsync WAL data.
> > + *
> > + * The wal receiver skip to collect it to avoid performance
> > degradation of standy servers.
> > + * If sync_method doesn't have its fsync method, to skip too.
> > + */
> > + if (!AmWalReceiverProcess() && track_wal_io_timing &&
> > fsyncMethodCalled())
> > + INSTR_TIME_SET_CURRENT(start);
> >
> > * Why does only the wal receiver skip it even if track_wal_io_timinig
> > is true? I think the performance degradation is also true for backend
> > processes. If there is another reason for that, I think it's better to
> > mention in both the doc and comment.
> > * How about checking track_wal_io_timing first?
> > * s/standy/standby/
>
> I fixed it.
> As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code that has the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report them.

And maybe XLogWalRcvWrite() also needs to track I/O?

>
> > ---
> > + /* increment the i/o timing and the number of times to fsync WAL
> > data */
> > + if (fsyncMethodCalled())
> > + {
> > + if (!AmWalReceiverProcess() && track_wal_io_timing)
> > + {
> > + INSTR_TIME_SET_CURRENT(duration);
> > + INSTR_TIME_SUBTRACT(duration, start);
> > + WalStats.m_wal_sync_time +=
> > INSTR_TIME_GET_MILLISEC(duration);
> > + }
> > +
> > + WalStats.m_wal_sync++;
> > + }
> >
> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
> > incrementing m_wal_sync after each sync operation?
>
> I think if syncing the disk does not occur, m_wal_sync should not be
> incremented.
> It depends enableFsync and sync_method.
>
> enableFsync is checked in each fsync method like
> pg_fsync_no_writethrough(),
> so if incrementing m_wal_sync after each sync operation, it should be
> implemented
> in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

> So, why don't you change the function to a flag whether to
> sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
    switch (sync_method)
    {
        case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
        case SYNC_METHOD_FDATASYNC:
#endif
            WalStats.m_wal_sync++;
            if (track_wal_io_timing)
                INSTR_TIME_SET_CURRENT(start);
            break;
        default:
            break;
    }
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL data */
if (track_wal_io_timing)
{
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

I think we can change the first switch-case to an if statement.
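
As a rough illustration of that if-statement form, the check could be folded into a single predicate. This is only a sketch under stated assumptions: the enum below is a local stand-in for the SYNC_METHOD_* constants (the values are illustrative), the HAVE_* ifdefs are left out, and sync_method_issues_fsync() is a hypothetical helper name, not something in the patch.

```c
#include <stdbool.h>

/* Local stand-ins for the SYNC_METHOD_* constants; values are illustrative. */
enum
{
    SYNC_METHOD_FSYNC,
    SYNC_METHOD_FDATASYNC,
    SYNC_METHOD_OPEN,
    SYNC_METHOD_FSYNC_WRITETHROUGH,
    SYNC_METHOD_OPEN_DSYNC
};

/*
 * Hypothetical helper: returns true when a separate sync call would be
 * issued for this method. The open_* methods sync as part of the write,
 * so they are not counted here.
 */
static bool
sync_method_issues_fsync(bool enable_fsync, int sync_method)
{
    return enable_fsync &&
        (sync_method == SYNC_METHOD_FSYNC ||
         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
         sync_method == SYNC_METHOD_FDATASYNC);
}
```

With a predicate like this, the counter increment and the timing start could both hang off one condition instead of a fall-through switch.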

>
>
> > * As far as I read the code, issue_xlog_fsync() seems to do fsync even
> > if enableFsync is false. Why does the function return false in that
> > case? I might be missing something.
>
> IIUC, the reason is that I thought that each fsync functions like
> pg_fsync_no_writethrough() check enableFsync.
>
> If this code doesn't check, m_wal_sync_time may be incremented
> even though some sync methods like SYNC_METHOD_OPEN don't call to sync
> some data to the disk at the time.

Right.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

From:	japin <japinli(at)hotmail(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 02:47:21
Message-ID:	MEYP282MB1669CEB83A1BAB044C66CF5AB6BD0@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
Lists:	pgsql-hackers

On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>
>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>> > <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I rebased the patch to the master branch.
>> >
>> > Thank you for working on this. I've read the latest patch. Here are
>> > comments:
>> >
>> > ---
>> > + if (track_wal_io_timing)
>> > + {
>> > + INSTR_TIME_SET_CURRENT(duration);
>> > + INSTR_TIME_SUBTRACT(duration, start);
>> > + WalStats.m_wal_write_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > + }
>> >
>> > * I think it should add the time in micro sec.
>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>> > pg_stat_wal showed the following on my environment:
>> >
>> > postgres(1:61569)=# select * from pg_stat_wal;
>> > -[ RECORD 1 ]----+-----------------------------
>> > wal_records | 285947
>> > wal_fpi | 53285
>> > wal_bytes | 442008213
>> > wal_buffers_full | 0
>> > wal_write | 25516
>> > wal_write_time | 0
>> > wal_sync | 25437
>> > wal_sync_time | 14490
>> > stats_reset | 2021-01-22 10:56:13.29464+09
>> >
>> > Since writes can complete less than a millisecond, wal_write_time
>> > didn't increase. I think sync_time could also have the same problem.
>>
>> Thanks for your comments. I didn't notice that.
>> I changed the unit from milliseconds to microseconds.
>>
>> > ---
>> > + /*
>> > + * Measure i/o timing to fsync WAL data.
>> > + *
>> > + * The wal receiver skip to collect it to avoid performance
>> > degradation of standy servers.
>> > + * If sync_method doesn't have its fsync method, to skip too.
>> > + */
>> > + if (!AmWalReceiverProcess() && track_wal_io_timing &&
>> > fsyncMethodCalled())
>> > + INSTR_TIME_SET_CURRENT(start);
>> >
>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>> > is true? I think the performance degradation is also true for backend
>> > processes. If there is another reason for that, I think it's better to
>> > mention in both the doc and comment.
>> > * How about checking track_wal_io_timing first?
>> > * s/standy/standby/
>>
>> I fixed it.
>> As kuroda-san mentioned too, the skip is no need to be considered.
>
> I think you also removed the code to have the wal receiver report the
> stats. So with the latest patch, the wal receiver tracks those
> statistics but doesn't report.
>
> And maybe XLogWalRcvWrite() also needs to track I/O?
>
>>
>> > ---
>> > + /* increment the i/o timing and the number of times to fsync WAL
>> > data */
>> > + if (fsyncMethodCalled())
>> > + {
>> > + if (!AmWalReceiverProcess() && track_wal_io_timing)
>> > + {
>> > + INSTR_TIME_SET_CURRENT(duration);
>> > + INSTR_TIME_SUBTRACT(duration, start);
>> > + WalStats.m_wal_sync_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > + }
>> > +
>> > + WalStats.m_wal_sync++;
>> > + }
>> >
>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>> > incrementing m_wal_sync after each sync operation?
>>
>> I think if syncing the disk does not occur, m_wal_sync should not be
>> incremented.
>> It depends enableFsync and sync_method.
>>
>> enableFsync is checked in each fsync method like
>> pg_fsync_no_writethrough(),
>> so if incrementing m_wal_sync after each sync operation, it should be
>> implemented
>> in each fsync method. It leads to many duplicated codes.
>
> Right. I missed that each fsync function checks enableFsync.
>
>> So, why don't you change the function to a flag whether to
>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>
> Looks better. Since we don't necessarily need to increment m_wal_sync
> after doing fsync we can write the code without an additional variable
> as follows:
>
> if (enableFsync)
> {
> switch (sync_method)
> {
> case SYNC_METHOD_FSYNC:
> #ifdef HAVE_FSYNC_WRITETHROUGH
> case SYNC_METHOD_FSYNC_WRITETHROUGH:
> #endif
> #ifdef HAVE_FDATASYNC
> case SYNC_METHOD_FDATASYNC:
> #endif
> WalStats.m_wal_sync++;
> if (track_wal_io_timing)
> INSTR_TIME_SET_CURRENT(start);
> break;
> default:
> break;
> }
> }
>
> (do fsync and error handling here)
>
> /* increment the i/o timing and the number of times to fsync WAL data */
> if (track_wal_io_timing)
> {
> INSTR_TIME_SET_CURRENT(duration);
> INSTR_TIME_SUBTRACT(duration, start);
> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> }
>
> I think we can change the first switch-case to an if statement.
>

+1. We can also narrow the scope of "duration" to the "if (track_wal_io_timing)" branch.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	kuroda(dot)hayato(at)fujitsu(dot)com
Cc:	japin <japinli(at)hotmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 03:53:05
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-25 10:34, kuroda(dot)hayato(at)fujitsu(dot)com wrote:
> Dear Ikeda-san,
>
> Thank you for updating the patch. This can be applied to master, and
> can be used on my RHEL7.
> wal_write_time and wal_sync_time increase normally :-).
>
> ```
> postgres=# select * from pg_stat_wal;
> -[ RECORD 1 ]----+------------------------------
> wal_records | 121781
> wal_fpi | 2287
> wal_bytes | 36055146
> wal_buffers_full | 799
> wal_write | 12770
> wal_write_time | 4.469
> wal_sync | 11962
> wal_sync_time | 132.352
> stats_reset | 2021-01-25 00:51:40.674412+00
> ```

Thanks for checking.
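
For readers comparing the fractional values above with the wal_write_time of 0 reported earlier in the thread, here is a small sketch of why the unit matters. It assumes each measured duration is truncated to whole milliseconds before being added to an integer counter; the function names are hypothetical and this is not the patch's code.

```c
#include <stdint.h>

/* Sum per-call durations after truncating each one to whole milliseconds. */
static int64_t
total_truncated_ms(const int64_t *durations_us, int n)
{
    int64_t total = 0;

    for (int i = 0; i < n; i++)
        total += durations_us[i] / 1000;    /* sub-millisecond calls add 0 */
    return total;
}

/* Sum the raw microsecond durations without truncation. */
static int64_t
total_us(const int64_t *durations_us, int n)
{
    int64_t total = 0;

    for (int i = 0; i < n; i++)
        total += durations_us[i];
    return total;
}

/* Convert an accumulated microsecond total to milliseconds for display. */
static double
us_to_ms(int64_t us)
{
    return (double) us / 1000.0;
}
```

Under that assumption, any number of sub-millisecond writes adds up to zero in the first function, while the microsecond total keeps them and can still be shown in milliseconds.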

> I put a further comment:
>
> ```
> @@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM
> pg_stat_activity WHERE wait_event i
> <structfield>wal_buffers_full</structfield> <type>bigint</type>
> </para>
> <para>
> - Number of times WAL data was written to disk because WAL
> buffers became full
> + Total number of times WAL data was written to disk because WAL
> buffers became full
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para
> role="column_definition">
> + <structfield>wal_write</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Total number of times WAL data was written to disk
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para
> role="column_definition">
> + <structfield>wal_write_time</structfield> <type>double
> precision</type>
> + </para>
> + <para>
> + Total amount of time that has been spent in the portion of
> + WAL data was written to disk, in milliseconds
> + (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
> otherwise zero).
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para
> role="column_definition">
> + <structfield>wal_sync</structfield> <type>bigint</type>
> + </para>
> + <para>
> + Total number of times WAL data was synced to disk
> + (if <xref linkend="guc-wal-sync-method"/> is
> <literal>open_datasync</literal> or
> + <literal>open_sync</literal>, this value is zero because WAL
> data is synced
> + when to write it).
> + </para></entry>
> + </row>
> +
> + <row>
> + <entry role="catalog_table_entry"><para
> role="column_definition">
> + <structfield>wal_sync_time</structfield> <type>double
> precision</type>
> + </para>
> + <para>
> + Total amount of time that has been spent in the portion of
> + WAL data was synced to disk, in milliseconds
> + (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
> otherwise zero.
> + if <xref linkend="guc-wal-sync-method"/> is
> <literal>open_datasync</literal> or
> + <literal>open_sync</literal>, this value is zero too because
> WAL data is synced
> + when to write it).
> </para></entry>
> </row>
> ```
>
> Maybe "Total amount of time" should be used, not "Total number of
> time."
> Other views use "amount."

Thanks.

I checked the column descriptions of other views.
They use "Number of xxx", "Total number of xxx", "Total amount of time
that xxx" and "Total time spent xxx".

Since "time" in those descriptions refers to time spent, not a count,
I'll change it to "Total number of WAL data written/synced to disk".
Thoughts?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	"kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
To:	'Masahiro Ikeda' <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	japin <japinli(at)hotmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 04:09:44
Message-ID:	OSBPR01MB315773BE96CB4A716DF25C40F5BD0@OSBPR01MB3157.jpnprd01.prod.outlook.com
Lists:	pgsql-hackers

Dear Ikeda-san,

> I checked columns' descriptions of other views.
> There are "Number of xxx", "Total number of xxx", "Total amount of time
> that xxx" and "Total time spent xxx".

Right.

> Since the "time" is used for showing spending time, not count,
> I'll change it to "Total number of WAL data written/synced to disk".
> Thought?

I misread your patch, sorry. I prefer your suggestion.
Please fix the others in the same way.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 04:15:22
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-25 10:36, Masahiko Sawada wrote:
> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>
>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>> > <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I rebased the patch to the master branch.
>> >
>> > Thank you for working on this. I've read the latest patch. Here are
>> > comments:
>> >
>> > ---
>> > + if (track_wal_io_timing)
>> > + {
>> > + INSTR_TIME_SET_CURRENT(duration);
>> > + INSTR_TIME_SUBTRACT(duration, start);
>> > + WalStats.m_wal_write_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > + }
>> >
>> > * I think it should add the time in micro sec.
>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>> > pg_stat_wal showed the following on my environment:
>> >
>> > postgres(1:61569)=# select * from pg_stat_wal;
>> > -[ RECORD 1 ]----+-----------------------------
>> > wal_records | 285947
>> > wal_fpi | 53285
>> > wal_bytes | 442008213
>> > wal_buffers_full | 0
>> > wal_write | 25516
>> > wal_write_time | 0
>> > wal_sync | 25437
>> > wal_sync_time | 14490
>> > stats_reset | 2021-01-22 10:56:13.29464+09
>> >
>> > Since writes can complete less than a millisecond, wal_write_time
>> > didn't increase. I think sync_time could also have the same problem.
>>
>> Thanks for your comments. I didn't notice that.
>> I changed the unit from milliseconds to microseconds.
>>
>> > ---
>> > + /*
>> > + * Measure i/o timing to fsync WAL data.
>> > + *
>> > + * The wal receiver skip to collect it to avoid performance
>> > degradation of standy servers.
>> > + * If sync_method doesn't have its fsync method, to skip too.
>> > + */
>> > + if (!AmWalReceiverProcess() && track_wal_io_timing &&
>> > fsyncMethodCalled())
>> > + INSTR_TIME_SET_CURRENT(start);
>> >
>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>> > is true? I think the performance degradation is also true for backend
>> > processes. If there is another reason for that, I think it's better to
>> > mention in both the doc and comment.
>> > * How about checking track_wal_io_timing first?
>> > * s/standy/standby/
>>
>> I fixed it.
>> As kuroda-san mentioned too, the skip is no need to be considered.
>
> I think you also removed the code to have the wal receiver report the
> stats. So with the latest patch, the wal receiver tracks those
> statistics but doesn't report.
> And maybe XLogWalRcvWrite() also needs to track I/O?

Thanks, I forgot to add them.
I'll fix it.

>>
>> > ---
>> > + /* increment the i/o timing and the number of times to fsync WAL
>> > data */
>> > + if (fsyncMethodCalled())
>> > + {
>> > + if (!AmWalReceiverProcess() && track_wal_io_timing)
>> > + {
>> > + INSTR_TIME_SET_CURRENT(duration);
>> > + INSTR_TIME_SUBTRACT(duration, start);
>> > + WalStats.m_wal_sync_time +=
>> > INSTR_TIME_GET_MILLISEC(duration);
>> > + }
>> > +
>> > + WalStats.m_wal_sync++;
>> > + }
>> >
>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>> > incrementing m_wal_sync after each sync operation?
>>
>> I think if syncing the disk does not occur, m_wal_sync should not be
>> incremented.
>> It depends enableFsync and sync_method.
>>
>> enableFsync is checked in each fsync method like
>> pg_fsync_no_writethrough(),
>> so if incrementing m_wal_sync after each sync operation, it should be
>> implemented
>> in each fsync method. It leads to many duplicated codes.
>
> Right. I missed that each fsync function checks enableFsync.
>
>> So, why don't you change the function to a flag whether to
>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>
> Looks better. Since we don't necessarily need to increment m_wal_sync
> after doing fsync we can write the code without an additional variable
> as follows:
>
> if (enableFsync)
> {
> switch (sync_method)
> {
> case SYNC_METHOD_FSYNC:
> #ifdef HAVE_FSYNC_WRITETHROUGH
> case SYNC_METHOD_FSYNC_WRITETHROUGH:
> #endif
> #ifdef HAVE_FDATASYNC
> case SYNC_METHOD_FDATASYNC:
> #endif
> WalStats.m_wal_sync++;
> if (track_wal_io_timing)
> INSTR_TIME_SET_CURRENT(start);
> break;
> default:
> break;
> }
> }
>
> (do fsync and error handling here)
>
> /* increment the i/o timing and the number of times to fsync WAL
> data */
> if (track_wal_io_timing)
> {
> INSTR_TIME_SET_CURRENT(duration);
> INSTR_TIME_SUBTRACT(duration, start);
> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> }

IIUC, this code can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" is never initialized, but it
is still used in the second block.

Or am I misunderstanding something?

> I think we can change the first switch-case to an if statement.

Yes, I'll change it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	japin <japinli(at)hotmail(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 04:22:13
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-25 11:47, japin wrote:
> On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> wrote:
>> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>
>>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>>> > <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I rebased the patch to the master branch.
>>> >
>>> > Thank you for working on this. I've read the latest patch. Here are
>>> > comments:
>>> >
>>> > ---
>>> > + if (track_wal_io_timing)
>>> > + {
>>> > + INSTR_TIME_SET_CURRENT(duration);
>>> > + INSTR_TIME_SUBTRACT(duration, start);
>>> > + WalStats.m_wal_write_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > + }
>>> >
>>> > * I think it should add the time in micro sec.
>>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>>> > pg_stat_wal showed the following on my environment:
>>> >
>>> > postgres(1:61569)=# select * from pg_stat_wal;
>>> > -[ RECORD 1 ]----+-----------------------------
>>> > wal_records | 285947
>>> > wal_fpi | 53285
>>> > wal_bytes | 442008213
>>> > wal_buffers_full | 0
>>> > wal_write | 25516
>>> > wal_write_time | 0
>>> > wal_sync | 25437
>>> > wal_sync_time | 14490
>>> > stats_reset | 2021-01-22 10:56:13.29464+09
>>> >
>>> > Since writes can complete less than a millisecond, wal_write_time
>>> > didn't increase. I think sync_time could also have the same problem.
>>>
>>> Thanks for your comments. I didn't notice that.
>>> I changed the unit from milliseconds to microseconds.
>>>
>>> > ---
>>> > + /*
>>> > + * Measure i/o timing to fsync WAL data.
>>> > + *
>>> > + * The wal receiver skip to collect it to avoid performance
>>> > degradation of standy servers.
>>> > + * If sync_method doesn't have its fsync method, to skip too.
>>> > + */
>>> > + if (!AmWalReceiverProcess() && track_wal_io_timing &&
>>> > fsyncMethodCalled())
>>> > + INSTR_TIME_SET_CURRENT(start);
>>> >
>>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>>> > is true? I think the performance degradation is also true for backend
>>> > processes. If there is another reason for that, I think it's better to
>>> > mention in both the doc and comment.
>>> > * How about checking track_wal_io_timing first?
>>> > * s/standy/standby/
>>>
>>> I fixed it.
>>> As kuroda-san mentioned too, the skip is no need to be considered.
>>
>> I think you also removed the code to have the wal receiver report the
>> stats. So with the latest patch, the wal receiver tracks those
>> statistics but doesn't report.
>>
>> And maybe XLogWalRcvWrite() also needs to track I/O?
>>
>>>
>>> > ---
>>> > + /* increment the i/o timing and the number of times to fsync WAL
>>> > data */
>>> > + if (fsyncMethodCalled())
>>> > + {
>>> > + if (!AmWalReceiverProcess() && track_wal_io_timing)
>>> > + {
>>> > + INSTR_TIME_SET_CURRENT(duration);
>>> > + INSTR_TIME_SUBTRACT(duration, start);
>>> > + WalStats.m_wal_sync_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > + }
>>> > +
>>> > + WalStats.m_wal_sync++;
>>> > + }
>>> >
>>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>>> > incrementing m_wal_sync after each sync operation?
>>>
>>> I think if syncing the disk does not occur, m_wal_sync should not be
>>> incremented.
>>> It depends enableFsync and sync_method.
>>>
>>> enableFsync is checked in each fsync method like
>>> pg_fsync_no_writethrough(),
>>> so if incrementing m_wal_sync after each sync operation, it should be
>>> implemented
>>> in each fsync method. It leads to many duplicated codes.
>>
>> Right. I missed that each fsync function checks enableFsync.
>>
>>> So, why don't you change the function to a flag whether to
>>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>>
>> Looks better. Since we don't necessarily need to increment m_wal_sync
>> after doing fsync we can write the code without an additional variable
>> as follows:
>>
>> if (enableFsync)
>> {
>> switch (sync_method)
>> {
>> case SYNC_METHOD_FSYNC:
>> #ifdef HAVE_FSYNC_WRITETHROUGH
>> case SYNC_METHOD_FSYNC_WRITETHROUGH:
>> #endif
>> #ifdef HAVE_FDATASYNC
>> case SYNC_METHOD_FDATASYNC:
>> #endif
>> WalStats.m_wal_sync++;
>> if (track_wal_io_timing)
>> INSTR_TIME_SET_CURRENT(start);
>> break;
>> default:
>> break;
>> }
>> }
>>
>> (do fsync and error handling here)
>>
>> /* increment the i/o timing and the number of times to fsync WAL
>> data */
>> if (track_wal_io_timing)
>> {
>> INSTR_TIME_SET_CURRENT(duration);
>> INSTR_TIME_SUBTRACT(duration, start);
>> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>> }
>>
>> I think we can change the first switch-case to an if statement.
>>
>
> +1. We can also narrow the scope of "duration" into "if
> (track_wal_io_timing)" branch.

Thanks, I'll change it.

--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 04:28:20
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-25 13:15, Masahiro Ikeda wrote:
> On 2021-01-25 10:36, Masahiko Sawada wrote:
>> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>
>>> On 2021-01-22 14:50, Masahiko Sawada wrote:
>>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
>>> > <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I rebased the patch to the master branch.
>>> >
>>> > Thank you for working on this. I've read the latest patch. Here are
>>> > comments:
>>> >
>>> > ---
>>> > + if (track_wal_io_timing)
>>> > + {
>>> > + INSTR_TIME_SET_CURRENT(duration);
>>> > + INSTR_TIME_SUBTRACT(duration, start);
>>> > + WalStats.m_wal_write_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > + }
>>> >
>>> > * I think it should add the time in micro sec.
>>> > After running pgbench with track_wal_io_timing = on for 30 sec,
>>> > pg_stat_wal showed the following on my environment:
>>> >
>>> > postgres(1:61569)=# select * from pg_stat_wal;
>>> > -[ RECORD 1 ]----+-----------------------------
>>> > wal_records | 285947
>>> > wal_fpi | 53285
>>> > wal_bytes | 442008213
>>> > wal_buffers_full | 0
>>> > wal_write | 25516
>>> > wal_write_time | 0
>>> > wal_sync | 25437
>>> > wal_sync_time | 14490
>>> > stats_reset | 2021-01-22 10:56:13.29464+09
>>> >
>>> > Since writes can complete less than a millisecond, wal_write_time
>>> > didn't increase. I think sync_time could also have the same problem.
>>>
>>> Thanks for your comments. I didn't notice that.
>>> I changed the unit from milliseconds to microseconds.
>>>
>>> > ---
>>> > + /*
>>> > + * Measure i/o timing to fsync WAL data.
>>> > + *
>>> > + * The wal receiver skip to collect it to avoid performance
>>> > degradation of standy servers.
>>> > + * If sync_method doesn't have its fsync method, to skip too.
>>> > + */
>>> > + if (!AmWalReceiverProcess() && track_wal_io_timing &&
>>> > fsyncMethodCalled())
>>> > + INSTR_TIME_SET_CURRENT(start);
>>> >
>>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
>>> > is true? I think the performance degradation is also true for backend
>>> > processes. If there is another reason for that, I think it's better to
>>> > mention in both the doc and comment.
>>> > * How about checking track_wal_io_timing first?
>>> > * s/standy/standby/
>>>
>>> I fixed it.
>>> As kuroda-san mentioned too, the skip is no need to be considered.
>>
>> I think you also removed the code to have the wal receiver report the
>> stats. So with the latest patch, the wal receiver tracks those
>> statistics but doesn't report.
>> And maybe XLogWalRcvWrite() also needs to track I/O?
>
> Thanks, I forgot to add them.
> I'll fix it.
>
>
>>>
>>> > ---
>>> > + /* increment the i/o timing and the number of times to fsync WAL
>>> > data */
>>> > + if (fsyncMethodCalled())
>>> > + {
>>> > + if (!AmWalReceiverProcess() && track_wal_io_timing)
>>> > + {
>>> > + INSTR_TIME_SET_CURRENT(duration);
>>> > + INSTR_TIME_SUBTRACT(duration, start);
>>> > + WalStats.m_wal_sync_time +=
>>> > INSTR_TIME_GET_MILLISEC(duration);
>>> > + }
>>> > +
>>> > + WalStats.m_wal_sync++;
>>> > + }
>>> >
>>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
>>> > incrementing m_wal_sync after each sync operation?
>>>
>>> I think if syncing the disk does not occur, m_wal_sync should not be
>>> incremented.
>>> It depends enableFsync and sync_method.
>>>
>>> enableFsync is checked in each fsync method like
>>> pg_fsync_no_writethrough(),
>>> so if incrementing m_wal_sync after each sync operation, it should be
>>> implemented
>>> in each fsync method. It leads to many duplicated codes.
>>
>> Right. I missed that each fsync function checks enableFsync.
>>
>>> So, why don't you change the function to a flag whether to
>>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
>>
>> Looks better. Since we don't necessarily need to increment m_wal_sync
>> after doing fsync we can write the code without an additional variable
>> as follows:
>>
>> if (enableFsync)
>> {
>> switch (sync_method)
>> {
>> case SYNC_METHOD_FSYNC:
>> #ifdef HAVE_FSYNC_WRITETHROUGH
>> case SYNC_METHOD_FSYNC_WRITETHROUGH:
>> #endif
>> #ifdef HAVE_FDATASYNC
>> case SYNC_METHOD_FDATASYNC:
>> #endif
>> WalStats.m_wal_sync++;
>> if (track_wal_io_timing)
>> INSTR_TIME_SET_CURRENT(start);
>> break;
>> default:
>> break;
>> }
>> }
>>
>> (do fsync and error handling here)
>>
>> /* increment the i/o timing and the number of times to fsync WAL
>> data */
>> if (track_wal_io_timing)
>> {
>> INSTR_TIME_SET_CURRENT(duration);
>> INSTR_TIME_SUBTRACT(duration, start);
>> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>> }
>
> IIUC, I think we can't handle the following case.
>
> When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
> "track_wal_io_timing" is enabled, "start" doesn't be initialized.
>
> My understanding is something wrong, isn't it?

I think the following would be better.

```
/* Measure I/O timing to sync WAL data. */
if (track_wal_io_timing)
    INSTR_TIME_SET_CURRENT(start);

(do fsync and error handling here)

/* Check whether WAL data is actually synced to the disk at this point. */
if (enableFsync)
{
    if ((sync_method == SYNC_METHOD_FSYNC) ||
        (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
        (sync_method == SYNC_METHOD_FDATASYNC))
    {
        /* Increment the I/O timing and the number of times WAL data was fsynced. */
        if (track_wal_io_timing)
        {
            instr_time  duration;

            INSTR_TIME_SET_CURRENT(duration);
            INSTR_TIME_SUBTRACT(duration, start);
            /* Accumulate rather than overwrite, so earlier syncs are kept. */
            WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
        }
        WalStats.m_wal_sync++;
    }
}
```

With this, INSTR_TIME_SET_CURRENT(start) is called every time, regardless
of "sync_method" and "enableFsync", but we don't need any additional
variables.
I think that's OK because enabling "track_wal_io_timing" already causes
some performance degradation.

What do you think?
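
To make the intended bookkeeping easier to check in isolation, here is a self-contained sketch of the same flow. Everything in it is a stand-in: WalSyncStats and record_sync() are hypothetical names, the two timestamps play the role of the INSTR_TIME_SET_CURRENT() readings, and does_sync represents the "enableFsync plus a separate-fsync sync_method" check. It accumulates with +=, on the assumption that the sync time is meant to be a running total.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant WalStats fields. */
typedef struct
{
    int64_t wal_sync;           /* number of separate sync calls */
    int64_t wal_sync_time_us;   /* accumulated sync time, in microseconds */
} WalSyncStats;

/*
 * Record one pass through the sync path. start_us is always taken when
 * timing is tracked, but it only contributes if a sync really happened.
 */
static void
record_sync(WalSyncStats *stats, bool track_timing, bool does_sync,
            int64_t start_us, int64_t end_us)
{
    if (!does_sync)
        return;                 /* e.g. open_sync, open_datasync, or fsync off */

    if (track_timing)
        stats->wal_sync_time_us += end_us - start_us;   /* running total */
    stats->wal_sync++;
}
```

In this shape the start reading is never consumed unless it was taken, which sidesteps the uninitialized "start" case, and the count still advances when timing is off.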

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 04:58:01
Message-ID:	CAD21AoAMxDwey2v5tVF-3Xhu2VNUXA-9CUuZ1ZmKS3__nceJTw@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jan 25, 2021 at 1:28 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
> On 2021-01-25 13:15, Masahiro Ikeda wrote:
> > On 2021-01-25 10:36, Masahiko Sawada wrote:
> >> On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
> >> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> >>>
> >>> On 2021-01-22 14:50, Masahiko Sawada wrote:
> >>> > On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
> >>> > <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> >>> >>
> >>> >> Hi,
> >>> >>
> >>> >> I rebased the patch to the master branch.
> >>> >
> >>> > Thank you for working on this. I've read the latest patch. Here are
> >>> > comments:
> >>> >
> >>> > ---
> >>> > + if (track_wal_io_timing)
> >>> > + {
> >>> > + INSTR_TIME_SET_CURRENT(duration);
> >>> > + INSTR_TIME_SUBTRACT(duration, start);
> >>> > + WalStats.m_wal_write_time +=
> >>> > INSTR_TIME_GET_MILLISEC(duration);
> >>> > + }
> >>> >
> >>> > * I think it should add the time in micro sec.
> >>> > After running pgbench with track_wal_io_timing = on for 30 sec,
> >>> > pg_stat_wal showed the following on my environment:
> >>> >
> >>> > postgres(1:61569)=# select * from pg_stat_wal;
> >>> > -[ RECORD 1 ]----+-----------------------------
> >>> > wal_records | 285947
> >>> > wal_fpi | 53285
> >>> > wal_bytes | 442008213
> >>> > wal_buffers_full | 0
> >>> > wal_write | 25516
> >>> > wal_write_time | 0
> >>> > wal_sync | 25437
> >>> > wal_sync_time | 14490
> >>> > stats_reset | 2021-01-22 10:56:13.29464+09
> >>> >
> >>> > Since writes can complete less than a millisecond, wal_write_time
> >>> > didn't increase. I think sync_time could also have the same problem.
> >>>
> >>> Thanks for your comments. I didn't notice that.
> >>> I changed the unit from milliseconds to microseconds.
> >>>
> >>> > ---
> >>> > + /*
> >>> > + * Measure i/o timing to fsync WAL data.
> >>> > + *
> >>> > + * The wal receiver skip to collect it to avoid performance
> >>> > degradation of standy servers.
> >>> > + * If sync_method doesn't have its fsync method, to skip too.
> >>> > + */
> >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing &&
> >>> > fsyncMethodCalled())
> >>> > + INSTR_TIME_SET_CURRENT(start);
> >>> >
> >>> > * Why does only the wal receiver skip it even if track_wal_io_timinig
> >>> > is true? I think the performance degradation is also true for backend
> >>> > processes. If there is another reason for that, I think it's better to
> >>> > mention in both the doc and comment.
> >>> > * How about checking track_wal_io_timing first?
> >>> > * s/standy/standby/
> >>>
> >>> I fixed it.
> >>> As kuroda-san mentioned too, the skip is no need to be considered.
> >>
> >> I think you also removed the code to have the wal receiver report the
> >> stats. So with the latest patch, the wal receiver tracks those
> >> statistics but doesn't report.
> >> And maybe XLogWalRcvWrite() also needs to track I/O?
> >
> > Thanks, I forgot to add them.
> > I'll fix it.
> >
> >
> >>>
> >>> > ---
> >>> > + /* increment the i/o timing and the number of times to fsync WAL
> >>> > data */
> >>> > + if (fsyncMethodCalled())
> >>> > + {
> >>> > + if (!AmWalReceiverProcess() && track_wal_io_timing)
> >>> > + {
> >>> > + INSTR_TIME_SET_CURRENT(duration);
> >>> > + INSTR_TIME_SUBTRACT(duration, start);
> >>> > + WalStats.m_wal_sync_time +=
> >>> > INSTR_TIME_GET_MILLISEC(duration);
> >>> > + }
> >>> > +
> >>> > + WalStats.m_wal_sync++;
> >>> > + }
> >>> >
> >>> > * I'd avoid always calling fsyncMethodCalled() in this path. How about
> >>> > incrementing m_wal_sync after each sync operation?
> >>>
> >>> I think if syncing the disk does not occur, m_wal_sync should not be
> >>> incremented.
> >>> It depends enableFsync and sync_method.
> >>>
> >>> enableFsync is checked in each fsync method like
> >>> pg_fsync_no_writethrough(),
> >>> so if incrementing m_wal_sync after each sync operation, it should be
> >>> implemented
> >>> in each fsync method. It leads to many duplicated codes.
> >>
> >> Right. I missed that each fsync function checks enableFsync.
> >>
> >>> So, why don't you change the function to a flag whether to
> >>> sync data to the disk will be occurred or not in issue_xlog_fsync()?
> >>
> >> Looks better. Since we don't necessarily need to increment m_wal_sync
> >> after doing fsync we can write the code without an additional variable
> >> as follows:
> >>
> >> if (enableFsync)
> >> {
> >> switch (sync_method)
> >> {
> >> case SYNC_METHOD_FSYNC:
> >> #ifdef HAVE_FSYNC_WRITETHROUGH
> >> case SYNC_METHOD_FSYNC_WRITETHROUGH:
> >> #endif
> >> #ifdef HAVE_FDATASYNC
> >> case SYNC_METHOD_FDATASYNC:
> >> #endif
> >> WalStats.m_wal_sync++;
> >> if (track_wal_io_timing)
> >> INSTR_TIME_SET_CURRENT(start);
> >> break;
> >> default:
> >> break;
> >> }
> >> }
> >>
> >> (do fsync and error handling here)
> >>
> >> /* increment the i/o timing and the number of times to fsync WAL
> >> data */
> >> if (track_wal_io_timing)
> >> {
> >> INSTR_TIME_SET_CURRENT(duration);
> >> INSTR_TIME_SUBTRACT(duration, start);
> >> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> >> }
> >
> > IIUC, I think we can't handle the following case.
> >
> > When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
> > "track_wal_io_timing" is enabled, "start" doesn't be initialized.
> >
> > My understanding is something wrong, isn't it?

You're right. We might want to initialize 'start' with 0 in those two
cases and check if INSTR_TIME_IS_ZERO() later when accumulating the
I/O time.
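
A minimal standalone sketch of the zero-initialization idea above, not the
patch itself. It uses clock_gettime() and struct timespec in place of
PostgreSQL's instr_time macros; demo_issue_sync, demo_wal_stats and the
demo_sync_method enum are hypothetical names introduced only for
illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for instr_time (assumption: not the real PostgreSQL type). */
typedef struct timespec demo_time;

typedef enum
{
    DEMO_SYNC_FSYNC,
    DEMO_SYNC_FDATASYNC,
    DEMO_SYNC_OPEN,
    DEMO_SYNC_OPEN_DSYNC
} demo_sync_method;

typedef struct
{
    int64_t sync_count;     /* number of explicit sync calls */
    int64_t sync_time_us;   /* accumulated sync time, microseconds */
} demo_wal_stats;

static bool
demo_time_is_zero(demo_time t)
{
    return t.tv_sec == 0 && t.tv_nsec == 0;
}

static int64_t
demo_elapsed_us(demo_time start, demo_time end)
{
    int64_t ns = (int64_t) (end.tv_sec - start.tv_sec) * 1000000000LL
        + (int64_t) (end.tv_nsec - start.tv_nsec);

    return ns / 1000;
}

/* Hypothetical analogue of the timing logic in issue_xlog_fsync(). */
static void
demo_issue_sync(demo_wal_stats *stats, bool enable_fsync,
                bool track_timing, demo_sync_method method)
{
    demo_time start = {0, 0};   /* zero means "nothing is being timed" */

    assert(stats != NULL);

    /* Only these methods issue an explicit sync call at this point. */
    if (enable_fsync &&
        (method == DEMO_SYNC_FSYNC || method == DEMO_SYNC_FDATASYNC))
    {
        stats->sync_count++;
        if (track_timing)
            clock_gettime(CLOCK_MONOTONIC, &start);
    }

    /* (the sync call and its error handling would go here) */

    /* Accumulate only if a start time was actually taken. */
    if (!demo_time_is_zero(start))
    {
        demo_time end;

        clock_gettime(CLOCK_MONOTONIC, &end);
        stats->sync_time_us += demo_elapsed_us(start, end);
    }
}
```

With the open/open_dsync methods, 'start' stays zero, so neither the counter
nor the timer changes. The sketch assumes the monotonic clock never reads
exactly zero.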

>
> I thought the following is better.
>
>
> ```
> /* Measure i/o timing to sync WAL data.*/
> if (track_wal_io_timing)
> INSTR_TIME_SET_CURRENT(start);
>
> (do fsync and error handling here)
>
> /* check whether to sync WAL data to the disk right now. */
> if (enableFsync)
> {
> if ((sync_method == SYNC_METHOD_FSYNC) ||
> (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
> (sync_method == SYNC_METHOD_FDATASYNC))
> {
> /* increment the i/o timing and the number of times to fsync WAL data
> */
> if (track_wal_io_timing)
> {
> instr_time duration;
>
> INSTR_TIME_SET_CURRENT(duration);
> INSTR_TIME_SUBTRACT(duration, start);
> WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
> }
> WalStats.m_wal_sync++;
> }
> }
> ```
>
> Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
> of the "sync_method" and "enableFsync", we don't make additional
> variables.
> But it's ok because "track_wal_io_timing" leads already performance
> degradation.
>
> What do you think?

That's also fine with me.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Li Japin <japinli(at)hotmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 07:51:31
Message-ID:	[email protected]
Lists:	pgsql-hackers

Hi, thanks for the reviews.

I updated the attached patch.
A summary of the changes follows.

1. fix document

I followed another view's comments.

2. refactor issue_xlog_fsync()

I removed the "sync_called" variable, narrowed the scope of "duration", and
changed the switch statement to an if statement.

3. make wal-receiver report WAL statistics

I added code to collect the statistics for write operations
in XLogWalRcvWrite() and to report them in WalReceiverMain().

Since WalReceiverMain() can loop quickly, I added a "force" argument to
pgstat_send_wal() to avoid overloading the stats collector. If "force" is
false, reporting is skipped until at least 500 msec have passed since the
last report. WalReceiverMain() almost always calls pgstat_send_wal() with
"force" set to false.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v5_v6.diff	text/x-diff	13.1 KB
v6-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	19.8 KB

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 15:03:01
Message-ID:	CAD21AoC0B_miiA5rkk-yAac8_q6C4aVmNWFOHhE6vwNncma0Ng@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
> Hi, thanks for the reviews.
>
> I updated the attached patch.

Thank you for updating the patch!

> The summary of the changes is following.
>
> 1. fix document
>
> I followed another view's comments.
>
>
> 2. refactor issue_xlog_fsync()
>
> I removed "sync_called" variables, narrowed the "duration" scope and
> change the switch statement to if statement.

Looking at the code again, I think if we check if an fsync was really
called when calculating the I/O time, it's better to check that before
starting the measurement.

bool issue_fsync = false;

if (enableFsync &&
(sync_method == SYNC_METHOD_FSYNC ||
sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
sync_method == SYNC_METHOD_FDATASYNC))
{
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
issue_fsync = true;
}
(snip)
if (issue_fsync)
{
if (track_wal_io_timing)
{
instr_time duration;

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
WalStats.m_wal_sync++;
}

So I prefer either the above which is a modified version of the
original approach or my idea that doesn’t introduce a new local
variable I proposed before. But I'm not going to insist on that.

>
>
> 3. make wal-receiver report WAL statistics
>
> I add the code to collect the statistics for a written operation
> in XLogWalRcvWrite() and to report stats in WalReceiverMain().
>
> Since WalReceiverMain() can loop fast, to avoid loading stats collector,
> I add "force" argument to the pgstat_send_wal function. If "force" is
> false, it can skip reporting until at least 500 msec since it last
> reported. WalReceiverMain() almost calls pgstat_send_wal() with "force"
> as false.

void
-pgstat_send_wal(void)
+pgstat_send_wal(bool force)
{
/* We assume this initializes to zeroes */
static const PgStat_MsgWal all_zeroes;
+ static TimestampTz last_report = 0;

+ TimestampTz now;
WalUsage walusage;

+ /*
+ * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+ * msec since we last sent one or specified "force".
+ */
+ now = GetCurrentTimestamp();
+ if (!force &&
+ !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+ return;
+
+ last_report = now;

Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this
purpose since it is used as a minimum time for stats file updates. If
we want an interval, I think we should define another one. Also, with
the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time
even if track_wal_io_timing is off, which is not good. On the other
hand, I agree with your concern that the wal receiver should not send
the stats every time it receives WAL records. So an idea could be to
send the wal stats when finishing the current WAL segment file and
when the main loop times out. That way we can guarantee that the wal
stats on a replica are updated at least every time a WAL segment file
is finished while actively receiving WAL records, and every
NAPTIME_PER_CYCLE in other cases.
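
The reporting policy suggested above could be sketched roughly as follows.
This is a toy model, not walreceiver code: demo_receiver, demo_send_stats,
demo_on_wal_received and demo_on_wait_timeout are hypothetical names, and the
"skip when nothing is pending" check is an assumption about how an empty
report would be handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    int64_t pending_writes; /* writes counted since the last report */
    int64_t reports_sent;   /* messages sent to the stats collector */
} demo_receiver;

/* Send the pending counters, skipping the message if there is nothing new. */
static void
demo_send_stats(demo_receiver *r)
{
    assert(r != NULL);
    if (r->pending_writes == 0)
        return;
    r->reports_sent++;
    r->pending_writes = 0;
}

/* Called for each chunk of WAL written by the receiver. */
static void
demo_on_wal_received(demo_receiver *r, bool finished_segment)
{
    assert(r != NULL);
    r->pending_writes++;

    /* Report only at a segment boundary, not on every write. */
    if (finished_segment)
        demo_send_stats(r);
}

/* Called when the main loop's wait times out (the idle case). */
static void
demo_on_wait_timeout(demo_receiver *r)
{
    demo_send_stats(r);
}
```

Under this model a busy receiver reports once per finished segment, and an
idle one reports whatever is pending at the next timeout, without calling
GetCurrentTimestamp() on every write.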

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 23:37:36
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-26 00:03, Masahiko Sawada wrote:
> On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>
>> Hi, thanks for the reviews.
>>
>> I updated the attached patch.
>
> Thank you for updating the patch!
>
>> The summary of the changes is following.
>>
>> 1. fix document
>>
>> I followed another view's comments.
>>
>>
>> 2. refactor issue_xlog_fsync()
>>
>> I removed "sync_called" variables, narrowed the "duration" scope and
>> change the switch statement to if statement.
>
> Looking at the code again, I think if we check if an fsync was really
> called when calculating the I/O time, it's better to check that before
> starting the measurement.
>
> bool issue_fsync = false;
>
> if (enableFsync &&
> (sync_method == SYNC_METHOD_FSYNC ||
> sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
> sync_method == SYNC_METHOD_FDATASYNC))
> {
> if (track_wal_io_timing)
> INSTR_TIME_SET_CURRENT(start);
> issue_fsync = true;
> }
> (snip)
> if (issue_fsync)
> {
> if (track_wal_io_timing)
> {
> instr_time duration;
>
> INSTR_TIME_SET_CURRENT(duration);
> INSTR_TIME_SUBTRACT(duration, start);
> WalStats.m_wal_sync_time =
> INSTR_TIME_GET_MICROSEC(duration);
> }
> WalStats.m_wal_sync++;
> }
>
> So I prefer either the above which is a modified version of the
> original approach or my idea that doesn’t introduce a new local
> variable I proposed before. But I'm not going to insist on that.

Thanks for the comments.
I changed the code to match the above.

>>
>>
>> 3. make wal-receiver report WAL statistics
>>
>> I add the code to collect the statistics for a written operation
>> in XLogWalRcvWrite() and to report stats in WalReceiverMain().
>>
>> Since WalReceiverMain() can loop fast, to avoid loading stats
>> collector,
>> I add "force" argument to the pgstat_send_wal function. If "force" is
>> false, it can skip reporting until at least 500 msec since it last
>> reported. WalReceiverMain() almost calls pgstat_send_wal() with
>> "force"
>> as false.
>
> void
> -pgstat_send_wal(void)
> +pgstat_send_wal(bool force)
> {
> /* We assume this initializes to zeroes */
> static const PgStat_MsgWal all_zeroes;
> + static TimestampTz last_report = 0;
>
> + TimestampTz now;
> WalUsage walusage;
>
> + /*
> + * Don't send a message unless it's been at least
> PGSTAT_STAT_INTERVAL
> + * msec since we last sent one or specified "force".
> + */
> + now = GetCurrentTimestamp();
> + if (!force &&
> + !TimestampDifferenceExceeds(last_report, now,
> PGSTAT_STAT_INTERVAL))
> + return;
> +
> + last_report = now;
>
> Hmm, I don’t think it's good to use PGSTAT_STAT_INTERVAL for this
> purpose since it is used as a minimum time for stats file updates. If
> we want an interval, I think we should define another one Also, with
> the patch, pgstat_send_wal() calls GetCurrentTimestamp() every time
> even if track_wal_io_timing is off, which is not good. On the other
> hand, I agree that your concern that the wal receiver should not send
> the stats for whenever receiving wal records. So an idea could be to
> send the wal stats when finishing the current WAL segment file and
> when timeout in the main loop. That way we can guarantee that the wal
> stats on a replica is updated at least every time finishing a WAL
> segment file when actively receiving WAL records and every
> NAPTIME_PER_CYCLE in other cases.

I agree with your comments. I think it should also report when
reaching the end of WAL. I added code to report the stats
when the current WAL segment file is finished, when the main
loop times out, and when the end of WAL is reached.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v6_v7.diff	text/x-diff	11.8 KB
v7-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	18.8 KB

From:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 23:48:09
Message-ID:	CAKFQuwa7xDBXwk2nr2=+_f+=2YcwUaXC2eU7eCpokipCEBJe7A@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:

> On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
> wrote:
> >
> > Hi, thanks for the reviews.
> >
> > I updated the attached patch.
>
> Thank you for updating the patch!
>

Your original email with "total number of times" is more correct, removing
the "of times" and just writing "total number of WAL" is not good wording.

Specifically, this change is strictly worse than the original.

- Number of times WAL data was written to disk because WAL buffers
became full
+ Total number of WAL data written to disk because WAL buffers became
full

Both have the flaw that they leave implied exactly what it means to "write
WAL to disk". It is also unclear whether a counter, bytes, or both, would
be more useful here. I've incorporated this into my documentation
suggestions below:

(wal_buffers_full)
-- Revert - the original was better, though maybe add more detail similar
to the below. I didn't research exactly how this works.

(wal_write)
The number of times WAL buffers were written out to disk via XLogWrite

-- Seems like this should have a bytes version too

(wal_write_time)
The amount of time spent writing WAL buffers to disk, excluding sync time
unless the wal_sync_method is either open_datasync or open_sync.
Units are in milliseconds with microsecond resolution. This is zero when
track_wal_io_timing is disabled.

(wal_sync)
The number of times WAL files were synced to disk while wal_sync_method was
set to one of the "sync at commit" options (i.e., fdatasync, fsync,
or fsync_writethrough).

-- it is not going to be zero just because those settings are presently
disabled as they could have been enabled at some point since the last time
these statistics were reset.

(wal_sync_time)
The amount of time spent syncing WAL files to disk, in milliseconds with
microsecond resolution. This requires setting wal_sync_method to one of
the "sync at commit" options (i.e., fdatasync, fsync,
or fsync_writethrough).
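
To illustrate why "milliseconds with microsecond resolution" matters, here is
a small sketch under stated assumptions: the counter is a 64-bit integer, and
each write takes well under a millisecond, as in the pgbench run reported
earlier in the thread where wal_write_time stayed at 0. The function names
are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Add a fractional millisecond duration to an integer counter on every
 * write. The conversion back to int64_t truncates, so durations below
 * 1 ms never register.
 */
static int64_t
demo_accumulate_ms(int writes, double per_write_ms)
{
    int64_t total_ms = 0;

    assert(writes >= 0);
    for (int i = 0; i < writes; i++)
        total_ms = (int64_t) ((double) total_ms + per_write_ms);
    return total_ms;
}

/* Accumulate whole microseconds instead; nothing is lost per write. */
static int64_t
demo_accumulate_us(int writes, int64_t per_write_us)
{
    int64_t total_us = 0;

    assert(writes >= 0);
    for (int i = 0; i < writes; i++)
        total_us += per_write_us;
    return total_us;
}

/* Convert a microsecond counter to milliseconds for display. */
static double
demo_us_to_ms(int64_t total_us)
{
    return (double) total_us / 1000.0;
}
```

For 1000 writes of 0.3 ms each, the millisecond counter stays at 0 while the
microsecond counter reaches 300000, which is displayed as 300 ms.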

Also,

I would suggest extracting the changes to postmaster/pgstat.c and
replication/walreceiver.c to a separate patch as you've fundamentally
changed how it behaves with regards to that function and how it interacts
with the WAL receiver. That seems an entirely separate topic warranting
its own patch and discussion.

David J.

From:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-25 23:52:44
Message-ID:	CAKFQuwbUmjZwkHREx84U7nZmS008+BUcfwAAAtpaEsB=h6CdYA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
wrote:

>
> I agree with your comments. I think it should report when
> reaching the end of WAL too. I add the code to report the stats
> when finishing the current WAL segment file when timeout in the
> main loop and when reaching the end of WAL.
>
>
The following is not an improvement:

- /* Send WAL statistics to the stats collector. */
+ /* Send WAL statistics to stats collector */

The word "the" there makes it proper English. Your copy-pasting should
have kept the existing good wording in the other locations rather than
replace the existing location with the newly added incorrect wording.

This doesn't make sense:

* current WAL segment file to avoid loading stats collector.

Maybe "overloading" or "overwhelming"?

I see you removed the pgstat_send_wal(force) change. The rest of my
comments on the v6 patch still stand I believe.

David J.

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-26 06:56:22
Message-ID:	[email protected]
Lists:	pgsql-hackers

Hi, David.

Thanks for your comments.

On 2021-01-26 08:48, David G. Johnston wrote:
> On Mon, Jan 25, 2021 at 8:03 AM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
>> On Mon, Jan 25, 2021 at 4:51 PM Masahiro Ikeda
>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>
>>> Hi, thanks for the reviews.
>>>
>>> I updated the attached patch.
>>
>> Thank you for updating the patch!
>
> Your original email with "total number of times" is more correct,
> removing the "of times" and just writing "total number of WAL" is not
> good wording.
>
> Specifically, this change is strictly worse than the original.
>
> - Number of times WAL data was written to disk because WAL
> buffers became full
> + Total number of WAL data written to disk because WAL buffers
> became full
>
> Both have the flaw that they leave implied exactly what it means to
> "write WAL to disk". It is also unclear whether a counter, bytes, or
> both, would be more useful here. I've incorporated this into my
> documentation suggestions below:
> (wal_buffers_full)
>
> -- Revert - the original was better, though maybe add more detail
> similar to the below. I didn't research exactly how this works.

OK, understood.
I reverted it since this is a counter statistic.

> (wal_write)
> The number of times WAL buffers were written out to disk via XLogWrite
>

Thanks.

I thought it's better to omit "The" and "XLogWrite" because other views'
descriptions omit "The" and there is no description of "XLogWrite" in the
documents.
What do you think?

> -- Seems like this should have a bytes version too

Do you mean that we need separate statistics for WAL writes?

> (wal_write_time)
> The amount of time spent writing WAL buffers to disk, excluding sync
> time unless the wal_sync_method is either open_datasync or open_sync.
> Units are in milliseconds with microsecond resolution. This is zero
> when track_wal_io_timing is disabled.

Thanks, I'll fix it.

> (wal_sync)
> The number of times WAL files were synced to disk while
> wal_sync_method was set to one of the "sync at commit" options (i.e.,
> fdatasync, fsync, or fsync_writethrough).

Thanks, I'll fix it.

> -- it is not going to be zero just because those settings are
> presently disabled as they could have been enabled at some point since
> the last time these statistics were reset.

Right, your description is correct.
The "track_wal_io_timing" has the same limitation, doesn't it?

> (wal_sync_time)
> The amount of time spent syncing WAL files to disk, in milliseconds
> with microsecond resolution. This requires setting wal_sync_method to
> one of the "sync at commit" options (i.e., fdatasync, fsync, or
> fsync_writethrough).

Thanks, I'll fix it.
I will add the comments related to "track_wal_io_timing".

> Also,
>
> I would suggest extracting the changes to postmaster/pgstat.c and
> replication/walreceiver.c to a separate patch as you've fundamentally
> changed how it behaves with regards to that function and how it
> interacts with the WAL receiver. That seems an entirely separate
> topic warranting its own patch and discussion.

OK, I will split this into two patches.

On 2021-01-26 08:52, David G. Johnston wrote:
> On Mon, Jan 25, 2021 at 4:37 PM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
>> I agree with your comments. I think it should report when
>> reaching the end of WAL too. I add the code to report the stats
>> when finishing the current WAL segment file when timeout in the
>> main loop and when reaching the end of WAL.
>
> The following is not an improvement:
>
> - /* Send WAL statistics to the stats collector. */
> + /* Send WAL statistics to stats collector */
>
> The word "the" there makes it proper English. Your copy-pasting
> should have kept the existing good wording in the other locations
> rather than replace the existing location with the newly added
> incorrect wording.

Thanks, I'll fix it.

> This doesn't make sense:
>
> * current WAL segment file to avoid loading stats collector.
>
> Maybe "overloading" or "overwhelming"?
>
> I see you removed the pgstat_send_wal(force) change. The rest of my
> comments on the v6 patch still stand I believe.

Yes, "overloading" is right. Thanks.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-26 15:14:16
Message-ID:	CAKFQuwbpRk644CXZ5M_T3j3a-wb5MN-FGn5FkRQ0PvKnAmD9jg@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
wrote:

>
> > (wal_write)
> > The number of times WAL buffers were written out to disk via XLogWrite
> >
>
> Thanks.
>
> I thought it's better to omit "The" and "XLogWrite" because other views'
> description
> omits "The" and there is no description of "XlogWrite" in the documents.
> What do you think?
>
>
The documentation for WAL does get into the public API level of detail and
doing so here makes what this measures crystal clear. The potential
absence of sufficient detail elsewhere should be corrected instead of
making this description more vague. Specifically, probably XLogWrite
should be added to the WAL overview as part of this update and probably
even have the descriptive section of the documentation note that the number
of times that said function is executed is exposed as a counter in the wal
statistics table - thus closing the loop.

David J.

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-01-29 08:49:00
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-01-27 00:14, David G. Johnston wrote:
> On Mon, Jan 25, 2021 at 11:56 PM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
>>> (wal_write)
>>> The number of times WAL buffers were written out to disk via
>> XLogWrite
>>>
>>
>> Thanks.
>>
>> I thought it's better to omit "The" and "XLogWrite" because other
>> views'
>> description
>> omits "The" and there is no description of "XlogWrite" in the
>> documents.
>> What do you think?
>
> The documentation for WAL does get into the public API level of detail
> and doing so here makes what this measures crystal clear. The
> potential absence of sufficient detail elsewhere should be corrected
> instead of making this description more vague. Specifically, probably
> XLogWrite should be added to the WAL overview as part of this update
> and probably even have the descriptive section of the documentation
> note that the number of times that said function is executed is
> exposed as a counter in the wal statistics table - thus closing the
> loop.

Thanks for your comments.

I added the descriptions to the documentation and split the patch
into the two attached patches. The first adds WAL I/O activity statistics.
The second makes the wal receiver report the WAL statistics.

Please let me know if you have any comments.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v8-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	19.0 KB
v8-0002-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-diff	3.6 KB

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-04 23:45:38
Message-ID:	[email protected]
Lists:	pgsql-hackers

I pgindented the patches.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v9-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	18.4 KB
v9-0002-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-diff	3.6 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-08 04:01:10
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/02/05 8:45, Masahiro Ikeda wrote:
> I pgindented the patches.

Thanks for updating the patches!

+ <function>XLogWrite</function>, which nomally called by an
+ <function>issue_xlog_fsync</function>, which nomally called by an

Typo: "nomally" should be "normally"?

+ <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+ <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),

Isn't it better to add a space character just after "request"?

+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);

If several cycles happen in the do-while loop, m_wal_write_time should be
updated with the sum of "duration" in those cycles instead of "duration"
in the last cycle? If yes, "+=" should be used instead of "=" when updating
m_wal_write_time?

+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);

Also "=" should be "+=" in the above?
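
The difference between "=" and "+=" in a loop can be shown with a small
standalone sketch. The struct and function names below are hypothetical
stand-ins for the WalStats fields, and the durations are made-up sample
values rather than measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    int64_t write_count;    /* number of write cycles */
    int64_t write_time_us;  /* time spent writing, microseconds */
} demo_wal_stats;

/* Uses "=": each cycle overwrites the timer, so only the last one survives. */
static void
demo_record_assign(demo_wal_stats *stats, const int64_t *durations_us, size_t n)
{
    assert(stats != NULL && durations_us != NULL);
    for (size_t i = 0; i < n; i++)
    {
        stats->write_count++;
        stats->write_time_us = durations_us[i];
    }
}

/* Uses "+=": the timer holds the sum over all cycles. */
static void
demo_record_accumulate(demo_wal_stats *stats, const int64_t *durations_us, size_t n)
{
    assert(stats != NULL && durations_us != NULL);
    for (size_t i = 0; i < n; i++)
    {
        stats->write_count++;
        stats->write_time_us += durations_us[i];
    }
}
```

For three cycles of 120, 80 and 300 microseconds, the first variant leaves
300 in the timer while the second leaves 500; both count three writes.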

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-08 05:26:14
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/02/08 13:01, Fujii Masao wrote:
>
>
> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>> I pgindented the patches.
>
> Thanks for updating the patches!
>
> +       <function>XLogWrite</function>, which nomally called by an
> +       <function>issue_xlog_fsync</function>, which nomally called by an
>
> Typo: "nomally" should be "normally"?
>
> +       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
> +       <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
>
> Isn't it better to add a space character just after "request"?
>
> +                    INSTR_TIME_SET_CURRENT(duration);
> +                    INSTR_TIME_SUBTRACT(duration, start);
> +                    WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
>
> If several cycles happen in the do-while loop, m_wal_write_time should be
> updated with the sum of "duration" in those cycles instead of "duration"
> in the last cycle? If yes, "+=" should be used instead of "=" when updating
> m_wal_write_time?
>
> +            INSTR_TIME_SET_CURRENT(duration);
> +            INSTR_TIME_SUBTRACT(duration, start);
> +            WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>
> Also "=" should be "+=" in the above?

+ /* Send WAL statistics */
+ pgstat_send_wal();

This may add overhead to WAL writing by walwriter because it's called
every cycle, even when walwriter needs to write more WAL in the next cycle
(and so doesn't need to sleep on WaitLatch)? If this is right, pgstat_send_wal()
should be called only when WaitLatch() returns with WL_TIMEOUT?
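
A rough sketch of the difference in message volume, assuming a simplified loop in which each wake-up is either an early latch wake-up or a timeout (the enum and function names are hypothetical, not PostgreSQL's):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wake-up reasons, loosely modelled on WaitLatch() results. */
typedef enum WakeReason
{
    WAKE_LATCH_SET,     /* woken early because more WAL is waiting */
    WAKE_TIMEOUT        /* slept for the full delay */
} WakeReason;

/*
 * Count how many stats messages the loop would send over the given wake-ups.
 * If only_on_timeout is true, a message is sent only after a timeout;
 * otherwise one is sent on every cycle.
 */
size_t
count_stats_sends(const WakeReason *wakeups, size_t n, bool only_on_timeout)
{
    size_t sends = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!only_on_timeout || wakeups[i] == WAKE_TIMEOUT)
            sends++;
    }
    return sends;
}
```

For three early wake-ups followed by one timeout, sending on every cycle produces four messages, while sending only on timeout produces one.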

- <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>)
+ <function>XLogFlush</function> request(see <xref linkend="wal-configuration"/>),
+ or WAL data written out to disk by WAL receiver.

So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
wal_sync_time are updated even while the other values are not. Isn't this
confusing to users? If so, what about reporting those walreceiver stats in
pg_stat_wal_receiver?

if (endofwal)
+ {
+ /* Send WAL statistics to the stats collector */
+ pgstat_send_wal();
break;

You added pgstat_send_wal() so that it's called in some cases where
walreceiver exits. But ISTM that there are other walreceiver-exit cases,
for example, the case where SIGTERM is received. Instead,
pgstat_send_wal() should be called in WalRcvDie() to cover all those cases?
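
The idea can be sketched as follows, assuming a single exit hook that every exit path passes through. All names are hypothetical stand-ins: flush_stats() plays the role of pgstat_send_wal() and receiver_die() the role of WalRcvDie().

```c
/* Hypothetical exit reasons for a receiver-like process. */
typedef enum ExitReason
{
    EXIT_END_OF_WAL,
    EXIT_SIGTERM,
    EXIT_ERROR
} ExitReason;

typedef struct ReceiverSketch
{
    unsigned pending_writes;    /* counted locally, not yet reported */
    unsigned reported_writes;   /* already handed to the collector */
} ReceiverSketch;

/* Stand-in for pgstat_send_wal(): hand over the local counters. */
void
flush_stats(ReceiverSketch *r)
{
    r->reported_writes += r->pending_writes;
    r->pending_writes = 0;
}

/* Stand-in for WalRcvDie(): the one place every exit path goes through. */
void
receiver_die(ReceiverSketch *r)
{
    flush_stats(r);
}

/* Whatever the reason for leaving, the shared exit hook runs. */
void
receiver_exit(ReceiverSketch *r, ExitReason reason)
{
    (void) reason;      /* the reason does not decide whether stats are sent */
    receiver_die(r);
}
```

Putting the flush in the shared hook means a newly added exit path cannot forget it.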

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-09 15:51:19
Message-ID:	CAKFQuwYHX-h3_LkrbNYUguN9fP-YfOVsKyVJ8EaZY2O4Z0upGA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
wrote:

> I pgindented the patches.
>
>
... <function>XLogWrite</function>, which is invoked during an
<function>XLogFlush</function> request (see ...). This is also incremented
by the WAL receiver during replication.

("which normally called" should be "which is normally called" or "which
normally is called" if you want to keep true to the original)

You missed adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either

"This parameter is off by default as it will repeatedly query the operating
system..."
", because" -> "as"

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in....) This is undesirable because ..."

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics. This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the
XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver
path does not. It seems technically straightforward to refactor here to
avoid the almost-duplicated logic in the two places, though I suspect there
may be a trade-off for not adding another function call to the stack given
the importance of WAL processing (though that seems marginal compared
to the cost of actually writing the WAL). Or, as Fujii noted, go the other
way and don't have any shared code between the two but instead implement
the WAL receiver one to use pg_stat_wal_receiver instead. In either case,
this half-and-half implementation seems undesirable.
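
One possible shape for such a refactoring, as a sketch only: a single helper that both write paths call, so the counting and timing logic exists once. Every name here is hypothetical; the clock and the write routine are passed in as function pointers, and fake_clock() and discard_write() are test doubles included just to make the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counters shared by both write paths. */
typedef struct WriteStatsSketch
{
    uint64_t writes;            /* number of write calls */
    uint64_t write_time_us;     /* accumulated time, microseconds */
} WriteStatsSketch;

typedef uint64_t (*clock_fn) (void);                    /* "now" in us */
typedef long (*write_fn) (const void *buf, size_t len); /* does the write */

/*
 * One helper for every caller: count the write and, if timing is enabled,
 * add the elapsed time. Returns whatever the underlying write returned.
 */
long
timed_write(WriteStatsSketch *stats, int track_timing,
            clock_fn now, write_fn do_write,
            const void *buf, size_t len)
{
    uint64_t start = track_timing ? now() : 0;
    long     written = do_write(buf, len);

    if (track_timing)
        stats->write_time_us += now() - start;
    stats->writes++;
    return written;
}

/* Test double: a clock that advances 7 microseconds per reading. */
uint64_t
fake_clock(void)
{
    static uint64_t t = 0;

    t += 7;
    return t;
}

/* Test double: pretend the whole buffer was written. */
long
discard_write(const void *buf, size_t len)
{
    (void) buf;
    return (long) len;
}
```

The extra call costs one indirection per write, which is the trade-off mentioned above.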

David J.

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-15 02:32:12
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-02-08 13:01, Fujii Masao wrote:
> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>> I pgindented the patches.
>
> Thanks for updating the patches!

Thanks for checking the patches.

> + <function>XLogWrite</function>, which nomally called by an
> + <function>issue_xlog_fsync</function>, which nomally called by
> an
>
> Typo: "nomally" should be "normally"?

Yes, I'll fix it.

> + <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>)
> + <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>),
>
> Isn't it better to add a space character just after "request"?

Thanks, I'll fix it.

> + INSTR_TIME_SET_CURRENT(duration);
> + INSTR_TIME_SUBTRACT(duration, start);
> + WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
>
> If several cycles happen in the do-while loop, m_wal_write_time should
> be
> updated with the sum of "duration" in those cycles instead of
> "duration"
> in the last cycle? If yes, "+=" should be used instead of "=" when
> updating
> m_wal_write_time?
> + INSTR_TIME_SET_CURRENT(duration);
> + INSTR_TIME_SUBTRACT(duration, start);
> + WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>
> Also "=" should be "+=" in the above?

Yes, those were my mistakes, made when changing the unit from milliseconds to
microseconds.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-15 02:42:25
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-02-08 14:26, Fujii Masao wrote:
> On 2021/02/08 13:01, Fujii Masao wrote:
>>
>>
>> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>>> I pgindented the patches.
>>
>> Thanks for updating the patches!
>>
>> +       <function>XLogWrite</function>, which nomally called by an
>> +       <function>issue_xlog_fsync</function>, which nomally called by
>> an
>>
>> Typo: "nomally" should be "normally"?
>>
>> +       <function>XLogFlush</function> request(see <xref
>> linkend="wal-configuration"/>)
>> +       <function>XLogFlush</function> request(see <xref
>> linkend="wal-configuration"/>),
>>
>> Isn't it better to add a space character just after "request"?
>>
>> +                    INSTR_TIME_SET_CURRENT(duration);
>> +                    INSTR_TIME_SUBTRACT(duration, start);
>> +                    WalStats.m_wal_write_time =
>> INSTR_TIME_GET_MICROSEC(duration);
>>
>> If several cycles happen in the do-while loop, m_wal_write_time should
>> be
>> updated with the sum of "duration" in those cycles instead of
>> "duration"
>> in the last cycle? If yes, "+=" should be used instead of "=" when
>> updating
>> m_wal_write_time?
>>
>> +            INSTR_TIME_SET_CURRENT(duration);
>> +            INSTR_TIME_SUBTRACT(duration, start);
>> +            WalStats.m_wal_sync_time =
>> INSTR_TIME_GET_MICROSEC(duration);
>>
>> Also "=" should be "+=" in the above?
>
> + /* Send WAL statistics */
> + pgstat_send_wal();
>
> This may cause overhead in WAL-writing by walwriter because it's called
> every cycles even when walwriter needs to write more WAL next cycle
> (don't need to sleep on WaitLatch)? If this is right, pgstat_send_wal()
> should be called only when WaitLatch() returns with WL_TIMEOUT?

Thanks, I didn't notice that.
I'll fix it.

> - <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>)
> + <function>XLogFlush</function> request(see <xref
> linkend="wal-configuration"/>),
> + or WAL data written out to disk by WAL receiver.
>
> So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
> wal_sync_time are updated even while the other values are not. Isn't
> this
> confusing to users? If so, what about reporting those walreceiver stats
> in
> pg_stat_wal_receiver?

OK, I'll add new infrastructure code for the WAL receiver to interact with
the stats collector, and show the stats in pg_stat_wal_receiver.

> if (endofwal)
> + {
> + /* Send WAL statistics to the stats collector */
> + pgstat_send_wal();
> break;
>
> You added pgstat_send_wal() so that it's called in some cases where
> walreceiver exits. But ISTM that there are other walreceiver-exit
> cases.
> For example, in the case where SIGTERM is received. Instead,
> pgstat_send_wal() should be called in WalRcvDie() for those all cases?

Thanks, I forgot that case.
I'll fix it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-15 02:59:48
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-02-10 00:51, David G. Johnston wrote:
> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>
>> I pgindented the patches.
>
> ... <function>XLogWrite</function>, which is invoked during an
> <function>XLogFlush</function> request (see ...). This is also
> incremented by the WAL receiver during replication.
>
> ("which normally called" should be "which is normally called" or
> "which normally is called" if you want to keep true to the original)
> You missed the adding the space before an opening parenthesis here and
> elsewhere (probably copy-paste)
>
> is ether -> is either
> "This parameter is off by default as it will repeatedly query the
> operating system..."
> ", because" -> "as"

Thanks, I fixed them.

> wal_write_time and the sync items also need the note: "This is also
> incremented by the WAL receiver during replication."

I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.

> "The number of times it happened..." -> " (the tally of this event is
> reported in wal_buffers_full in....) This is undesirable because ..."

Thanks, I fixed it.

> I notice that the patch for WAL receiver doesn't require explicitly
> computing the sync statistics but does require computing the write
> statistics. This is because of the presence of issue_xlog_fsync but
> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
> receiver path does not. It seems technically straight-forward to
> refactor here to avoid the almost-duplicated logic in the two places,
> though I suspect there may be a trade-off for not adding another
> function call to the stack given the importance of WAL processing
> (though that seems marginalized compared to the cost of actually
> writing the WAL). Or, as Fujii noted, go the other way and don't have
> any shared code between the two but instead implement the WAL receiver
> one to use pg_stat_wal_receiver instead. In either case, this
> half-and-half implementation seems undesirable.

OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

I added the infrastructure code to pass WAL receiver stats
messages from the WAL receiver to the stats collector, and
the stats for the WAL receiver are counted in pg_stat_wal_receiver.
What do you think?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v10-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	19.0 KB
v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-diff	26.3 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-02-24 07:14:05
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/02/15 11:59, Masahiro Ikeda wrote:
> On 2021-02-10 00:51, David G. Johnston wrote:
>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>
>>> I pgindented the patches.
>>
>> ... <function>XLogWrite</function>, which is invoked during an
>> <function>XLogFlush</function> request (see ...). This is also
>> incremented by the WAL receiver during replication.
>>
>> ("which normally called" should be "which is normally called" or
>> "which normally is called" if you want to keep true to the original)
>> You missed the adding the space before an opening parenthesis here and
>> elsewhere (probably copy-paste)
>>
>> is ether -> is either
>> "This parameter is off by default as it will repeatedly query the
>> operating system..."
>> ", because" -> "as"
>
> Thanks, I fixed them.
>
>> wal_write_time and the sync items also need the note: "This is also
>> incremented by the WAL receiver during replication."
>
> I skipped changing it since I separated the stats for the WAL receiver
> in pg_stat_wal_receiver.
>
>> "The number of times it happened..." -> " (the tally of this event is
>> reported in wal_buffers_full in....) This is undesirable because ..."
>
> Thanks, I fixed it.
>
>> I notice that the patch for WAL receiver doesn't require explicitly
>> computing the sync statistics but does require computing the write
>> statistics. This is because of the presence of issue_xlog_fsync but
>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>> receiver path does not. It seems technically straight-forward to
>> refactor here to avoid the almost-duplicated logic in the two places,
>> though I suspect there may be a trade-off for not adding another
>> function call to the stack given the importance of WAL processing
>> (though that seems marginalized compared to the cost of actually
>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>> any shared code between the two but instead implement the WAL receiver
>> one to use pg_stat_wal_receiver instead. In either case, this
>> half-and-half implementation seems undesirable.
>
> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the patches!

> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
> the stats for WAL receiver is counted in pg_stat_wal_receiver.
> What do you think?

On second thought, this idea doesn't seem good, because those stats are
accumulated across multiple walreceivers, while the other values in
pg_stat_wal_receiver relate only to the walreceiver process running
at that moment. IOW, it seems strange that some values show dynamic
stats and the others show collected stats, even though they are in
the same view pg_stat_wal_receiver. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-03 05:33:03
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-02-24 16:14, Fujii Masao wrote:
> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>> On 2021-02-10 00:51, David G. Johnston wrote:
>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>
>>>> I pgindented the patches.
>>>
>>> ... <function>XLogWrite</function>, which is invoked during an
>>> <function>XLogFlush</function> request (see ...). This is also
>>> incremented by the WAL receiver during replication.
>>>
>>> ("which normally called" should be "which is normally called" or
>>> "which normally is called" if you want to keep true to the original)
>>> You missed the adding the space before an opening parenthesis here
>>> and
>>> elsewhere (probably copy-paste)
>>>
>>> is ether -> is either
>>> "This parameter is off by default as it will repeatedly query the
>>> operating system..."
>>> ", because" -> "as"
>>
>> Thanks, I fixed them.
>>
>>> wal_write_time and the sync items also need the note: "This is also
>>> incremented by the WAL receiver during replication."
>>
>> I skipped changing it since I separated the stats for the WAL receiver
>> in pg_stat_wal_receiver.
>>
>>> "The number of times it happened..." -> " (the tally of this event is
>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>
>> Thanks, I fixed it.
>>
>>> I notice that the patch for WAL receiver doesn't require explicitly
>>> computing the sync statistics but does require computing the write
>>> statistics. This is because of the presence of issue_xlog_fsync but
>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe
>>> that
>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>> receiver path does not. It seems technically straight-forward to
>>> refactor here to avoid the almost-duplicated logic in the two places,
>>> though I suspect there may be a trade-off for not adding another
>>> function call to the stack given the importance of WAL processing
>>> (though that seems marginalized compared to the cost of actually
>>> writing the WAL). Or, as Fujii noted, go the other way and don't
>>> have
>>> any shared code between the two but instead implement the WAL
>>> receiver
>>> one to use pg_stat_wal_receiver instead. In either case, this
>>> half-and-half implementation seems undesirable.
>>
>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>
> Thanks for updating the patches!
>
>
>> I added the infrastructure code to communicate the WAL receiver stats
>> messages between the WAL receiver and the stats collector, and
>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>> What do you think?
>
> On second thought, this idea seems not good. Because those stats are
> collected between multiple walreceivers, but other values in
> pg_stat_wal_receiver is only related to the walreceiver process running
> at that moment. IOW, it seems strange that some values show dynamic
> stats and the others show collected stats, even though they are in
> the same view pg_stat_wal_receiver. Thought?

OK, I fixed it.
The stats collected in the WAL receiver are exposed in the pg_stat_wal view
in the v11 patch.

I refactored the logic that writes the xlog file to unify the collection of
write stats.
As David said, although pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE)
is not called in the WAL receiver's path,
I agree that the cost of writing the WAL is much bigger than the overhead
of an extra function call.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v11-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	19.3 KB
v11-0002-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-diff	7.6 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-03 07:30:04
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/03 14:33, Masahiro Ikeda wrote:
> On 2021-02-24 16:14, Fujii Masao wrote:
>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>
>>>>> I pgindented the patches.
>>>>
>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>> <function>XLogFlush</function> request (see ...). This is also
>>>> incremented by the WAL receiver during replication.
>>>>
>>>> ("which normally called" should be "which is normally called" or
>>>> "which normally is called" if you want to keep true to the original)
>>>> You missed the adding the space before an opening parenthesis here and
>>>> elsewhere (probably copy-paste)
>>>>
>>>> is ether -> is either
>>>> "This parameter is off by default as it will repeatedly query the
>>>> operating system..."
>>>> ", because" -> "as"
>>>
>>> Thanks, I fixed them.
>>>
>>>> wal_write_time and the sync items also need the note: "This is also
>>>> incremented by the WAL receiver during replication."
>>>
>>> I skipped changing it since I separated the stats for the WAL receiver
>>> in pg_stat_wal_receiver.
>>>
>>>> "The number of times it happened..." -> " (the tally of this event is
>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>
>>> Thanks, I fixed it.
>>>
>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>> computing the sync statistics but does require computing the write
>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>> receiver path does not. It seems technically straight-forward to
>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>> though I suspect there may be a trade-off for not adding another
>>>> function call to the stack given the importance of WAL processing
>>>> (though that seems marginalized compared to the cost of actually
>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>> any shared code between the two but instead implement the WAL receiver
>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>> half-and-half implementation seems undesirable.
>>>
>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>
>> Thanks for updating the patches!
>>
>>
>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>> What do you think?
>>
>> On second thought, this idea seems not good. Because those stats are
>> collected between multiple walreceivers, but other values in
>> pg_stat_wal_receiver is only related to the walreceiver process running
>> at that moment. IOW, it seems strange that some values show dynamic
>> stats and the others show collected stats, even though they are in
>> the same view pg_stat_wal_receiver. Thought?
>
> OK, I fixed it.
> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.

Thanks for updating the patches! I'm now reading 001 patch.

+ /* Check whether the WAL file was synced to disk right now */
+ if (enableFsync &&
+ (sync_method == SYNC_METHOD_FSYNC ||
+ sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
+ sync_method == SYNC_METHOD_FDATASYNC))
+ {

Isn't it better to make issue_xlog_fsync() return immediately
if enableFsync is off or sync_method is open_sync or open_datasync,
to simplify the code further?
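
A minimal sketch of the early-return shape, assuming a simplified function that only counts syncs. The enum, struct and function names are hypothetical and merely mirror the sync methods discussed here.

```c
#include <stdbool.h>

/* Hypothetical mirror of the sync methods named in this thread. */
typedef enum SyncMethodSketch
{
    SKETCH_FSYNC,
    SKETCH_FDATASYNC,
    SKETCH_FSYNC_WRITETHROUGH,
    SKETCH_OPEN_SYNC,
    SKETCH_OPEN_DSYNC
} SyncMethodSketch;

typedef struct SyncStatsSketch
{
    unsigned syncs;     /* number of explicit sync calls issued */
} SyncStatsSketch;

/*
 * Guard-clause version: return first when no explicit sync call will be
 * made, so the counting code below needs no extra condition.
 * Returns true if a sync was issued (and counted).
 */
bool
issue_sync_sketch(SyncStatsSketch *stats, bool enable_fsync,
                  SyncMethodSketch method)
{
    if (!enable_fsync ||
        method == SKETCH_OPEN_SYNC ||
        method == SKETCH_OPEN_DSYNC)
        return false;

    /* ... the actual fsync()/fdatasync() call would go here ... */
    stats->syncs++;
    return true;
}
```

With the guard at the top, the statistics code no longer has to repeat the list of methods that sync explicitly.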

+ /*
+ * Send WAL statistics only if WalWriterDelay has elapsed to minimize
+ * the overhead in WAL-writing.
+ */
+ if (rc & WL_TIMEOUT)
+ pgstat_send_wal();

On second thought, this change means that it always takes wal_writer_delay
before walwriter's WAL stats are sent after XLogBackgroundFlush() is called.
For example, if wal_writer_delay is set to several seconds, some values in
pg_stat_wal would needlessly be out of date for those seconds.
So I'm thinking of withdrawing my previous comment; it's OK to send
the stats every time after XLogBackgroundFlush() is called. Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-03 11:27:29
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-03 16:30, Fujii Masao wrote:
> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>> On 2021-02-24 16:14, Fujii Masao wrote:
>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>
>>>>>> I pgindented the patches.
>>>>>
>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>> incremented by the WAL receiver during replication.
>>>>>
>>>>> ("which normally called" should be "which is normally called" or
>>>>> "which normally is called" if you want to keep true to the
>>>>> original)
>>>>> You missed the adding the space before an opening parenthesis here
>>>>> and
>>>>> elsewhere (probably copy-paste)
>>>>>
>>>>> is ether -> is either
>>>>> "This parameter is off by default as it will repeatedly query the
>>>>> operating system..."
>>>>> ", because" -> "as"
>>>>
>>>> Thanks, I fixed them.
>>>>
>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>> incremented by the WAL receiver during replication."
>>>>
>>>> I skipped changing it since I separated the stats for the WAL
>>>> receiver
>>>> in pg_stat_wal_receiver.
>>>>
>>>>> "The number of times it happened..." -> " (the tally of this event
>>>>> is
>>>>> reported in wal_buffers_full in....) This is undesirable because
>>>>> ..."
>>>>
>>>> Thanks, I fixed it.
>>>>
>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>> computing the sync statistics but does require computing the write
>>>>> statistics. This is because of the presence of issue_xlog_fsync
>>>>> but
>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe
>>>>> that
>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>> receiver path does not. It seems technically straight-forward to
>>>>> refactor here to avoid the almost-duplicated logic in the two
>>>>> places,
>>>>> though I suspect there may be a trade-off for not adding another
>>>>> function call to the stack given the importance of WAL processing
>>>>> (though that seems marginalized compared to the cost of actually
>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't
>>>>> have
>>>>> any shared code between the two but instead implement the WAL
>>>>> receiver
>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>> half-and-half implementation seems undesirable.
>>>>
>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>
>>> Thanks for updating the patches!
>>>
>>>
>>>> I added the infrastructure code to communicate the WAL receiver
>>>> stats messages between the WAL receiver and the stats collector, and
>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>> What do you think?
>>>
>>> On second thought, this idea seems not good. Because those stats are
>>> collected between multiple walreceivers, but other values in
>>> pg_stat_wal_receiver is only related to the walreceiver process
>>> running
>>> at that moment. IOW, it seems strange that some values show dynamic
>>> stats and the others show collected stats, even though they are in
>>> the same view pg_stat_wal_receiver. Thought?
>>
>> OK, I fixed it.
>> The stats collected in the WAL receiver is exposed in pg_stat_wal view
>> in v11 patch.
>
> Thanks for updating the patches! I'm now reading 001 patch.
>
> + /* Check whether the WAL file was synced to disk right now */
> + if (enableFsync &&
> + (sync_method == SYNC_METHOD_FSYNC ||
> + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
> + sync_method == SYNC_METHOD_FDATASYNC))
> + {
>
> Isn't it better to make issue_xlog_fsync() return immediately
> if enableFsync is off, sync_method is open_sync or open_data_sync,
> to simplify the code more?

Thanks for the comments.
I added the above code in v12 patch.

>
> + /*
> + * Send WAL statistics only if WalWriterDelay has elapsed to
> minimize
> + * the overhead in WAL-writing.
> + */
> + if (rc & WL_TIMEOUT)
> + pgstat_send_wal();
>
> On second thought, this change means that it always takes
> wal_writer_delay
> before walwriter's WAL stats is sent after XLogBackgroundFlush() is
> called.
> For example, if wal_writer_delay is set to several seconds, some values
> in
> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
> So I'm thinking to withdraw my previous comment and it's ok to send
> the stats every after XLogBackgroundFlush() is called. Thought?

Thanks, I didn't notice that.

Although PGSTAT_STAT_INTERVAL is 500 msec, wal_writer_delay's
default value is 200 msec, and it may be set to a shorter time.

How about checking the timestamp instead?

+ /*
+ * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
+ * msec since we last sent one
+ */
+ now = GetCurrentTimestamp();
+ if (TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+ {
+ pgstat_send_wal();
+ last_report = now;
+ }
+

Although I considered adding the check inside
pgstat_send_wal(),
I didn't do so, to avoid checking PGSTAT_STAT_INTERVAL twice:
pgstat_send_wal() is invoked by pgstat_report_stat(), which already checks
PGSTAT_STAT_INTERVAL.
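
The throttling idea can be sketched on its own, assuming times are plain millisecond counts supplied by the caller. The names are hypothetical, and the 500 msec constant simply mirrors the interval mentioned in this message.

```c
#include <stdbool.h>
#include <stdint.h>

#define REPORT_INTERVAL_MS 500  /* mirrors the 500 msec interval */

typedef struct ThrottleSketch
{
    int64_t  last_report_ms;    /* time of the last send */
    unsigned sends;             /* number of messages sent so far */
} ThrottleSketch;

/*
 * Send only if at least REPORT_INTERVAL_MS has passed since the last send.
 * Returns true if a message was sent.
 */
bool
maybe_send_stats(ThrottleSketch *t, int64_t now_ms)
{
    if (now_ms - t->last_report_ms < REPORT_INTERVAL_MS)
        return false;

    t->sends++;
    t->last_report_ms = now_ms;
    return true;
}
```

With a 200 msec cycle starting from time zero, the calls at 200 and 400 are skipped, the call at 600 sends, and the next send is the call at 1200.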

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v12-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	19.9 KB
v11_v12_0001.diff	text/x-diff	6.7 KB

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-04 07:14:42
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-03 20:27, Masahiro Ikeda wrote:
> On 2021-03-03 16:30, Fujii Masao wrote:
>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>
>>>>>>> I pgindented the patches.
>>>>>>
>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>> incremented by the WAL receiver during replication.
>>>>>>
>>>>>> ("which normally called" should be "which is normally called" or
>>>>>> "which normally is called" if you want to keep true to the
>>>>>> original)
>>>>>> You missed the adding the space before an opening parenthesis here
>>>>>> and
>>>>>> elsewhere (probably copy-paste)
>>>>>>
>>>>>> is ether -> is either
>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>> operating system..."
>>>>>> ", because" -> "as"
>>>>>
>>>>> Thanks, I fixed them.
>>>>>
>>>>>> wal_write_time and the sync items also need the note: "This is
>>>>>> also
>>>>>> incremented by the WAL receiver during replication."
>>>>>
>>>>> I skipped changing it since I separated the stats for the WAL
>>>>> receiver
>>>>> in pg_stat_wal_receiver.
>>>>>
>>>>>> "The number of times it happened..." -> " (the tally of this event
>>>>>> is
>>>>>> reported in wal_buffers_full in....) This is undesirable because
>>>>>> ..."
>>>>>
>>>>> Thanks, I fixed it.
>>>>>
>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>> explicitly
>>>>>> computing the sync statistics but does require computing the write
>>>>>> statistics. This is because of the presence of issue_xlog_fsync
>>>>>> but
>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe
>>>>>> that
>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>> refactor here to avoid the almost-duplicated logic in the two
>>>>>> places,
>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>> function call to the stack given the importance of WAL processing
>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't
>>>>>> have
>>>>>> any shared code between the two but instead implement the WAL
>>>>>> receiver
>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>> half-and-half implementation seems undesirable.
>>>>>
>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>
>>>> Thanks for updating the patches!
>>>>
>>>>
>>>>> I added the infrastructure code to communicate the WAL receiver
>>>>> stats messages between the WAL receiver and the stats collector,
>>>>> and
>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>> What do you think?
>>>>
>>>> On second thought, this idea seems not good. Because those stats are
>>>> collected between multiple walreceivers, but other values in
>>>> pg_stat_wal_receiver is only related to the walreceiver process
>>>> running
>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>> stats and the others show collected stats, even though they are in
>>>> the same view pg_stat_wal_receiver. Thought?
>>>
>>> OK, I fixed it.
>>> The stats collected in the WAL receiver is exposed in pg_stat_wal
>>> view in v11 patch.
>>
>> Thanks for updating the patches! I'm now reading 001 patch.
>>
>> + /* Check whether the WAL file was synced to disk right now */
>> + if (enableFsync &&
>> + (sync_method == SYNC_METHOD_FSYNC ||
>> + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>> + sync_method == SYNC_METHOD_FDATASYNC))
>> + {
>>
>> Isn't it better to make issue_xlog_fsync() return immediately
>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>> to simplify the code more?
>
> Thanks for the comments.
> I added the above code in v12 patch.
>
>>
>> + /*
>> + * Send WAL statistics only if WalWriterDelay has elapsed to
>> minimize
>> + * the overhead in WAL-writing.
>> + */
>> + if (rc & WL_TIMEOUT)
>> + pgstat_send_wal();
>>
>> On second thought, this change means that it always takes
>> wal_writer_delay
>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is
>> called.
>> For example, if wal_writer_delay is set to several seconds, some
>> values in
>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>> So I'm thinking to withdraw my previous comment and it's ok to send
>> the stats every after XLogBackgroundFlush() is called. Thought?
>
> Thanks, I didn't notice that.
>
> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
> default value is 200msec and it may be set shorter time.
>
> Why don't to make another way to check the timestamp?
>
> + /*
> + * Don't send a message unless it's been at least
> PGSTAT_STAT_INTERVAL
> + * msec since we last sent one
> + */
> + now = GetCurrentTimestamp();
> + if (TimestampDifferenceExceeds(last_report, now,
> PGSTAT_STAT_INTERVAL))
> + {
> + pgstat_send_wal();
> + last_report = now;
> + }
> +
>
> Although I worried that it's better to add the check code in
> pgstat_send_wal(),
> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
> PGSTAT_STAT_INTERVAL.

I forgot to remove an unused variable.
The attached v13 patch fixes that.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v13-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	19.9 KB
v12_v13.diff	text/x-diff	2.3 KB

From:	Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-04 11:25:06
Message-ID:	CALtqXTedQoyY93OXSi0ogu+bh=jUrROyy+ODGdxPHrCz3yLDaQ@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Mar 4, 2021 at 12:14 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
wrote:

> On 2021-03-03 20:27, Masahiro Ikeda wrote:
> > On 2021-03-03 16:30, Fujii Masao wrote:
> >> On 2021/03/03 14:33, Masahiro Ikeda wrote:
> >>> On 2021-02-24 16:14, Fujii Masao wrote:
> >>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
> >>>>> On 2021-02-10 00:51, David G. Johnston wrote:
> >>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
> >>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> >>>>>>
> >>>>>>> I pgindented the patches.
> >>>>>>
> >>>>>> ... <function>XLogWrite</function>, which is invoked during an
> >>>>>> <function>XLogFlush</function> request (see ...). This is also
> >>>>>> incremented by the WAL receiver during replication.
> >>>>>>
> >>>>>> ("which normally called" should be "which is normally called" or
> >>>>>> "which normally is called" if you want to keep true to the
> >>>>>> original)
> >>>>>> You missed the adding the space before an opening parenthesis here
> >>>>>> and
> >>>>>> elsewhere (probably copy-paste)
> >>>>>>
> >>>>>> is ether -> is either
> >>>>>> "This parameter is off by default as it will repeatedly query the
> >>>>>> operating system..."
> >>>>>> ", because" -> "as"
> >>>>>
> >>>>> Thanks, I fixed them.
> >>>>>
> >>>>>> wal_write_time and the sync items also need the note: "This is
> >>>>>> also
> >>>>>> incremented by the WAL receiver during replication."
> >>>>>
> >>>>> I skipped changing it since I separated the stats for the WAL
> >>>>> receiver
> >>>>> in pg_stat_wal_receiver.
> >>>>>
> >>>>>> "The number of times it happened..." -> " (the tally of this event
> >>>>>> is
> >>>>>> reported in wal_buffers_full in....) This is undesirable because
> >>>>>> ..."
> >>>>>
> >>>>> Thanks, I fixed it.
> >>>>>
> >>>>>> I notice that the patch for WAL receiver doesn't require
> >>>>>> explicitly
> >>>>>> computing the sync statistics but does require computing the write
> >>>>>> statistics. This is because of the presence of issue_xlog_fsync
> >>>>>> but
> >>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe
> >>>>>> that
> >>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
> >>>>>> receiver path does not. It seems technically straight-forward to
> >>>>>> refactor here to avoid the almost-duplicated logic in the two
> >>>>>> places,
> >>>>>> though I suspect there may be a trade-off for not adding another
> >>>>>> function call to the stack given the importance of WAL processing
> >>>>>> (though that seems marginalized compared to the cost of actually
> >>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't
> >>>>>> have
> >>>>>> any shared code between the two but instead implement the WAL
> >>>>>> receiver
> >>>>>> one to use pg_stat_wal_receiver instead. In either case, this
> >>>>>> half-and-half implementation seems undesirable.
> >>>>>
> >>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
> >>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
> >>>>
> >>>> Thanks for updating the patches!
> >>>>
> >>>>
> >>>>> I added the infrastructure code to communicate the WAL receiver
> >>>>> stats messages between the WAL receiver and the stats collector,
> >>>>> and
> >>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
> >>>>> What do you think?
> >>>>
> >>>> On second thought, this idea seems not good. Because those stats are
> >>>> collected between multiple walreceivers, but other values in
> >>>> pg_stat_wal_receiver is only related to the walreceiver process
> >>>> running
> >>>> at that moment. IOW, it seems strange that some values show dynamic
> >>>> stats and the others show collected stats, even though they are in
> >>>> the same view pg_stat_wal_receiver. Thought?
> >>>
> >>> OK, I fixed it.
> >>> The stats collected in the WAL receiver is exposed in pg_stat_wal
> >>> view in v11 patch.
> >>
> >> Thanks for updating the patches! I'm now reading 001 patch.
> >>
> >> + /* Check whether the WAL file was synced to disk right now */
> >> + if (enableFsync &&
> >> + (sync_method == SYNC_METHOD_FSYNC ||
> >> + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
> >> + sync_method == SYNC_METHOD_FDATASYNC))
> >> + {
> >>
> >> Isn't it better to make issue_xlog_fsync() return immediately
> >> if enableFsync is off, sync_method is open_sync or open_data_sync,
> >> to simplify the code more?
> >
> > Thanks for the comments.
> > I added the above code in v12 patch.
> >
> >>
> >> + /*
> >> + * Send WAL statistics only if WalWriterDelay has elapsed
> to
> >> minimize
> >> + * the overhead in WAL-writing.
> >> + */
> >> + if (rc & WL_TIMEOUT)
> >> + pgstat_send_wal();
> >>
> >> On second thought, this change means that it always takes
> >> wal_writer_delay
> >> before walwriter's WAL stats is sent after XLogBackgroundFlush() is
> >> called.
> >> For example, if wal_writer_delay is set to several seconds, some
> >> values in
> >> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
> >> So I'm thinking to withdraw my previous comment and it's ok to send
> >> the stats every after XLogBackgroundFlush() is called. Thought?
> >
> > Thanks, I didn't notice that.
> >
> > Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
> > default value is 200msec and it may be set shorter time.
> >
> > Why don't to make another way to check the timestamp?
> >
> > + /*
> > + * Don't send a message unless it's been at least
> > PGSTAT_STAT_INTERVAL
> > + * msec since we last sent one
> > + */
> > + now = GetCurrentTimestamp();
> > + if (TimestampDifferenceExceeds(last_report, now,
> > PGSTAT_STAT_INTERVAL))
> > + {
> > + pgstat_send_wal();
> > + last_report = now;
> > + }
> > +
> >
> > Although I worried that it's better to add the check code in
> > pgstat_send_wal(),
> > I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
> > pgstat_send_wal() is invoked pg_report_stat() and it already checks the
> > PGSTAT_STAT_INTERVAL.
>
> I forgot to remove an unused variable.
> The attached v13 patch is fixed.
>
> Regards
> --
> Masahiro Ikeda
> NTT DATA CORPORATION

This patch set no longer applies:
http://cfbot.cputube.org/patch_32_2859.log

Can we get a rebase?

I am marking the patch "Waiting on Author"

--
Ibrar Ahmed

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-04 16:02:25
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/04 16:14, Masahiro Ikeda wrote:
> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>> On 2021-03-03 16:30, Fujii Masao wrote:
>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>
>>>>>>>> I pgindented the patches.
>>>>>>>
>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>
>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>> elsewhere (probably copy-paste)
>>>>>>>
>>>>>>> is ether -> is either
>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>> operating system..."
>>>>>>> ", because" -> "as"
>>>>>>
>>>>>> Thanks, I fixed them.
>>>>>>
>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>> incremented by the WAL receiver during replication."
>>>>>>
>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>> in pg_stat_wal_receiver.
>>>>>>
>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>
>>>>>> Thanks, I fixed it.
>>>>>>
>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>> computing the sync statistics but does require computing the write
>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>> half-and-half implementation seems undesirable.
>>>>>>
>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>
>>>>> Thanks for updating the patches!
>>>>>
>>>>>
>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>> What do you think?
>>>>>
>>>>> On second thought, this idea seems not good. Because those stats are
>>>>> collected between multiple walreceivers, but other values in
>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>> stats and the others show collected stats, even though they are in
>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>
>>>> OK, I fixed it.
>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>
>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>
>>> +    /* Check whether the WAL file was synced to disk right now */
>>> +    if (enableFsync &&
>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>> +    {
>>>
>>> Isn't it better to make issue_xlog_fsync() return immediately
>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>> to simplify the code more?
>>
>> Thanks for the comments.
>> I added the above code in v12 patch.
>>
>>>
>>> +        /*
>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>> +         * the overhead in WAL-writing.
>>> +         */
>>> +        if (rc & WL_TIMEOUT)
>>> +            pgstat_send_wal();
>>>
>>> On second thought, this change means that it always takes wal_writer_delay
>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>> For example, if wal_writer_delay is set to several seconds, some values in
>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>
>> Thanks, I didn't notice that.
>>
>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>> default value is 200msec and it may be set shorter time.

Yeah, if wal_writer_delay is set to a very small value, there is a risk
that the WAL stats are sent too frequently. I agree that's a problem.

>>
>> Why don't to make another way to check the timestamp?
>>
>> +               /*
>> +                * Don't send a message unless it's been at least
>> PGSTAT_STAT_INTERVAL
>> +                * msec since we last sent one
>> +                */
>> +               now = GetCurrentTimestamp();
>> +               if (TimestampDifferenceExceeds(last_report, now,
>> PGSTAT_STAT_INTERVAL))
>> +               {
>> +                       pgstat_send_wal();
>> +                       last_report = now;
>> +               }
>> +
>>
>> Although I worried that it's better to add the check code in pgstat_send_wal(),

Agreed.

>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>> PGSTAT_STAT_INTERVAL.

I think that we can do that. What about the attached patch?

> I forgot to remove an unused variable.
> The attached v13 patch is fixed.

Thanks for updating the patch!

+ w.wal_write,
+ w.wal_write_time,
+ w.wal_sync,
+ w.wal_sync_time,

Isn't it more natural to put wal_write_time and wal_sync_time next to
each other? That is, what about the following order of columns?

wal_write
wal_sync
wal_write_time
wal_sync_time

- case SYNC_METHOD_OPEN:
- case SYNC_METHOD_OPEN_DSYNC:
- /* write synced it already */
- break;

IMO it's better to add Assert(false) here to ensure that we never reach
this code, as follows. Thoughts?

+ case SYNC_METHOD_OPEN:
+ case SYNC_METHOD_OPEN_DSYNC:
+ /* not reachable */
+ Assert(false);
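
For readers following along, the combination of the early return
discussed above and the unreachable assertion can be sketched in
standalone form as below. This is only an illustration, not the patch
itself: SyncMethod, SyncStats and issue_sync() are hypothetical names,
the standard assert() stands in for Assert(), and the actual
fsync()/fdatasync() calls are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the sync_method values discussed above. */
typedef enum SyncMethod
{
    SYNC_FSYNC,
    SYNC_FSYNC_WRITETHROUGH,
    SYNC_FDATASYNC,
    SYNC_OPEN,
    SYNC_OPEN_DSYNC
} SyncMethod;

typedef struct SyncStats
{
    long sync_calls;   /* number of explicit sync requests issued */
} SyncStats;

/*
 * Returns true if an explicit sync was issued (and counted), false if
 * there was nothing to do.
 */
bool
issue_sync(bool enable_fsync, SyncMethod method, SyncStats *stats)
{
    /*
     * Quick exit: fsync is disabled, or the write already synced the
     * data because the file was opened with a sync flag.
     */
    if (!enable_fsync || method == SYNC_OPEN || method == SYNC_OPEN_DSYNC)
        return false;

    switch (method)
    {
        case SYNC_FSYNC:
        case SYNC_FSYNC_WRITETHROUGH:
        case SYNC_FDATASYNC:
            /* a real implementation would call fsync()/fdatasync() here */
            break;
        case SYNC_OPEN:
        case SYNC_OPEN_DSYNC:
            /* not reachable: filtered out by the early return above */
            assert(false);
            break;
    }

    stats->sync_calls++;
    return true;
}
```

With the early return in place, the counter is only touched on the
paths that really issue a sync, and the assertion documents that the
open-sync cases can no longer fall through to the switch.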

Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
Walwriter, on the other hand, doesn't do that. Shouldn't walwriter also
send the stats at its exit? Otherwise some stats can fail to be collected.
But ISTM that this issue already existed before this patch (for example,
checkpointer doesn't call pgstat_send_bgwriter() at its exit), so is it
overkill to fix this issue in this patch?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment	Content-Type	Size
v13-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch	text/plain	21.3 KB

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-04 23:38:20
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-05 01:02, Fujii Masao wrote:
> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>
>>>>>>>>> I pgindented the patches.
>>>>>>>>
>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>
>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>> original)
>>>>>>>> You missed the adding the space before an opening parenthesis
>>>>>>>> here and
>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>
>>>>>>>> is ether -> is either
>>>>>>>> "This parameter is off by default as it will repeatedly query
>>>>>>>> the
>>>>>>>> operating system..."
>>>>>>>> ", because" -> "as"
>>>>>>>
>>>>>>> Thanks, I fixed them.
>>>>>>>
>>>>>>>> wal_write_time and the sync items also need the note: "This is
>>>>>>>> also
>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>
>>>>>>> I skipped changing it since I separated the stats for the WAL
>>>>>>> receiver
>>>>>>> in pg_stat_wal_receiver.
>>>>>>>
>>>>>>>> "The number of times it happened..." -> " (the tally of this
>>>>>>>> event is
>>>>>>>> reported in wal_buffers_full in....) This is undesirable because
>>>>>>>> ..."
>>>>>>>
>>>>>>> Thanks, I fixed it.
>>>>>>>
>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>> explicitly
>>>>>>>> computing the sync statistics but does require computing the
>>>>>>>> write
>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync
>>>>>>>> but
>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I
>>>>>>>> observe that
>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the
>>>>>>>> WAL
>>>>>>>> receiver path does not. It seems technically straight-forward
>>>>>>>> to
>>>>>>>> refactor here to avoid the almost-duplicated logic in the two
>>>>>>>> places,
>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>> processing
>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and
>>>>>>>> don't have
>>>>>>>> any shared code between the two but instead implement the WAL
>>>>>>>> receiver
>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>
>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>
>>>>>> Thanks for updating the patches!
>>>>>>
>>>>>>
>>>>>>> I added the infrastructure code to communicate the WAL receiver
>>>>>>> stats messages between the WAL receiver and the stats collector,
>>>>>>> and
>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>> What do you think?
>>>>>>
>>>>>> On second thought, this idea seems not good. Because those stats
>>>>>> are
>>>>>> collected between multiple walreceivers, but other values in
>>>>>> pg_stat_wal_receiver is only related to the walreceiver process
>>>>>> running
>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>> dynamic
>>>>>> stats and the others show collected stats, even though they are in
>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>
>>>>> OK, I fixed it.
>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal
>>>>> view in v11 patch.
>>>>
>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>
>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>> +    if (enableFsync &&
>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>> +    {
>>>>
>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>> to simplify the code more?
>>>
>>> Thanks for the comments.
>>> I added the above code in v12 patch.
>>>
>>>>
>>>> +        /*
>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed
>>>> to minimize
>>>> +         * the overhead in WAL-writing.
>>>> +         */
>>>> +        if (rc & WL_TIMEOUT)
>>>> +            pgstat_send_wal();
>>>>
>>>> On second thought, this change means that it always takes
>>>> wal_writer_delay
>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is
>>>> called.
>>>> For example, if wal_writer_delay is set to several seconds, some
>>>> values in
>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>
>>> Thanks, I didn't notice that.
>>>
>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>> default value is 200msec and it may be set shorter time.
>
> Yeah, if wal_writer_delay is set to very small value, there is a risk
> that the WAL stats are sent too frequently. I agree that's a problem.
>
>>>
>>> Why don't to make another way to check the timestamp?
>>>
>>> +               /*
>>> +                * Don't send a message unless it's been at least
>>> PGSTAT_STAT_INTERVAL
>>> +                * msec since we last sent one
>>> +                */
>>> +               now = GetCurrentTimestamp();
>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>> PGSTAT_STAT_INTERVAL))
>>> +               {
>>> +                       pgstat_send_wal();
>>> +                       last_report = now;
>>> +               }
>>> +
>>>
>>> Although I worried that it's better to add the check code in
>>> pgstat_send_wal(),
>
> Agreed.
>
>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks
>>> the
>>> PGSTAT_STAT_INTERVAL.
>
> I think that we can do that. What about the attached patch?

Thanks, I think that's better.

>> I forgot to remove an unused variable.
>> The attached v13 patch is fixed.
>
> Thanks for updating the patch!
>
> + w.wal_write,
> + w.wal_write_time,
> + w.wal_sync,
> + w.wal_sync_time,
>
> It's more natural to put wal_write_time and wal_sync_time next to
> each other? That is, what about the following order of columns?
>
> wal_write
> wal_sync
> wal_write_time
> wal_sync_time

Yes, I fixed it.

> - case SYNC_METHOD_OPEN:
> - case SYNC_METHOD_OPEN_DSYNC:
> - /* write synced it already */
> - break;
>
> IMO it's better to add Assert(false) here to ensure that we never reach
> here, as follows. Thought?
>
> + case SYNC_METHOD_OPEN:
> + case SYNC_METHOD_OPEN_DSYNC:
> + /* not reachable */
> + Assert(false);

I agree.

> Even when a backend exits, it sends the stats via
> pgstat_beshutdown_hook().
> On the other hand, walwriter doesn't do that. Walwriter also should
> send
> the stats even at its exit? Otherwise some stats can fail to be
> collected.
> But ISTM that this issue existed from before, for example checkpointer
> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to
> fix
> this issue in this patch?

Thanks, I think it's better to do so.
I added shutdown hooks for the walwriter and the checkpointer in the
v14-0003 patch.
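
As a rough illustration of why such a hook matters, here is a minimal
standalone sketch. It does not reproduce the v14-0003 patch:
WalCounters, count_write(), count_sync(), send_pending() and
process_shutdown_hook() are hypothetical names, and the "collector" is
just a second struct in the same program.

```c
/* Hypothetical counters; not the real PgStat message structs. */
typedef struct WalCounters
{
    long writes;
    long syncs;
} WalCounters;

WalCounters pending;     /* accumulated locally, not yet reported */
WalCounters collected;   /* what the simulated collector has received */

void
count_write(void)
{
    pending.writes++;
}

void
count_sync(void)
{
    pending.syncs++;
}

/* Push the locally accumulated counters to the collector and reset them. */
void
send_pending(void)
{
    collected.writes += pending.writes;
    collected.syncs += pending.syncs;
    pending.writes = 0;
    pending.syncs = 0;
}

/*
 * Called once when the process exits. Without this final flush, any
 * activity counted since the last periodic send_pending() is lost.
 */
void
process_shutdown_hook(void)
{
    send_pending();
}
```

In this sketch, anything counted after the last periodic send reaches
the collector only because the hook performs one more flush at exit.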

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v14-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	20.2 KB
v14-0003-Add-shutdown-hooks-to-send-statistics.patch	text/x-diff	3.7 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-05 03:47:00
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/05 8:38, Masahiro Ikeda wrote:
> On 2021-03-05 01:02, Fujii Masao wrote:
>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>
>>>>>>>>>> I pgindented the patches.
>>>>>>>>>
>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>
>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>
>>>>>>>>> is ether -> is either
>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>> operating system..."
>>>>>>>>> ", because" -> "as"
>>>>>>>>
>>>>>>>> Thanks, I fixed them.
>>>>>>>>
>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>
>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>
>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>
>>>>>>>> Thanks, I fixed it.
>>>>>>>>
>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>
>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>
>>>>>>> Thanks for updating the patches!
>>>>>>>
>>>>>>>
>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>> What do you think?
>>>>>>>
>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>
>>>>>> OK, I fixed it.
>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>
>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>
>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>> +    if (enableFsync &&
>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>> +    {
>>>>>
>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>> to simplify the code more?
>>>>
>>>> Thanks for the comments.
>>>> I added the above code in v12 patch.
>>>>
>>>>>
>>>>> +        /*
>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>> +         * the overhead in WAL-writing.
>>>>> +         */
>>>>> +        if (rc & WL_TIMEOUT)
>>>>> +            pgstat_send_wal();
>>>>>
>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>
>>>> Thanks, I didn't notice that.
>>>>
>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>> default value is 200msec and it may be set shorter time.
>>
>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>> that the WAL stats are sent too frequently. I agree that's a problem.
>>
>>>>
>>>> Why don't to make another way to check the timestamp?
>>>>
>>>> +               /*
>>>> +                * Don't send a message unless it's been at least
>>>> PGSTAT_STAT_INTERVAL
>>>> +                * msec since we last sent one
>>>> +                */
>>>> +               now = GetCurrentTimestamp();
>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>> PGSTAT_STAT_INTERVAL))
>>>> +               {
>>>> +                       pgstat_send_wal();
>>>> +                       last_report = now;
>>>> +               }
>>>> +
>>>>
>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>
>> Agreed.
>>
>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>> PGSTAT_STAT_INTERVAL.
>>
>> I think that we can do that. What about the attached patch?
>
> Thanks, I thought it's better.
>
>
>>> I forgot to remove an unused variable.
>>> The attached v13 patch is fixed.
>>
>> Thanks for updating the patch!
>>
>> +        w.wal_write,
>> +        w.wal_write_time,
>> +        w.wal_sync,
>> +        w.wal_sync_time,
>>
>> It's more natural to put wal_write_time and wal_sync_time next to
>> each other? That is, what about the following order of columns?
>>
>> wal_write
>> wal_sync
>> wal_write_time
>> wal_sync_time
>
> Yes, I fixed it.
>
>> -        case SYNC_METHOD_OPEN:
>> -        case SYNC_METHOD_OPEN_DSYNC:
>> -            /* write synced it already */
>> -            break;
>>
>> IMO it's better to add Assert(false) here to ensure that we never reach
>> here, as follows. Thought?
>>
>> +        case SYNC_METHOD_OPEN:
>> +        case SYNC_METHOD_OPEN_DSYNC:
>> +            /* not reachable */
>> +            Assert(false);
>
> I agree.
>
>
>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>> On the other hand, walwriter doesn't do that. Walwriter also should send
>> the stats even at its exit? Otherwise some stats can fail to be collected.
>> But ISTM that this issue existed from before, for example checkpointer
>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>> this issue in this patch?
>
> Thanks, I thought it's better to do so.
> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.

Thanks!

It seems you forgot to include the changes to expected/rules.out in the 0001
patch, which caused the regression test to fail. Attached is the updated
version of the patch, which includes expected/rules.out.

+ PgStat_Counter m_wal_write_time; /* time spend writing wal records in
+ * micro seconds */
+ PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
+ * seconds */

IMO "spend" should be "spent". Also, "micro seconds" should be "microseconds"
for the sake of consistency with other comments in pgstat.h. I fixed both.

Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
when pgstat_send_wal() returned without sending any message,
pgstat_report_wal() saved the current pgWalUsage, and that saved value was
then used as the baseline for the subsequent calculation of WAL usage. As a
result, some counters were never sent to the collector. This is a bug that I
introduced, and I fixed it.
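
To make the failure mode concrete, here is a minimal standalone sketch. It is
not the patch code: WalStatsSketch, send_delta(), report_buggy(),
report_fixed() and run_scenario() are hypothetical names, and the single
"current" counter only stands in for pgWalUsage. It assumes the sender skips
the message when the reporting interval has not elapsed yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the real state; not PostgreSQL's structs. */
typedef struct
{
    int64_t current;    /* running counter, stand-in for pgWalUsage */
    int64_t baseline;   /* value of "current" saved by the last report */
    int64_t collected;  /* total that reached the "collector" */
} WalStatsSketch;

/* Sends the delta only if the interval has elapsed; reports whether it did. */
static bool
send_delta(WalStatsSketch *s, int64_t delta, bool interval_elapsed)
{
    if (!interval_elapsed)
        return false;           /* nothing sent this time */
    s->collected += delta;
    return true;
}

/* Buggy shape: the baseline advances even when nothing was sent. */
static void
report_buggy(WalStatsSketch *s, bool interval_elapsed)
{
    int64_t delta = s->current - s->baseline;

    (void) send_delta(s, delta, interval_elapsed);
    s->baseline = s->current;   /* the unsent delta is lost here */
}

/* Fixed shape: the baseline advances only after a successful send. */
static void
report_fixed(WalStatsSketch *s, bool interval_elapsed)
{
    int64_t delta = s->current - s->baseline;

    if (send_delta(s, delta, interval_elapsed))
        s->baseline = s->current;
}

/* Generates 10 units, reports too early, generates 5 more, reports again. */
static int64_t
run_scenario(bool use_fixed)
{
    WalStatsSketch s = {0, 0, 0};
    void (*report) (WalStatsSketch *, bool) =
        use_fixed ? report_fixed : report_buggy;

    s.current += 10;
    report(&s, false);          /* interval not elapsed: message skipped */
    s.current += 5;
    report(&s, true);           /* interval elapsed: message sent */
    return s.collected;
}
```

In this sketch run_scenario(false) delivers only the 5 units generated after
the skipped report, while run_scenario(true) delivers all 15.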

+ walStats.wal_write += msg->m_wal_write;
+ walStats.wal_write_time += msg->m_wal_write_time;
+ walStats.wal_sync += msg->m_wal_sync;
+ walStats.wal_sync_time += msg->m_wal_sync_time;

I changed the order of the above in pgstat.c so that wal_write_time and
wal_sync_time are placed next to each other.

The following are my comments on the docs part. I have not updated the docs
in the patch yet because I'm not sure how to change them for now.

+ Number of times WAL buffers were written out to disk via
+ <function>XLogWrite</function>, which is invoked during an
+ <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
+ </para></entry>

XLogWrite() can also be invoked from functions other than XLogFlush(),
for example XLogBackgroundFlush(). So the above description might be
confusing.

+ Number of times WAL files were synced to disk via
+ <function>issue_xlog_fsync</function>, which is invoked during an
+ <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)

Same as above.

+ while <xref linkend="guc-wal-sync-method"/> was set to one of the
+ "sync at commit" options (i.e., <literal>fdatasync</literal>,
+ <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).

Don't open_sync and open_datasync also sync at commit? I'm not sure
"sync at commit" is the right term to distinguish fdatasync, fsync and
fsync_writethrough from them.
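
As a rough illustration of the distinction in question, the sketch below
classifies the methods by whether a separate sync call is issued after the
write. It follows the early-return idea discussed upthread for
issue_xlog_fsync(), but it is only a sketch: SketchSyncMethod and
issues_separate_sync() are hypothetical names, not the real enum or function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the wal_sync_method choices discussed here. */
typedef enum
{
    SKETCH_SYNC_FSYNC,
    SKETCH_SYNC_FDATASYNC,
    SKETCH_SYNC_FSYNC_WRITETHROUGH,
    SKETCH_SYNC_OPEN,           /* open_sync */
    SKETCH_SYNC_OPEN_DSYNC      /* open_datasync */
} SketchSyncMethod;

/*
 * Returns true when a separate sync call would follow the write, which is
 * the case that would be counted in wal_sync in this sketch.
 */
static bool
issues_separate_sync(bool enable_fsync, SketchSyncMethod method)
{
    if (!enable_fsync)
        return false;           /* fsync is off: nothing is synced */

    switch (method)
    {
        case SKETCH_SYNC_FSYNC:
        case SKETCH_SYNC_FDATASYNC:
        case SKETCH_SYNC_FSYNC_WRITETHROUGH:
            return true;        /* explicit sync call after the write */
        case SKETCH_SYNC_OPEN:
        case SKETCH_SYNC_OPEN_DSYNC:
            return false;       /* the write itself already synced the data */
    }
    return false;               /* not reached for valid enum values */
}
```

So all five methods make the data durable by commit time; what differs is
whether a separate call exists that could be counted and timed.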

+ <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.

Is the "with microsecond resolution" part really necessary?

+ transaction records are flushed to permanent storage.
+ <function>XLogFlush</function> calls <function>XLogWrite</function> to write
+ and <function>issue_xlog_fsync</function> to flush them, which are counted as
+ <literal>wal_write</literal> and <literal>wal_sync</literal> in
+ <xref linkend="pg-stat-wal-view"/>. On systems with high log output,

This description might mislead users into thinking that XLogFlush() calls
issue_xlog_fsync() directly. Since issue_xlog_fsync() is called by
XLogWrite(), ISTM that this description needs to be updated.
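
The call relationship I have in mind can be sketched as below. This is a
deliberately simplified toy, not the real functions: all sketch_* names and
WalCountersSketch are hypothetical, and the real XLogWrite() does far more and
may write more than once per call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters standing in for wal_write and wal_sync. */
typedef struct
{
    long wal_write;
    long wal_sync;
} WalCountersSketch;

/* Stand-in for issue_xlog_fsync(): reached only through the write path. */
static void
sketch_issue_fsync(WalCountersSketch *c)
{
    c->wal_sync++;
}

/* Stand-in for XLogWrite(): it writes, and it is what triggers the sync. */
static void
sketch_xlog_write(WalCountersSketch *c, bool flush_requested)
{
    c->wal_write++;
    if (flush_requested)
        sketch_issue_fsync(c);
}

/* Stand-in for XLogFlush(): one caller of the write path. */
static void
sketch_xlog_flush(WalCountersSketch *c)
{
    sketch_xlog_write(c, true);
}

/* Stand-in for XLogBackgroundFlush(): another caller of the same path. */
static void
sketch_background_flush(WalCountersSketch *c)
{
    sketch_xlog_write(c, true);
}
```

Both callers bump the same two counters, and the sync is only ever reached
through the write function, which is why the docs should not tie the counters
to XLogFlush() alone.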

Each line in the above seems to end with a space character.
This space character should be removed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment	Content-Type	Size
v14-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch	text/plain	21.2 KB

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-05 10:54:23
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-05 12:47, Fujii Masao wrote:
> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>> On 2021-03-05 01:02, Fujii Masao wrote:
>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>
>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is
>>>>>>>>>> also
>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>
>>>>>>>>>> ("which normally called" should be "which is normally called"
>>>>>>>>>> or
>>>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>>>> original)
>>>>>>>>>> You missed the adding the space before an opening parenthesis
>>>>>>>>>> here and
>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>
>>>>>>>>>> is ether -> is either
>>>>>>>>>> "This parameter is off by default as it will repeatedly query
>>>>>>>>>> the
>>>>>>>>>> operating system..."
>>>>>>>>>> ", because" -> "as"
>>>>>>>>>
>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>
>>>>>>>>>> wal_write_time and the sync items also need the note: "This is
>>>>>>>>>> also
>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>
>>>>>>>>> I skipped changing it since I separated the stats for the WAL
>>>>>>>>> receiver
>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>
>>>>>>>>>> "The number of times it happened..." -> " (the tally of this
>>>>>>>>>> event is
>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable
>>>>>>>>>> because ..."
>>>>>>>>>
>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>
>>>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>>>> explicitly
>>>>>>>>>> computing the sync statistics but does require computing the
>>>>>>>>>> write
>>>>>>>>>> statistics. This is because of the presence of
>>>>>>>>>> issue_xlog_fsync but
>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I
>>>>>>>>>> observe that
>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the
>>>>>>>>>> WAL
>>>>>>>>>> receiver path does not. It seems technically straight-forward
>>>>>>>>>> to
>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two
>>>>>>>>>> places,
>>>>>>>>>> though I suspect there may be a trade-off for not adding
>>>>>>>>>> another
>>>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>>>> processing
>>>>>>>>>> (though that seems marginalized compared to the cost of
>>>>>>>>>> actually
>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and
>>>>>>>>>> don't have
>>>>>>>>>> any shared code between the two but instead implement the WAL
>>>>>>>>>> receiver
>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>
>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>
>>>>>>>> Thanks for updating the patches!
>>>>>>>>
>>>>>>>>
>>>>>>>>> I added the infrastructure code to communicate the WAL receiver
>>>>>>>>> stats messages between the WAL receiver and the stats
>>>>>>>>> collector, and
>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> On second thought, this idea seems not good. Because those stats
>>>>>>>> are
>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process
>>>>>>>> running
>>>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>>>> dynamic
>>>>>>>> stats and the others show collected stats, even though they are
>>>>>>>> in
>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>
>>>>>>> OK, I fixed it.
>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal
>>>>>>> view in v11 patch.
>>>>>>
>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>
>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>> +    if (enableFsync &&
>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>> +    {
>>>>>>
>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>> to simplify the code more?
>>>>>
>>>>> Thanks for the comments.
>>>>> I added the above code in v12 patch.
>>>>>
>>>>>>
>>>>>> +        /*
>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed
>>>>>> to minimize
>>>>>> +         * the overhead in WAL-writing.
>>>>>> +         */
>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>> +            pgstat_send_wal();
>>>>>>
>>>>>> On second thought, this change means that it always takes
>>>>>> wal_writer_delay
>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush()
>>>>>> is called.
>>>>>> For example, if wal_writer_delay is set to several seconds, some
>>>>>> values in
>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those
>>>>>> seconds.
>>>>>> So I'm thinking to withdraw my previous comment and it's ok to
>>>>>> send
>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>
>>>>> Thanks, I didn't notice that.
>>>>>
>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>> default value is 200msec and it may be set shorter time.
>>>
>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>
>>>>>
>>>>> Why don't to make another way to check the timestamp?
>>>>>
>>>>> +               /*
>>>>> +                * Don't send a message unless it's been at least
>>>>> PGSTAT_STAT_INTERVAL
>>>>> +                * msec since we last sent one
>>>>> +                */
>>>>> +               now = GetCurrentTimestamp();
>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>> PGSTAT_STAT_INTERVAL))
>>>>> +               {
>>>>> +                       pgstat_send_wal();
>>>>> +                       last_report = now;
>>>>> +               }
>>>>> +
>>>>>
>>>>> Although I worried that it's better to add the check code in
>>>>> pgstat_send_wal(),
>>>
>>> Agreed.
>>>
>>>>> I didn't do so because to avoid to double check
>>>>> PGSTAT_STAT_INTERVAL.
>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks
>>>>> the
>>>>> PGSTAT_STAT_INTERVAL.
>>>
>>> I think that we can do that. What about the attached patch?
>>
>> Thanks, I thought it's better.
>>
>>
>>>> I forgot to remove an unused variable.
>>>> The attached v13 patch is fixed.
>>>
>>> Thanks for updating the patch!
>>>
>>> +        w.wal_write,
>>> +        w.wal_write_time,
>>> +        w.wal_sync,
>>> +        w.wal_sync_time,
>>>
>>> It's more natural to put wal_write_time and wal_sync_time next to
>>> each other? That is, what about the following order of columns?
>>>
>>> wal_write
>>> wal_sync
>>> wal_write_time
>>> wal_sync_time
>>
>> Yes, I fixed it.
>>
>>> -        case SYNC_METHOD_OPEN:
>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>> -            /* write synced it already */
>>> -            break;
>>>
>>> IMO it's better to add Assert(false) here to ensure that we never
>>> reach
>>> here, as follows. Thought?
>>>
>>> +        case SYNC_METHOD_OPEN:
>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>> +            /* not reachable */
>>> +            Assert(false);
>>
>> I agree.
>>
>>
>>> Even when a backend exits, it sends the stats via
>>> pgstat_beshutdown_hook().
>>> On the other hand, walwriter doesn't do that. Walwriter also should
>>> send
>>> the stats even at its exit? Otherwise some stats can fail to be
>>> collected.
>>> But ISTM that this issue existed from before, for example
>>> checkpointer
>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to
>>> fix
>>> this issue in this patch?
>>
>> Thanks, I thought it's better to do so.
>> I added the shutdown hook for the walwriter and the checkpointer in
>> v14-0003 patch.
>
> Thanks!
>
> Seems you forgot to include the changes of expected/rules.out in 0001 patch,
> and which caused the regression test to fail. Attached is the updated version
> of the patch. I included expected/rules.out in it.

Sorry.

> + PgStat_Counter m_wal_write_time; /* time spend writing wal records in
> + * micro seconds */
> + PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
> + * seconds */
>
> IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
> in sake of consistent with other comments in pgstat.h. I fixed them.

Thanks.

> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
> when pgstat_send_wal() returned without sending any message,
> pgstat_report_wal() saved current pgWalUsage and that counter was used for
> the subsequent calculation of WAL usage. This caused some counters not to
> be sent to the collector. This is a bug that I added. I fixed this bug.

Thanks.

> + walStats.wal_write += msg->m_wal_write;
> + walStats.wal_write_time += msg->m_wal_write_time;
> + walStats.wal_sync += msg->m_wal_sync;
> + walStats.wal_sync_time += msg->m_wal_sync_time;
>
> I changed the order of the above in pgstat.c so that wal_write_time and
> wal_sync_time are placed in next to each other.

I forgot to fix them, thanks.

> The followings are the comments for the docs part. I've not updated this
> in the patch yet because I'm not sure how to change them for now.
>
> + Number of times WAL buffers were written out to disk via
> + <function>XLogWrite</function>, which is invoked during an
> + <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
> + </para></entry>
>
> XLogWrite() can be invoked during the functions other than XLogFlush().
> For example, XLogBackgroundFlush(). So the above description might be
> confusing?
>
> + Number of times WAL files were synced to disk via
> + <function>issue_xlog_fsync</function>, which is invoked during an
> + <function>XLogFlush</function> request (see <xref linkend="wal-configuration"/>)
>
> Same as above.

Yes. How about removing "XLogFlush" from the above descriptions, since the
XLogWrite() description is already covered in wal.sgml?

However, since it is currently mentioned there only for the backend,
I added a description for the WAL writer in the attached patch.

> + while <xref linkend="guc-wal-sync-method"/> was set to one of the
> + "sync at commit" options (i.e., <literal>fdatasync</literal>,
> + <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>
> Even open_sync and open_datasync do the sync at commit. No? I'm not sure
> if "sync at commit" is right term to indicate fdatasync, fsync and
> fsync_writethrough.

Yes. How about changing it to the following?

```
while <xref linkend="guc-wal-sync-method"/> was set to one of the
options which specific fsync method is called (i.e.,
<literal>fdatasync</literal>,
<literal>fsync</literal>, or
<literal>fsync_writethrough</literal>)
```

> + <literal>open_sync</literal>. Units are in milliseconds with microsecond resolution.
>
> "with microsecond resolution" part is really necessary?

I removed it, because blk_read_time in pg_stat_database is measured the
same way, but its description doesn't mention the resolution.

> + transaction records are flushed to permanent storage.
> + <function>XLogFlush</function> calls <function>XLogWrite</function> to write
> + and <function>issue_xlog_fsync</function> to flush them, which are counted as
> + <literal>wal_write</literal> and <literal>wal_sync</literal> in
> + <xref linkend="pg-stat-wal-view"/>. On systems with high log output,
>
> This description might cause users to misread that XLogFlush() calls
> issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
> ISTM that this description needs to be updated.

Understood. I fixed the description to mention that XLogWrite()
calls issue_xlog_fsync().

> Each line in the above seems to end with a space character.
> This space character should be removed.

Sorry for that. I removed it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v15-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	22.5 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-08 04:44:01
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/05 19:54, Masahiro Ikeda wrote:
> On 2021-03-05 12:47, Fujii Masao wrote:
>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>
>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>
>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>
>>>>>>>>>>> is ether -> is either
>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>> operating system..."
>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>
>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>
>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>
>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>
>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>
>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>
>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>
>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>
>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>
>>>>>>>> OK, I fixed it.
>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>
>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>
>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>> +    if (enableFsync &&
>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>> +    {
>>>>>>>
>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>> to simplify the code more?
>>>>>>
>>>>>> Thanks for the comments.
>>>>>> I added the above code in v12 patch.
>>>>>>
>>>>>>>
>>>>>>> +        /*
>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>> +         * the overhead in WAL-writing.
>>>>>>> +         */
>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>> +            pgstat_send_wal();
>>>>>>>
>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>
>>>>>> Thanks, I didn't notice that.
>>>>>>
>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>> default value is 200msec and it may be set shorter time.
>>>>
>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>
>>>>>>
>>>>>> Why don't to make another way to check the timestamp?
>>>>>>
>>>>>> +               /*
>>>>>> +                * Don't send a message unless it's been at least
>>>>>> PGSTAT_STAT_INTERVAL
>>>>>> +                * msec since we last sent one
>>>>>> +                */
>>>>>> +               now = GetCurrentTimestamp();
>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>> +               {
>>>>>> +                       pgstat_send_wal();
>>>>>> +                       last_report = now;
>>>>>> +               }
>>>>>> +
>>>>>>
>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>
>>>> Agreed.
>>>>
>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>> PGSTAT_STAT_INTERVAL.
>>>>
>>>> I think that we can do that. What about the attached patch?
>>>
>>> Thanks, I thought it's better.
>>>
>>>
>>>>> I forgot to remove an unused variable.
>>>>> The attached v13 patch is fixed.
>>>>
>>>> Thanks for updating the patch!
>>>>
>>>> +        w.wal_write,
>>>> +        w.wal_write_time,
>>>> +        w.wal_sync,
>>>> +        w.wal_sync_time,
>>>>
>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>> each other? That is, what about the following order of columns?
>>>>
>>>> wal_write
>>>> wal_sync
>>>> wal_write_time
>>>> wal_sync_time
>>>
>>> Yes, I fixed it.
>>>
>>>> -        case SYNC_METHOD_OPEN:
>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>> -            /* write synced it already */
>>>> -            break;
>>>>
>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>> here, as follows. Thought?
>>>>
>>>> +        case SYNC_METHOD_OPEN:
>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>> +            /* not reachable */
>>>> +            Assert(false);
>>>
>>> I agree.
>>>
>>>
>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>> But ISTM that this issue existed from before, for example checkpointer
>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>> this issue in this patch?
>>>
>>> Thanks, I thought it's better to do so.
>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>
>> Thanks!
>>
>> Seems you forgot to include the changes of expected/rules.out in 0001 patch,
>> and which caused the regression test to fail. Attached is the updated version
>> of the patch. I included expected/rules.out in it.
>
> Sorry.
>
>> +    PgStat_Counter m_wal_write_time;    /* time spend writing wal records in
>> +                                         * micro seconds */
>> +    PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
>> +                                     * seconds */
>>
>> IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
>> in sake of consistent with other comments in pgstat.h. I fixed them.
>
> Thanks.
>
>> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
>> when pgstat_send_wal() returned without sending any message,
>> pgstat_report_wal() saved current pgWalUsage and that counter was used for
>> the subsequent calculation of WAL usage. This caused some counters not to
>> be sent to the collector. This is a bug that I added. I fixed this bug.
>
> Thanks.
>
>
>> +    walStats.wal_write += msg->m_wal_write;
>> +    walStats.wal_write_time += msg->m_wal_write_time;
>> +    walStats.wal_sync += msg->m_wal_sync;
>> +    walStats.wal_sync_time += msg->m_wal_sync_time;
>>
>> I changed the order of the above in pgstat.c so that wal_write_time and
>> wal_sync_time are placed in next to each other.
>
> I forgot to fix them, thanks.
>
>
>> The followings are the comments for the docs part. I've not updated this
>> in the patch yet because I'm not sure how to change them for now.
>> +       Number of times WAL buffers were written out to disk via
>> +       <function>XLogWrite</function>, which is invoked during an
>> +       <function>XLogFlush</function> request (see <xref
>> linkend="wal-configuration"/>)
>> +      </para></entry>
>>
>> XLogWrite() can be invoked during the functions other than XLogFlush().
>> For example, XLogBackgroundFlush(). So the above description might be
>> confusing?
>>
>> +       Number of times WAL files were synced to disk via
>> +       <function>issue_xlog_fsync</function>, which is invoked during an
>> +       <function>XLogFlush</function> request (see <xref
>> linkend="wal-configuration"/>)
>>
>> Same as above.
>
> Yes, why don't you remove "XLogFlush" in the above comments
> because XLogWrite() description is covered in wal.sgml?
>
> But, now it's mentioned only for backend,
> I added the comments for the wal writer in the attached patch.
>
>
>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of the
>> +       "sync at commit" options (i.e., <literal>fdatasync</literal>,
>> +       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>>
>> Even open_sync and open_datasync do the sync at commit. No? I'm not sure
>> if "sync at commit" is right term to indicate fdatasync, fsync and
>> fsync_writethrough.
>
> Yes, why don't you change to the following comments?
>
> ```
>        while <xref linkend="guc-wal-sync-method"/> was set to one of the
>        options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
>        <literal>fsync</literal>, or <literal>fsync_writethrough</literal>)
> ```
>
>> +       <literal>open_sync</literal>. Units are in milliseconds with
>> microsecond resolution.
>>
>> "with microsecond resolution" part is really necessary?
>
> I removed it because blk_read_time in pg_stat_database is the same above,
> but it doesn't mention it.
>
>
>> +   transaction records are flushed to permanent storage.
>> +   <function>XLogFlush</function> calls <function>XLogWrite</function> to write
>> +   and <function>issue_xlog_fsync</function> to flush them, which are
>> counted as
>> +   <literal>wal_write</literal> and <literal>wal_sync</literal> in
>> +   <xref linkend="pg-stat-wal-view"/>. On systems with high log output,
>>
>> This description might cause users to misread that XLogFlush() calls
>> issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
>> ISTM that this description needs to be updated.
>
> I understood. I fixed to mention that XLogWrite()
> calls issue_xlog_fsync().
>
>
>> Each line in the above seems to end with a space character.
>> This space character should be removed.
>
> Sorry for that. I removed it.

Thanks for updating the patch! I think it's getting into good shape!

- pid | wait_event_type | wait_event
+ pid | wait_event_type | wait_event

This change is not necessary?

- every <xref linkend="guc-wal-writer-delay"/> milliseconds.
+ every <xref linkend="guc-wal-writer-delay"/> milliseconds, which calls
+ <function>XLogWrite</function> to write and <function>XLogWrite</function>
+ <function>issue_xlog_fsync</function> to flush them. They are counted as
+ <literal>wal_write</literal> and <literal>wal_sync</literal> in
+ <xref linkend="pg-stat-wal-view"/>.

Isn't it better to avoid using terms like XLogWrite or issue_xlog_fsync
before explaining what they are? They are explained later. At least for me,
it's fine to leave this change out.

- to write (move to kernel cache) a few filled <acronym>WAL</acronym>
- buffers. This is undesirable because <function>XLogInsertRecord</function>
+ to call <function>XLogWrite</function> to write (move to kernel cache) a
+ few filled <acronym>WAL</acronym> buffers (the tally of this event is reported in
+ <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
+ This is undesirable because <function>XLogInsertRecord</function>

This paragraph explains the relationship between WAL writes and WAL buffers. I don't think it's good to add different context to this paragraph. Instead, what about adding a new paragraph like the following?

----------------------------------
When track_wal_io_timing is enabled, the total amounts of time XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are counted as wal_write_time and wal_sync_time in pg_stat_wal view, respectively. XLogWrite is normally called by XLogInsertRecord (when there is no space for the new record in WAL buffers), XLogFlush and the WAL writer, to write WAL buffers to disk and call issue_xlog_fsync. If wal_sync_method is either open_datasync or open_sync, a write operation in XLogWrite guarantees to sync written WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method is either fdatasync, fsync, or fsync_writethrough, the write operation moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files to disk. Regardless of the setting of track_wal_io_timing, the numbers of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are also counted as wal_write and wal_sync in pg_stat_wal, respectively.
----------------------------------
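
A minimal sketch of the accounting that paragraph describes, in case it helps to see it as code. This is not taken from the patch: WalStatsSketch, record_wal_write, record_wal_sync and track_wal_io_timing_sketch are hypothetical names, and the elapsed time is passed in by the caller instead of being measured around the real write and sync calls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the four counters shown in pg_stat_wal. */
typedef struct WalStatsSketch
{
    int64_t wal_write;      /* number of writes, always counted */
    int64_t wal_sync;       /* number of syncs, always counted */
    int64_t wal_write_time; /* accumulated only when timing is enabled */
    int64_t wal_sync_time;  /* accumulated only when timing is enabled */
} WalStatsSketch;

/* Models the track_wal_io_timing setting; assumed off by default. */
static bool track_wal_io_timing_sketch = false;

/* Count one write; add its duration only if timing is being tracked. */
static void
record_wal_write(WalStatsSketch *stats, int64_t elapsed_us)
{
    stats->wal_write++;
    if (track_wal_io_timing_sketch)
        stats->wal_write_time += elapsed_us;
}

/* Count one sync; add its duration only if timing is being tracked. */
static void
record_wal_sync(WalStatsSketch *stats, int64_t elapsed_us)
{
    stats->wal_sync++;
    if (track_wal_io_timing_sketch)
        stats->wal_sync_time += elapsed_us;
}
```

With the setting off, calling both functions advances wal_write and wal_sync but leaves the two time counters at zero, which is the "regardless of the setting" behavior in the last sentence of the paragraph.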

+ <function>issue_xlog_fsync</function> (see <xref linkend="wal-configuration"/>)

"request" should be placed just before "(see"?

+ Number of times WAL files were synced to disk via
+ <function>issue_xlog_fsync</function> (see <xref linkend="wal-configuration"/>)
+ while <xref linkend="guc-wal-sync-method"/> was set to one of the
+ options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
+ <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).

Isn't it better to mention the case of fsync=off? What about the following?

----------------------------------
Number of times WAL files were synced to disk via issue_xlog_fsync (see ...). This is zero when fsync is off or wal_sync_method is either open_datasync or open_sync.
----------------------------------

+ Total amount of time spent writing WAL buffers were written out to disk via

"were written out" is not necessary?

+ Total amount of time spent syncing WAL files to disk via
+ <function>issue_xlog_fsync</function> request (see <xref linkend="wal-configuration"/>)
+ while <xref linkend="guc-wal-sync-method"/> was set to one of the
+ options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
+ <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
+ Units are in milliseconds.
+ This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.

Isn't it better to explain the case where this counter is zero a bit more clearly as follows?

---------------------
This is zero when track_wal_io_timing is disabled, fsync is off, or wal_sync_method is either open_datasync or open_sync.
---------------------
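
The zero cases suggested above for wal_sync and wal_sync_time can be written as two small predicates. This is only an illustrative sketch: the SyncMethodSketch enum and both function names are hypothetical and merely mirror the wal_sync_method values named in this mail, not the identifiers used in xlog.c.

```c
#include <stdbool.h>

/* Hypothetical mirror of the wal_sync_method choices discussed here. */
typedef enum SyncMethodSketch
{
    SKETCH_FSYNC,
    SKETCH_FDATASYNC,
    SKETCH_FSYNC_WRITETHROUGH,
    SKETCH_OPEN_SYNC,
    SKETCH_OPEN_DATASYNC
} SyncMethodSketch;

/*
 * True when a separate sync call is issued and therefore counted in
 * wal_sync: fsync must be on and the method must not be one of the
 * open_* variants, where the write itself already syncs the data.
 */
static bool
wal_sync_is_counted(bool fsync_enabled, SyncMethodSketch method)
{
    if (!fsync_enabled)
        return false;
    return method != SKETCH_OPEN_SYNC && method != SKETCH_OPEN_DATASYNC;
}

/*
 * True when wal_sync_time can become non-zero: the sync must be counted
 * at all, and timing must be enabled.
 */
static bool
wal_sync_time_is_counted(bool track_timing, bool fsync_enabled,
                         SyncMethodSketch method)
{
    return track_timing && wal_sync_is_counted(fsync_enabled, method);
}
```

Splitting it this way shows that wal_sync_time has one more zero case than wal_sync, namely track_wal_io_timing being disabled.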

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-08 10:42:37
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-08 13:44, Fujii Masao wrote:
> On 2021/03/05 19:54, Masahiro Ikeda wrote:
>> On 2021-03-05 12:47, Fujii Masao wrote:
>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>
>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during
>>>>>>>>>>>> an
>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is
>>>>>>>>>>>> also
>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>
>>>>>>>>>>>> ("which normally called" should be "which is normally
>>>>>>>>>>>> called" or
>>>>>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>>>>>> original)
>>>>>>>>>>>> You missed the adding the space before an opening
>>>>>>>>>>>> parenthesis here and
>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>
>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>> "This parameter is off by default as it will repeatedly
>>>>>>>>>>>> query the
>>>>>>>>>>>> operating system..."
>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>
>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This
>>>>>>>>>>>> is also
>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>
>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL
>>>>>>>>>>> receiver
>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>
>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this
>>>>>>>>>>>> event is
>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable
>>>>>>>>>>>> because ..."
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>
>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>>>>>> explicitly
>>>>>>>>>>>> computing the sync statistics but does require computing the
>>>>>>>>>>>> write
>>>>>>>>>>>> statistics. This is because of the presence of
>>>>>>>>>>>> issue_xlog_fsync but
>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I
>>>>>>>>>>>> observe that
>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while
>>>>>>>>>>>> the WAL
>>>>>>>>>>>> receiver path does not. It seems technically
>>>>>>>>>>>> straight-forward to
>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the
>>>>>>>>>>>> two places,
>>>>>>>>>>>> though I suspect there may be a trade-off for not adding
>>>>>>>>>>>> another
>>>>>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>>>>>> processing
>>>>>>>>>>>> (though that seems marginalized compared to the cost of
>>>>>>>>>>>> actually
>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and
>>>>>>>>>>>> don't have
>>>>>>>>>>>> any shared code between the two but instead implement the
>>>>>>>>>>>> WAL receiver
>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case,
>>>>>>>>>>>> this
>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>
>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver
>>>>>>>>>>> stats.
>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>
>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I added the infrastructure code to communicate the WAL
>>>>>>>>>>> receiver stats messages between the WAL receiver and the
>>>>>>>>>>> stats collector, and
>>>>>>>>>>> the stats for WAL receiver is counted in
>>>>>>>>>>> pg_stat_wal_receiver.
>>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> On second thought, this idea seems not good. Because those
>>>>>>>>>> stats are
>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver
>>>>>>>>>> process running
>>>>>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>>>>>> dynamic
>>>>>>>>>> stats and the others show collected stats, even though they
>>>>>>>>>> are in
>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>
>>>>>>>>> OK, I fixed it.
>>>>>>>>> The stats collected in the WAL receiver is exposed in
>>>>>>>>> pg_stat_wal view in v11 patch.
>>>>>>>>
>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>
>>>>>>>> +    /* Check whether the WAL file was synced to disk right now
>>>>>>>> */
>>>>>>>> +    if (enableFsync &&
>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>> +    {
>>>>>>>>
>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>> if enableFsync is off, sync_method is open_sync or
>>>>>>>> open_data_sync,
>>>>>>>> to simplify the code more?
>>>>>>>
>>>>>>> Thanks for the comments.
>>>>>>> I added the above code in v12 patch.
>>>>>>>
>>>>>>>>
>>>>>>>> +        /*
>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has
>>>>>>>> elapsed to minimize
>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>> +         */
>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>> +            pgstat_send_wal();
>>>>>>>>
>>>>>>>> On second thought, this change means that it always takes
>>>>>>>> wal_writer_delay
>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush()
>>>>>>>> is called.
>>>>>>>> For example, if wal_writer_delay is set to several seconds, some
>>>>>>>> values in
>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those
>>>>>>>> seconds.
>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to
>>>>>>>> send
>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>
>>>>>>> Thanks, I didn't notice that.
>>>>>>>
>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>
>>>>> Yeah, if wal_writer_delay is set to very small value, there is a
>>>>> risk
>>>>> that the WAL stats are sent too frequently. I agree that's a
>>>>> problem.
>>>>>
>>>>>>>
>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>
>>>>>>> +               /*
>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>> +                * msec since we last sent one
>>>>>>> +                */
>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>> +               {
>>>>>>> +                       pgstat_send_wal();
>>>>>>> +                       last_report = now;
>>>>>>> +               }
>>>>>>> +
>>>>>>>
>>>>>>> Although I worried that it's better to add the check code in
>>>>>>> pgstat_send_wal(),
>>>>>
>>>>> Agreed.
>>>>>
>>>>>>> I didn't do so because to avoid to double check
>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already
>>>>>>> checks the
>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>
>>>>> I think that we can do that. What about the attached patch?
>>>>
>>>> Thanks, I thought it's better.
>>>>
>>>>
>>>>>> I forgot to remove an unused variable.
>>>>>> The attached v13 patch is fixed.
>>>>>
>>>>> Thanks for updating the patch!
>>>>>
>>>>> +        w.wal_write,
>>>>> +        w.wal_write_time,
>>>>> +        w.wal_sync,
>>>>> +        w.wal_sync_time,
>>>>>
>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>> each other? That is, what about the following order of columns?
>>>>>
>>>>> wal_write
>>>>> wal_sync
>>>>> wal_write_time
>>>>> wal_sync_time
>>>>
>>>> Yes, I fixed it.
>>>>
>>>>> -        case SYNC_METHOD_OPEN:
>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>> -            /* write synced it already */
>>>>> -            break;
>>>>>
>>>>> IMO it's better to add Assert(false) here to ensure that we never
>>>>> reach
>>>>> here, as follows. Thought?
>>>>>
>>>>> +        case SYNC_METHOD_OPEN:
>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>> +            /* not reachable */
>>>>> +            Assert(false);
>>>>
>>>> I agree.
>>>>
>>>>
>>>>> Even when a backend exits, it sends the stats via
>>>>> pgstat_beshutdown_hook().
>>>>> On the other hand, walwriter doesn't do that. Walwriter also should
>>>>> send
>>>>> the stats even at its exit? Otherwise some stats can fail to be
>>>>> collected.
>>>>> But ISTM that this issue existed from before, for example
>>>>> checkpointer
>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill
>>>>> to fix
>>>>> this issue in this patch?
>>>>
>>>> Thanks, I thought it's better to do so.
>>>> I added the shutdown hook for the walwriter and the checkpointer in
>>>> v14-0003 patch.
>>>
>>> Thanks!
>>>
>>> Seems you forgot to include the changes of expected/rules.out in 0001
>>> patch,
>>> and which caused the regression test to fail. Attached is the updated
>>> version
>>> of the patch. I included expected/rules.out in it.
>>
>> Sorry.
>>
>>> +    PgStat_Counter m_wal_write_time;    /* time spend writing wal
>>> records in
>>> +                                         * micro seconds */
>>> +    PgStat_Counter m_wal_sync_time; /* time spend syncing wal
>>> records in micro
>>> +                                     * seconds */
>>>
>>> IMO "spend" should be "spent". Also "micro seconds" should be
>>> "microseconds"
>>> in sake of consistent with other comments in pgstat.h. I fixed them.
>>
>> Thanks.
>>
>>> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug.
>>> Even
>>> when pgstat_send_wal() returned without sending any message,
>>> pgstat_report_wal() saved current pgWalUsage and that counter was
>>> used for
>>> the subsequent calculation of WAL usage. This caused some counters
>>> not to
>>> be sent to the collector. This is a bug that I added. I fixed this
>>> bug.
>>
>> Thanks.
>>
>>
>>> +    walStats.wal_write += msg->m_wal_write;
>>> +    walStats.wal_write_time += msg->m_wal_write_time;
>>> +    walStats.wal_sync += msg->m_wal_sync;
>>> +    walStats.wal_sync_time += msg->m_wal_sync_time;
>>>
>>> I changed the order of the above in pgstat.c so that wal_write_time
>>> and
>>> wal_sync_time are placed in next to each other.
>>
>> I forgot to fix them, thanks.
>>
>>
>>> The followings are the comments for the docs part. I've not updated
>>> this
>>> in the patch yet because I'm not sure how to change them for now.
>>> +       Number of times WAL buffers were written out to disk via
>>> +       <function>XLogWrite</function>, which is invoked during an
>>> +       <function>XLogFlush</function> request (see <xref
>>> linkend="wal-configuration"/>)
>>> +      </para></entry>
>>>
>>> XLogWrite() can be invoked during the functions other than
>>> XLogFlush().
>>> For example, XLogBackgroundFlush(). So the above description might be
>>> confusing?
>>>
>>> +       Number of times WAL files were synced to disk via
>>> +       <function>issue_xlog_fsync</function>, which is invoked
>>> during an
>>> +       <function>XLogFlush</function> request (see <xref
>>> linkend="wal-configuration"/>)
>>>
>>> Same as above.
>>
>> Yes, why don't you remove "XLogFlush" in the above comments
>> because XLogWrite() description is covered in wal.sgml?
>>
>> But, now it's mentioned only for backend,
>> I added the comments for the wal writer in the attached patch.
>>
>>
>>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of
>>> the
>>> +       "sync at commit" options (i.e., <literal>fdatasync</literal>,
>>> +       <literal>fsync</literal>, or
>>> <literal>fsync_writethrough</literal>).
>>>
>>> Even open_sync and open_datasync do the sync at commit. No? I'm not
>>> sure
>>> if "sync at commit" is right term to indicate fdatasync, fsync and
>>> fsync_writethrough.
>>
>> Yes, why don't you change to the following comments?
>>
>> ```
>>        while <xref linkend="guc-wal-sync-method"/> was set to one of
>> the
>>        options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>>        <literal>fsync</literal>, or
>> <literal>fsync_writethrough</literal>)
>> ```
>>
>>> +       <literal>open_sync</literal>. Units are in milliseconds with
>>> microsecond resolution.
>>>
>>> "with microsecond resolution" part is really necessary?
>>
>> I removed it because blk_read_time in pg_stat_database is the same
>> above,
>> but it doesn't mention it.
>>
>>
>>> +   transaction records are flushed to permanent storage.
>>> +   <function>XLogFlush</function> calls
>>> <function>XLogWrite</function> to write
>>> +   and <function>issue_xlog_fsync</function> to flush them, which
>>> are
>>> counted as
>>> +   <literal>wal_write</literal> and <literal>wal_sync</literal> in
>>> +   <xref linkend="pg-stat-wal-view"/>. On systems with high log
>>> output,
>>>
>>> This description might cause users to misread that XLogFlush() calls
>>> issue_xlog_fsync(). Since issue_xlog_fsync() is called by
>>> XLogWrite(),
>>> ISTM that this description needs to be updated.
>>
>> I understood. I fixed to mention that XLogWrite()
>> calls issue_xlog_fsync().
>>
>>
>>> Each line in the above seems to end with a space character.
>>> This space character should be removed.
>>
>> Sorry for that. I removed it.
>
> Thanks for updating the patch! I think it's getting into good shape!
> - pid | wait_event_type | wait_event
> + pid | wait_event_type | wait_event
>
> This change is not necessary?

No, sorry.
I removed it by mistake when I removed the trailing space characters.

> - every <xref linkend="guc-wal-writer-delay"/> milliseconds.
> + every <xref linkend="guc-wal-writer-delay"/> milliseconds, which
> calls
> + <function>XLogWrite</function> to write and
> <function>XLogWrite</function>
> + <function>issue_xlog_fsync</function> to flush them. They are
> counted as
> + <literal>wal_write</literal> and <literal>wal_sync</literal> in
> + <xref linkend="pg-stat-wal-view"/>.
>
> Isn't it better to avoid using the terms like XLogWrite or
> issue_xlog_fsync
> before explaining what they are? They are explained later. At least for
> me
> I'm ok without this change.

OK. I removed them and added a new paragraph.

> - to write (move to kernel cache) a few filled <acronym>WAL</acronym>
> - buffers. This is undesirable because
> <function>XLogInsertRecord</function>
> + to call <function>XLogWrite</function> to write (move to kernel
> cache) a
> + few filled <acronym>WAL</acronym> buffers (the tally of this event
> is reported in
> + <literal>wal_buffers_full</literal> in <xref
> linkend="pg-stat-wal-view"/>).
> + This is undesirable because <function>XLogInsertRecord</function>
>
> This paragraph explains the relationship between WAL writes and WAL
> buffers. I don't think it's good to add different context to this
> paragraph. Instead, what about adding a new paragraph like the following?
>
> ----------------------------------
> When track_wal_io_timing is enabled, the total amounts of time
> XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are
> counted as wal_write_time and wal_sync_time in pg_stat_wal view,
> respectively. XLogWrite is normally called by XLogInsertRecord (when
> there is no space for the new record in WAL buffers), XLogFlush and
> the WAL writer, to write WAL buffers to disk and call
> issue_xlog_fsync. If wal_sync_method is either open_datasync or
> open_sync, a write operation in XLogWrite guarantees to sync written
> WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method
> is either fdatasync, fsync, or fsync_writethrough, the write operation
> moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files
> to disk. Regardless of the setting of track_wal_io_timing, the numbers
> of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk
> are also counted as wal_write and wal_sync in pg_stat_wal,
> respectively.
> ----------------------------------

Thanks, I agree it's better.

> + <function>issue_xlog_fsync</function> (see <xref
> linkend="wal-configuration"/>)
>
> "request" should be placed just before "(see"?

Yes, thanks.

> + Number of times WAL files were synced to disk via
> + <function>issue_xlog_fsync</function> (see <xref
> linkend="wal-configuration"/>)
> + while <xref linkend="guc-wal-sync-method"/> was set to one of
> the
> + options which specific fsync method is called (i.e.,
> <literal>fdatasync</literal>,
> + <literal>fsync</literal>, or
> <literal>fsync_writethrough</literal>).
>
> Isn't it better to mention the case of fsync=off? What about the
> following?
>
> ----------------------------------
> Number of times WAL files were synced to disk via issue_xlog_fsync
> (see ...). This is zero when fsync is off or wal_sync_method is either
> open_datasync or open_sync.
> ----------------------------------

Yes.

> + Total amount of time spent writing WAL buffers were written
> out to disk via
>
> "were written out" is not necessary?

Yes, removed it.

> + Total amount of time spent syncing WAL files to disk via
> + <function>issue_xlog_fsync</function> request (see <xref
> linkend="wal-configuration"/>)
> + while <xref linkend="guc-wal-sync-method"/> was set to one of
> the
> + options which specific fsync method is called (i.e.,
> <literal>fdatasync</literal>,
> + <literal>fsync</literal>, or
> <literal>fsync_writethrough</literal>).
> + Units are in milliseconds.
> + This is zero when <xref linkend="guc-track-wal-io-timing"/> is
> disabled.
>
> Isn't it better to explain the case where this counter is zero a bit
> more clearly as follows?
>
> ---------------------
> This is zero when track_wal_io_timing is disabled, fsync is off, or
> wal_sync_method is either open_datasync or open_sync.
> ---------------------

Yes, thanks.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v16-0001-Add-statistics-related-to-write-sync-wal-records.patch	text/x-diff	20.8 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-08 15:48:00
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/08 19:42, Masahiro Ikeda wrote:
> On 2021-03-08 13:44, Fujii Masao wrote:
>> On 2021/03/05 19:54, Masahiro Ikeda wrote:
>>> On 2021-03-05 12:47, Fujii Masao wrote:
>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>
>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>
>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>
>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>
>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>
>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>
>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>
>>>>>>>>>> OK, I fixed it.
>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>
>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>
>>>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>>>> +    if (enableFsync &&
>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>> +    {
>>>>>>>>>
>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>> to simplify the code more?
>>>>>>>>
>>>>>>>> Thanks for the comments.
>>>>>>>> I added the above code in v12 patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> +        /*
>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>> +         */
>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>
>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>
>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>
>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>
>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>
>>>>>>>>
>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>
>>>>>>>> +               /*
>>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>> +                * msec since we last sent one
>>>>>>>> +                */
>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>> +               {
>>>>>>>> +                       pgstat_send_wal();
>>>>>>>> +                       last_report = now;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>>
>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>
>>>>>> Agreed.
>>>>>>
>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>
>>>>>> I think that we can do that. What about the attached patch?
>>>>>
>>>>> Thanks, I thought it's better.
>>>>>
>>>>>
>>>>>>> I forgot to remove an unused variable.
>>>>>>> The attached v13 patch is fixed.
>>>>>>
>>>>>> Thanks for updating the patch!
>>>>>>
>>>>>> +        w.wal_write,
>>>>>> +        w.wal_write_time,
>>>>>> +        w.wal_sync,
>>>>>> +        w.wal_sync_time,
>>>>>>
>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>> each other? That is, what about the following order of columns?
>>>>>>
>>>>>> wal_write
>>>>>> wal_sync
>>>>>> wal_write_time
>>>>>> wal_sync_time
>>>>>
>>>>> Yes, I fixed it.
>>>>>
>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>> -            /* write synced it already */
>>>>>> -            break;
>>>>>>
>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>> here, as follows. Thought?
>>>>>>
>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>> +            /* not reachable */
>>>>>> +            Assert(false);
>>>>>
>>>>> I agree.
>>>>>
>>>>>
>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>> this issue in this patch?
>>>>>
>>>>> Thanks, I thought it's better to do so.
>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>
>>>> Thanks!
>>>>
>>>> Seems you forgot to include the changes of expected/rules.out in 0001 patch,
>>>> and which caused the regression test to fail. Attached is the updated version
>>>> of the patch. I included expected/rules.out in it.
>>>
>>> Sorry.
>>>
>>>> +    PgStat_Counter m_wal_write_time;    /* time spend writing wal records in
>>>> +                                         * micro seconds */
>>>> +    PgStat_Counter m_wal_sync_time; /* time spend syncing wal records in micro
>>>> +                                     * seconds */
>>>>
>>>> IMO "spend" should be "spent". Also "micro seconds" should be "microseconds"
>>>> in sake of consistent with other comments in pgstat.h. I fixed them.
>>>
>>> Thanks.
>>>
>>>> Regarding pgstat_report_wal() and pgstat_send_wal(), I found one bug. Even
>>>> when pgstat_send_wal() returned without sending any message,
>>>> pgstat_report_wal() saved current pgWalUsage and that counter was used for
>>>> the subsequent calculation of WAL usage. This caused some counters not to
>>>> be sent to the collector. This is a bug that I added. I fixed this bug.
>>>
>>> Thanks.
>>>
>>>
>>>> +    walStats.wal_write += msg->m_wal_write;
>>>> +    walStats.wal_write_time += msg->m_wal_write_time;
>>>> +    walStats.wal_sync += msg->m_wal_sync;
>>>> +    walStats.wal_sync_time += msg->m_wal_sync_time;
>>>>
>>>> I changed the order of the above in pgstat.c so that wal_write_time and
>>>> wal_sync_time are placed in next to each other.
>>>
>>> I forgot to fix them, thanks.
>>>
>>>
>>>> The followings are the comments for the docs part. I've not updated this
>>>> in the patch yet because I'm not sure how to change them for now.
>>>> +       Number of times WAL buffers were written out to disk via
>>>> +       <function>XLogWrite</function>, which is invoked during an
>>>> +       <function>XLogFlush</function> request (see <xref
>>>> linkend="wal-configuration"/>)
>>>> +      </para></entry>
>>>>
>>>> XLogWrite() can be invoked during the functions other than XLogFlush().
>>>> For example, XLogBackgroundFlush(). So the above description might be
>>>> confusing?
>>>>
>>>> +       Number of times WAL files were synced to disk via
>>>> +       <function>issue_xlog_fsync</function>, which is invoked during an
>>>> +       <function>XLogFlush</function> request (see <xref
>>>> linkend="wal-configuration"/>)
>>>>
>>>> Same as above.
>>>
>>> Yes, why don't you remove "XLogFlush" in the above comments
>>> because XLogWrite() description is covered in wal.sgml?
>>>
>>> But, now it's mentioned only for backend,
>>> I added the comments for the wal writer in the attached patch.
>>>
>>>
>>>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of the
>>>> +       "sync at commit" options (i.e., <literal>fdatasync</literal>,
>>>> +       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>>>>
>>>> Even open_sync and open_datasync do the sync at commit. No? I'm not sure
>>>> if "sync at commit" is right term to indicate fdatasync, fsync and
>>>> fsync_writethrough.
>>>
>>> Yes, why don't you change to the following comments?
>>>
>>> ```
>>>         while <xref linkend="guc-wal-sync-method"/> was set to one of the
>>>         options which specific fsync method is called (i.e., <literal>fdatasync</literal>,
>>>         <literal>fsync</literal>, or <literal>fsync_writethrough</literal>)
>>> ```
>>>
>>>> +       <literal>open_sync</literal>. Units are in milliseconds with
>>>> microsecond resolution.
>>>>
>>>> "with microsecond resolution" part is really necessary?
>>>
>>> I removed it because blk_read_time in pg_stat_database is the same above,
>>> but it doesn't mention it.
>>>
>>>
>>>> +   transaction records are flushed to permanent storage.
>>>> +   <function>XLogFlush</function> calls <function>XLogWrite</function> to write
>>>> +   and <function>issue_xlog_fsync</function> to flush them, which are
>>>> counted as
>>>> +   <literal>wal_write</literal> and <literal>wal_sync</literal> in
>>>> +   <xref linkend="pg-stat-wal-view"/>. On systems with high log output,
>>>>
>>>> This description might cause users to misread that XLogFlush() calls
>>>> issue_xlog_fsync(). Since issue_xlog_fsync() is called by XLogWrite(),
>>>> ISTM that this description needs to be updated.
>>>
>>> I understood. I fixed to mention that XLogWrite()
>>> calls issue_xlog_fsync().
>>>
>>>
>>>> Each line in the above seems to end with a space character.
>>>> This space character should be removed.
>>>
>>> Sorry for that. I removed it.
>>
>> Thanks for updating the patch! I think it's getting good shape!
>> - pid | wait_event_type | wait_event
>> + pid | wait_event_type | wait_event
>>
>> This change is not necessary?
>
> No, sorry.
> I removed it by mistake when I remove trailing space characters.
>
>
>> -   every <xref linkend="guc-wal-writer-delay"/> milliseconds.
>> +   every <xref linkend="guc-wal-writer-delay"/> milliseconds, which calls
>> +   <function>XLogWrite</function> to write and <function>XLogWrite</function>
>> +   <function>issue_xlog_fsync</function> to flush them. They are counted as
>> +   <literal>wal_write</literal> and <literal>wal_sync</literal> in
>> +   <xref linkend="pg-stat-wal-view"/>.
>>
>> Isn't it better to avoid using the terms like XLogWrite or issue_xlog_fsync
>> before explaining what they are? They are explained later. At least for me
>> I'm ok without this change.
>
> OK. I removed them and add a new paragraph.
>
>
>> -   to write (move to kernel cache) a few filled <acronym>WAL</acronym>
>> -   buffers. This is undesirable because <function>XLogInsertRecord</function>
>> +   to call <function>XLogWrite</function> to write (move to kernel cache) a
>> +   few filled <acronym>WAL</acronym> buffers (the tally of this event
>> is reported in
>> +   <literal>wal_buffers_full</literal> in <xref linkend="pg-stat-wal-view"/>).
>> +   This is undesirable because <function>XLogInsertRecord</function>
>>
>> This paragraph explains the relationship between WAL writes and WAL
>> buffers. I don't think it's good to add different context to this
>> paragraph. Instead, what about adding a new paragraph like the following?
>>
>> ----------------------------------
>> When track_wal_io_timing is enabled, the total amounts of time
>> XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are
>> counted as wal_write_time and wal_sync_time in pg_stat_wal view,
>> respectively. XLogWrite is normally called by XLogInsertRecord (when
>> there is no space for the new record in WAL buffers), XLogFlush and
>> the WAL writer, to write WAL buffers to disk and call
>> issue_xlog_fsync. If wal_sync_method is either open_datasync or
>> open_sync, a write operation in XLogWrite guarantees to sync written
>> WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method
>> is either fdatasync, fsync, or fsync_writethrough, the write operation
>> moves WAL buffer to kernel cache and issue_xlog_fsync syncs WAL files
>> to disk. Regardless of the setting of track_wal_io_timing, the numbers
>> of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk
>> are also counted as wal_write and wal_sync in pg_stat_wal,
>> respectively.
>> ----------------------------------
>
> Thanks, I agree it's better.
>
>
>> +       <function>issue_xlog_fsync</function> (see <xref
>> linkend="wal-configuration"/>)
>>
>> "request" should be place just before "(see"?
>
> Yes, thanks.
>
>
>
>> +       Number of times WAL files were synced to disk via
>> +       <function>issue_xlog_fsync</function> (see <xref
>> linkend="wal-configuration"/>)
>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of the
>> +       options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>> +       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>>
>> Isn't it better to mention the case of fsync=off? What about the following?
>>
>> ----------------------------------
>> Number of times WAL files were synced to disk via issue_xlog_fsync
>> (see ...). This is zero when fsync is off or wal_sync_method is either
>> open_datasync or open_sync.
>> ----------------------------------
>
> Yes.
>
>
>> +       Total amount of time spent writing WAL buffers were written
>> out to disk via
>>
>> "were written out" is not necessary?
>
> Yes, removed it.
>
>> +       Total amount of time spent syncing WAL files to disk via
>> +       <function>issue_xlog_fsync</function> request (see <xref
>> linkend="wal-configuration"/>)
>> +       while <xref linkend="guc-wal-sync-method"/> was set to one of the
>> +       options which specific fsync method is called (i.e.,
>> <literal>fdatasync</literal>,
>> +       <literal>fsync</literal>, or <literal>fsync_writethrough</literal>).
>> +       Units are in milliseconds.
>> +       This is zero when <xref linkend="guc-track-wal-io-timing"/> is disabled.
>>
>> Isn't it better to explain the case where this counter is zero a bit
>> more clearly as follows?
>>
>> ---------------------
>> This is zero when track_wal_io_timing is disabled, fsync is off, or
>> wal_sync_method is either open_datasync or open_sync.
>> ---------------------
>
> Yes, thanks.

Thanks for updating the patch! I applied cosmetic changes to that.
Patch attached. Barring any objection, I will commit this version.
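
For readers following the docs discussion, the zero cases for wal_sync and
wal_sync_time can be sketched roughly as below. This is only a minimal
illustration under stated assumptions, not the code in the attached patch:
every name prefixed with sketch_ or Sketch is hypothetical, the elapsed time
is passed in by the caller instead of being measured, and no real sync call
is made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the wal_sync_method settings. */
typedef enum
{
    SKETCH_SYNC_FSYNC,
    SKETCH_SYNC_FDATASYNC,
    SKETCH_SYNC_FSYNC_WRITETHROUGH,
    SKETCH_SYNC_OPEN,           /* open_sync */
    SKETCH_SYNC_OPEN_DSYNC      /* open_datasync */
} SketchSyncMethod;

/* Hypothetical counters mirroring wal_sync and wal_sync_time. */
typedef struct
{
    int64_t wal_sync;           /* number of explicit syncs */
    int64_t wal_sync_time;      /* microseconds spent syncing */
} SketchWalStats;

/* True only when an explicit sync call would be issued. */
static bool
sketch_needs_explicit_sync(bool enable_fsync, SketchSyncMethod method)
{
    if (!enable_fsync)
        return false;           /* fsync = off: nothing is synced */
    if (method == SKETCH_SYNC_OPEN || method == SKETCH_SYNC_OPEN_DSYNC)
        return false;           /* the write itself already synced the data */
    return true;
}

/* Count one sync, and its duration only when timing is tracked. */
static void
sketch_issue_sync(SketchWalStats *stats, bool enable_fsync,
                  SketchSyncMethod method, bool track_timing,
                  int64_t elapsed_usec)
{
    assert(elapsed_usec >= 0);

    /* Return immediately in the cases documented as "zero". */
    if (!sketch_needs_explicit_sync(enable_fsync, method))
        return;

    /* A real implementation would call fsync() or fdatasync() here. */
    stats->wal_sync++;
    if (track_timing)
        stats->wal_sync_time += elapsed_usec;
}
```

With this shape, fsync=off and the open_sync/open_datasync methods leave both
counters untouched, and the time counter only advances when timing is tracked.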

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment	Content-Type	Size
v16-0001-Add-statistics-related-to-write-sync-wal-records_fujii.patch	text/plain	21.7 KB

From:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-08 19:47:31
Message-ID:	CAKFQuwYjiohvS9C5Uiq+jn7eQPFtqEpPLa7-M2HQZq7dh3K+PA@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Mar 8, 2021 at 8:48 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
wrote:

>
> Thanks for updating the patch! I applied cosmetic changes to that.
> Patch attached. Barring any objection, I will commit this version.
>

Read over the patch and it looks good.

One minor "the" omission (in a couple of places, copy-paste style):

+ See <xref linkend="wal-configuration"/> for more information about
+ internal WAL function <function>XLogWrite</function>.

"about *the* internal WAL function"

Also, I'm not sure why you find omitting documentation that the millisecond
field has a fractional part out to microseconds to be helpful.

David J.

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-09 08:02:40
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/09 4:47, David G. Johnston wrote:
> On Mon, Mar 8, 2021 at 8:48 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
>
> Thanks for updating the patch! I applied cosmetic changes to that.
> Patch attached. Barring any objection, I will commit this version.
>
>
> Read over the patch and it looks good.

Thanks for the review! I committed the patch.

>
> One minor "the" omission (in a couple of places, copy-paste style):
>
> + See <xref linkend="wal-configuration"/> for more information about
> + internal WAL function <function>XLogWrite</function>.
>
> "about *the* internal WAL function"

I added "the" in such two places. Thanks!

>
> Also, I'm not sure why you find omitting documentation that the millisecond field has a fractional part out to microseconds to be helpful.

If this information should be documented, shouldn't we do that not only
for wal_write_time and wal_sync_time but also for several other columns,
for example, pg_stat_database.blk_write_time?
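
To make concrete what "a fractional part out to microseconds" means, here is
a minimal sketch. It assumes the counters are accumulated in microseconds (as
the m_wal_write_time comment quoted upthread says) and converted to
milliseconds only when displayed; the function name is hypothetical and this
is not taken from the committed code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an accumulated microsecond counter to the millisecond value
 * shown to the user. The result keeps up to three fractional digits
 * of real information, i.e. microsecond resolution.
 */
static double
sketch_usec_to_msec(int64_t usec)
{
    assert(usec >= 0);
    return (double) usec / 1000.0;
}
```

For example, a counter of 1500 microseconds would be displayed as 1.5
milliseconds under this assumption.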

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-09 08:51:29
Message-ID:	[email protected]
Lists:	pgsql-hackers

Thanks for 0003 patch!

Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
just send the stats only when ShutdownRequestPending is true in the walwriter
main loop (maybe just before calling HandleMainLoopInterrupts()).
If we do this, we cannot send the stats when walwriter throws FATAL error.
But that's ok because FATAL error on walwriter causes the server to crash.
Thought?

Also ISTM that we don't need to use the callback for that purpose in
checkpointer because of the same reason. That is, we can send the stats
just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
Thought?

I'm now not sure how useful these changes are. As far as I read pgstat.c,
when shutdown is requested, the stats collector seems to exit even when
there are outstanding stats messages. So if checkpointer and walwriter send
the stats in their last cycles, those stats might not be collected.

On the other hand, I think that sending the stats in the last cycles would
improve the situation a bit compared to now. So I'm inclined to apply those
changes...

Of course, there is another direction: we could improve the stats collector
so that it is guaranteed to collect all the sent stats messages. But I'm
afraid this change might be big.
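
To illustrate the "send the stats in the main loop when shutdown is pending"
idea, here is a minimal sketch. It is a toy model and not walwriter code:
all names are hypothetical, the shutdown request is simulated by a cycle
number instead of a signal, and the periodic report every three cycles is an
arbitrary choice for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical local counters of a background process. */
typedef struct
{
    int pending;    /* accumulated locally, not yet sent */
    int sent;       /* already handed to the stats collector */
} SketchStats;

/* Stand-in for sending a stats message: flush pending into sent. */
static void
sketch_send_stats(SketchStats *st)
{
    st->sent += st->pending;
    st->pending = 0;
}

/*
 * Run at most max_cycles iterations. A shutdown request arrives at cycle
 * shutdown_at (use -1 for "never"). Returns the number of completed cycles.
 */
static int
sketch_main_loop(SketchStats *st, int max_cycles, int shutdown_at)
{
    bool shutdown_requested = false;
    int  cycle;

    assert(max_cycles >= 0);

    for (cycle = 0; cycle < max_cycles; cycle++)
    {
        if (cycle == shutdown_at)
            shutdown_requested = true;  /* stands in for a signal handler */

        /* Final report just before the interrupt handling that would exit. */
        if (shutdown_requested)
        {
            sketch_send_stats(st);
            return cycle;               /* stands in for process exit */
        }

        st->pending++;                  /* stands in for one round of work */

        /* Periodic report; every third cycle in this toy model. */
        if ((cycle + 1) % 3 == 0)
            sketch_send_stats(st);
    }
    return cycle;
}
```

In this model a shutdown at cycle 5 leaves nothing pending, whereas a loop
that simply stops between periodic reports can still hold unsent counters.
Whether the collector actually receives the final message is a separate
question, as noted above.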

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-10 05:11:49
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-09 17:51, Fujii Masao wrote:
> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>> On 2021-03-05 01:02, Fujii Masao wrote:
>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>
>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is
>>>>>>>>>> also
>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>
>>>>>>>>>> ("which normally called" should be "which is normally called"
>>>>>>>>>> or
>>>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>>>> original)
>>>>>>>>>> You missed the adding the space before an opening parenthesis
>>>>>>>>>> here and
>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>
>>>>>>>>>> is ether -> is either
>>>>>>>>>> "This parameter is off by default as it will repeatedly query
>>>>>>>>>> the
>>>>>>>>>> operating system..."
>>>>>>>>>> ", because" -> "as"
>>>>>>>>>
>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>
>>>>>>>>>> wal_write_time and the sync items also need the note: "This is
>>>>>>>>>> also
>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>
>>>>>>>>> I skipped changing it since I separated the stats for the WAL
>>>>>>>>> receiver
>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>
>>>>>>>>>> "The number of times it happened..." -> " (the tally of this
>>>>>>>>>> event is
>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable
>>>>>>>>>> because ..."
>>>>>>>>>
>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>
>>>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>>>> explicitly
>>>>>>>>>> computing the sync statistics but does require computing the
>>>>>>>>>> write
>>>>>>>>>> statistics. This is because of the presence of
>>>>>>>>>> issue_xlog_fsync but
>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I
>>>>>>>>>> observe that
>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the
>>>>>>>>>> WAL
>>>>>>>>>> receiver path does not. It seems technically straight-forward
>>>>>>>>>> to
>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two
>>>>>>>>>> places,
>>>>>>>>>> though I suspect there may be a trade-off for not adding
>>>>>>>>>> another
>>>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>>>> processing
>>>>>>>>>> (though that seems marginalized compared to the cost of
>>>>>>>>>> actually
>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and
>>>>>>>>>> don't have
>>>>>>>>>> any shared code between the two but instead implement the WAL
>>>>>>>>>> receiver
>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>
>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>
>>>>>>>> Thanks for updating the patches!
>>>>>>>>
>>>>>>>>
>>>>>>>>> I added the infrastructure code to communicate the WAL receiver
>>>>>>>>> stats messages between the WAL receiver and the stats
>>>>>>>>> collector, and
>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> On second thought, this idea seems not good. Because those stats
>>>>>>>> are
>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process
>>>>>>>> running
>>>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>>>> dynamic
>>>>>>>> stats and the others show collected stats, even though they are
>>>>>>>> in
>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>
>>>>>>> OK, I fixed it.
>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal
>>>>>>> view in v11 patch.
>>>>>>
>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>
>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>> +    if (enableFsync &&
>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>> +    {
>>>>>>
>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>> to simplify the code more?
>>>>>
>>>>> Thanks for the comments.
>>>>> I added the above code in v12 patch.
>>>>>
>>>>>>
>>>>>> +        /*
>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed
>>>>>> to minimize
>>>>>> +         * the overhead in WAL-writing.
>>>>>> +         */
>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>> +            pgstat_send_wal();
>>>>>>
>>>>>> On second thought, this change means that it always takes
>>>>>> wal_writer_delay
>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush()
>>>>>> is called.
>>>>>> For example, if wal_writer_delay is set to several seconds, some
>>>>>> values in
>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those
>>>>>> seconds.
>>>>>> So I'm thinking to withdraw my previous comment and it's ok to
>>>>>> send
>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>
>>>>> Thanks, I didn't notice that.
>>>>>
>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>> default value is 200msec and it may be set shorter time.
>>>
>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>
>>>>>
>>>>> Why don't to make another way to check the timestamp?
>>>>>
>>>>> +               /*
>>>>> +                * Don't send a message unless it's been at least
>>>>> PGSTAT_STAT_INTERVAL
>>>>> +                * msec since we last sent one
>>>>> +                */
>>>>> +               now = GetCurrentTimestamp();
>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>> PGSTAT_STAT_INTERVAL))
>>>>> +               {
>>>>> +                       pgstat_send_wal();
>>>>> +                       last_report = now;
>>>>> +               }
>>>>> +
>>>>>
>>>>> Although I worried that it's better to add the check code in
>>>>> pgstat_send_wal(),
>>>
>>> Agreed.
>>>
>>>>> I didn't do so because to avoid to double check
>>>>> PGSTAT_STAT_INTERVAL.
>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks
>>>>> the
>>>>> PGSTAT_STAT_INTERVAL.
>>>
>>> I think that we can do that. What about the attached patch?
>>
>> Thanks, I thought it's better.
>>
>>
>>>> I forgot to remove an unused variable.
>>>> The attached v13 patch is fixed.
>>>
>>> Thanks for updating the patch!
>>>
>>> +        w.wal_write,
>>> +        w.wal_write_time,
>>> +        w.wal_sync,
>>> +        w.wal_sync_time,
>>>
>>> It's more natural to put wal_write_time and wal_sync_time next to
>>> each other? That is, what about the following order of columns?
>>>
>>> wal_write
>>> wal_sync
>>> wal_write_time
>>> wal_sync_time
>>
>> Yes, I fixed it.
>>
>>> -        case SYNC_METHOD_OPEN:
>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>> -            /* write synced it already */
>>> -            break;
>>>
>>> IMO it's better to add Assert(false) here to ensure that we never
>>> reach
>>> here, as follows. Thought?
>>>
>>> +        case SYNC_METHOD_OPEN:
>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>> +            /* not reachable */
>>> +            Assert(false);
>>
>> I agree.
>>
>>
>>> Even when a backend exits, it sends the stats via
>>> pgstat_beshutdown_hook().
>>> On the other hand, walwriter doesn't do that. Walwriter also should
>>> send
>>> the stats even at its exit? Otherwise some stats can fail to be
>>> collected.
>>> But ISTM that this issue existed from before, for example
>>> checkpointer
>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to
>>> fix
>>> this issue in this patch?
>>
>> Thanks, I thought it's better to do so.
>> I added the shutdown hook for the walwriter and the checkpointer in
>> v14-0003 patch.
>
> Thanks for 0003 patch!
>
> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO
> we can
> just send the stats only when ShutdownRequestPending is true in the
> walwriter
> main loop (maybe just before calling HandleMainLoopInterrupts()).
> If we do this, we cannot send the stats when walwriter throws FATAL
> error.
> But that's ok because FATAL error on walwriter causes the server to
> crash.
> Thought?

Thanks for your comments!
Yes, I agree.

> Also ISTM that we don't need to use the callback for that purpose in
> checkpointer because of the same reason. That is, we can send the stats
> just after calling ShutdownXLOG(0, 0) in
> HandleCheckpointerInterrupts().
> Thought?

Yes, I think so too.

Since ShutdownXLOG() may create a restartpoint or a checkpoint,
it might generate WAL records.

> I'm now not sure how much useful these changes are. As far as I read
> pgstat.c,
> when shutdown is requested, the stats collector seems to exit even when
> there are outstanding stats messages. So if checkpointer and walwriter
> send
> the stats in their last cycles, those stats might not be collected.
>
> On the other hand, I can think that sending the stats in the last
> cycles would
> improve the situation a bit than now. So I'm inclined to apply those
> changes...

I didn't notice that. I agree this is an important aspect.
I understand that the stats collector can exit before the checkpointer
or the walwriter does, so some stats might not be collected.

> Of course, there is another direction; we can improve the stats
> collector so
> that it guarantees to collect all the sent stats messages. But I'm
> afraid
> this change might be big.

For example, by managing the background process status in shared memory so
that the stats collector keeps collecting stats until the other background
processes exit?

In my understanding, the statistics are not required to be highly accurate,
so it's OK to ignore them if the impact is not big.

If we were to guarantee high accuracy, other background processes like the
autovacuum launcher would also have to send the WAL stats, because the
launcher accesses the system catalog and might generate WAL records due to
HOT update, even though the possibility is low.

I thought the impact is small because the period during which uncollected
stats are generated is short compared to the time since startup. So, it's OK
to ignore the remaining stats when the process exits.

BTW, I found that BgWriterStats.m_timed_checkpoints is not counted in
ShutdownXLOG(), and we need to count it if we want to collect the stats
before the checkpointer exits.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-10 08:08:51
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/10 14:11, Masahiro Ikeda wrote:
> On 2021-03-09 17:51, Fujii Masao wrote:
>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>
>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>
>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>
>>>>>>>>>>> is ether -> is either
>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>> operating system..."
>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>
>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>
>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>
>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>
>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>
>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>
>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>
>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>
>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>
>>>>>>>> OK, I fixed it.
>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>
>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>
>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>> +    if (enableFsync &&
>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>> +    {
>>>>>>>
>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>> to simplify the code more?
>>>>>>
>>>>>> Thanks for the comments.
>>>>>> I added the above code in v12 patch.
>>>>>>
>>>>>>>
>>>>>>> +        /*
>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>> +         * the overhead in WAL-writing.
>>>>>>> +         */
>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>> +            pgstat_send_wal();
>>>>>>>
>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>
>>>>>> Thanks, I didn't notice that.
>>>>>>
>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>> default value is 200msec and it may be set shorter time.
>>>>
>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>
>>>>>>
>>>>>> Why don't to make another way to check the timestamp?
>>>>>>
>>>>>> +               /*
>>>>>> +                * Don't send a message unless it's been at least
>>>>>> PGSTAT_STAT_INTERVAL
>>>>>> +                * msec since we last sent one
>>>>>> +                */
>>>>>> +               now = GetCurrentTimestamp();
>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>> +               {
>>>>>> +                       pgstat_send_wal();
>>>>>> +                       last_report = now;
>>>>>> +               }
>>>>>> +
>>>>>>
>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>
>>>> Agreed.
>>>>
>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>> PGSTAT_STAT_INTERVAL.
>>>>
>>>> I think that we can do that. What about the attached patch?
>>>
>>> Thanks, I thought it's better.
>>>
>>>
>>>>> I forgot to remove an unused variable.
>>>>> The attached v13 patch is fixed.
>>>>
>>>> Thanks for updating the patch!
>>>>
>>>> +        w.wal_write,
>>>> +        w.wal_write_time,
>>>> +        w.wal_sync,
>>>> +        w.wal_sync_time,
>>>>
>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>> each other? That is, what about the following order of columns?
>>>>
>>>> wal_write
>>>> wal_sync
>>>> wal_write_time
>>>> wal_sync_time
>>>
>>> Yes, I fixed it.
>>>
>>>> -        case SYNC_METHOD_OPEN:
>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>> -            /* write synced it already */
>>>> -            break;
>>>>
>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>> here, as follows. Thought?
>>>>
>>>> +        case SYNC_METHOD_OPEN:
>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>> +            /* not reachable */
>>>> +            Assert(false);
>>>
>>> I agree.
>>>
>>>
>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>> But ISTM that this issue existed from before, for example checkpointer
>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>> this issue in this patch?
>>>
>>> Thanks, I thought it's better to do so.
>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>
>> Thanks for 0003 patch!
>>
>> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
>> just send the stats only when ShutdownRequestPending is true in the walwriter
>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>> If we do this, we cannot send the stats when walwriter throws FATAL error.
>> But that's ok because FATAL error on walwriter causes the server to crash.
>> Thought?
>
> Thanks for your comments!
> Yes, I agree.
>
>
>> Also ISTM that we don't need to use the callback for that purpose in
>> checkpointer because of the same reason. That is, we can send the stats
>> just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
>> Thought?
>
> Yes, I think so too.
>
> Since ShutdownXLOG() may create restartpoint or checkpoint,
> it might generate WAL records.
>
>
>> I'm now not sure how much useful these changes are. As far as I read pgstat.c,
>> when shutdown is requested, the stats collector seems to exit even when
>> there are outstanding stats messages. So if checkpointer and walwriter send
>> the stats in their last cycles, those stats might not be collected.
>>
>> On the other hand, I can think that sending the stats in the last cycles would
>> improve the situation a bit than now. So I'm inclined to apply those changes...
>
> I didn't notice that. I agree this is an important aspect.
> I understood there is a case that the stats collector exits before the checkpointer
> or the walwriter exits and some stats might not be collected.

IIUC the stats collector basically exits after the checkpointer and
walwriter exit. But there seems to be no guarantee that the stats
collector processes all the messages that other processes have sent
during the shutdown of the server.

>
>
>> Of course, there is another direction; we can improve the stats collector so
>> that it guarantees to collect all the sent stats messages. But I'm afraid
>> this change might be big.
>
> For example, implement to manage background process status in shared memory and
> the stats collector collects the stats until another background process exits?
>
> In my understanding, the statistics are not required high accuracy,
> it's ok to ignore them if the impact is not big.
>
> If we guarantee high accuracy, another background process like autovacuum launcher
> must send the WAL stats because it accesses the system catalog and might generate
> WAL records due to HOT update even though the possibility is low.
>
> I thought the impact is small because the time uncollected stats are generated is
> short compared to the time from startup. So, it's ok to ignore the remaining stats
> when the process exits.

I agree that it's not worth changing lots of code to collect such stats.
But if we can implement that very simply, isn't that better than the
current situation, since we may be able to collect more accurate stats?

> BTW, I found BgWriterStats.m_timed_checkpoints is not counted in ShutdownXLOG()
> and we need to count it if we are to collect stats before it exits.

Maybe m_requested_checkpoints should be incremented in that case?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-11 00:38:43
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-10 17:08, Fujii Masao wrote:
> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>> On 2021-03-09 17:51, Fujii Masao wrote:
>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>
>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during
>>>>>>>>>>>> an
>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is
>>>>>>>>>>>> also
>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>
>>>>>>>>>>>> ("which normally called" should be "which is normally
>>>>>>>>>>>> called" or
>>>>>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>>>>>> original)
>>>>>>>>>>>> You missed the adding the space before an opening
>>>>>>>>>>>> parenthesis here and
>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>
>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>> "This parameter is off by default as it will repeatedly
>>>>>>>>>>>> query the
>>>>>>>>>>>> operating system..."
>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>
>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This
>>>>>>>>>>>> is also
>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>
>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL
>>>>>>>>>>> receiver
>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>
>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this
>>>>>>>>>>>> event is
>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable
>>>>>>>>>>>> because ..."
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>
>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>>>>>> explicitly
>>>>>>>>>>>> computing the sync statistics but does require computing the
>>>>>>>>>>>> write
>>>>>>>>>>>> statistics. This is because of the presence of
>>>>>>>>>>>> issue_xlog_fsync but
>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I
>>>>>>>>>>>> observe that
>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while
>>>>>>>>>>>> the WAL
>>>>>>>>>>>> receiver path does not. It seems technically
>>>>>>>>>>>> straight-forward to
>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the
>>>>>>>>>>>> two places,
>>>>>>>>>>>> though I suspect there may be a trade-off for not adding
>>>>>>>>>>>> another
>>>>>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>>>>>> processing
>>>>>>>>>>>> (though that seems marginalized compared to the cost of
>>>>>>>>>>>> actually
>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and
>>>>>>>>>>>> don't have
>>>>>>>>>>>> any shared code between the two but instead implement the
>>>>>>>>>>>> WAL receiver
>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case,
>>>>>>>>>>>> this
>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>
>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver
>>>>>>>>>>> stats.
>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>
>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I added the infrastructure code to communicate the WAL
>>>>>>>>>>> receiver stats messages between the WAL receiver and the
>>>>>>>>>>> stats collector, and
>>>>>>>>>>> the stats for WAL receiver is counted in
>>>>>>>>>>> pg_stat_wal_receiver.
>>>>>>>>>>> What do you think?
>>>>>>>>>>
>>>>>>>>>> On second thought, this idea seems not good. Because those
>>>>>>>>>> stats are
>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver
>>>>>>>>>> process running
>>>>>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>>>>>> dynamic
>>>>>>>>>> stats and the others show collected stats, even though they
>>>>>>>>>> are in
>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>
>>>>>>>>> OK, I fixed it.
>>>>>>>>> The stats collected in the WAL receiver is exposed in
>>>>>>>>> pg_stat_wal view in v11 patch.
>>>>>>>>
>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>
>>>>>>>> +    /* Check whether the WAL file was synced to disk right now
>>>>>>>> */
>>>>>>>> +    if (enableFsync &&
>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>> +    {
>>>>>>>>
>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>> if enableFsync is off, sync_method is open_sync or
>>>>>>>> open_data_sync,
>>>>>>>> to simplify the code more?
>>>>>>>
>>>>>>> Thanks for the comments.
>>>>>>> I added the above code in v12 patch.
>>>>>>>
>>>>>>>>
>>>>>>>> +        /*
>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has
>>>>>>>> elapsed to minimize
>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>> +         */
>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>> +            pgstat_send_wal();
>>>>>>>>
>>>>>>>> On second thought, this change means that it always takes
>>>>>>>> wal_writer_delay
>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush()
>>>>>>>> is called.
>>>>>>>> For example, if wal_writer_delay is set to several seconds, some
>>>>>>>> values in
>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those
>>>>>>>> seconds.
>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to
>>>>>>>> send
>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>
>>>>>>> Thanks, I didn't notice that.
>>>>>>>
>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>
>>>>> Yeah, if wal_writer_delay is set to very small value, there is a
>>>>> risk
>>>>> that the WAL stats are sent too frequently. I agree that's a
>>>>> problem.
>>>>>
>>>>>>>
>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>
>>>>>>> +               /*
>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>> +                * msec since we last sent one
>>>>>>> +                */
>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>> +               {
>>>>>>> +                       pgstat_send_wal();
>>>>>>> +                       last_report = now;
>>>>>>> +               }
>>>>>>> +
>>>>>>>
>>>>>>> Although I worried that it's better to add the check code in
>>>>>>> pgstat_send_wal(),
>>>>>
>>>>> Agreed.
>>>>>
>>>>>>> I didn't do so because to avoid to double check
>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already
>>>>>>> checks the
>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>
>>>>> I think that we can do that. What about the attached patch?
>>>>
>>>> Thanks, I thought it's better.
>>>>
>>>>
>>>>>> I forgot to remove an unused variable.
>>>>>> The attached v13 patch is fixed.
>>>>>
>>>>> Thanks for updating the patch!
>>>>>
>>>>> +        w.wal_write,
>>>>> +        w.wal_write_time,
>>>>> +        w.wal_sync,
>>>>> +        w.wal_sync_time,
>>>>>
>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>> each other? That is, what about the following order of columns?
>>>>>
>>>>> wal_write
>>>>> wal_sync
>>>>> wal_write_time
>>>>> wal_sync_time
>>>>
>>>> Yes, I fixed it.
>>>>
>>>>> -        case SYNC_METHOD_OPEN:
>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>> -            /* write synced it already */
>>>>> -            break;
>>>>>
>>>>> IMO it's better to add Assert(false) here to ensure that we never
>>>>> reach
>>>>> here, as follows. Thought?
>>>>>
>>>>> +        case SYNC_METHOD_OPEN:
>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>> +            /* not reachable */
>>>>> +            Assert(false);
>>>>
>>>> I agree.
>>>>
>>>>
>>>>> Even when a backend exits, it sends the stats via
>>>>> pgstat_beshutdown_hook().
>>>>> On the other hand, walwriter doesn't do that. Walwriter also should
>>>>> send
>>>>> the stats even at its exit? Otherwise some stats can fail to be
>>>>> collected.
>>>>> But ISTM that this issue existed from before, for example
>>>>> checkpointer
>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill
>>>>> to fix
>>>>> this issue in this patch?
>>>>
>>>> Thanks, I thought it's better to do so.
>>>> I added the shutdown hook for the walwriter and the checkpointer in
>>>> v14-0003 patch.
>>>
>>> Thanks for 0003 patch!
>>>
>>> Isn't it overkill to send the stats in the walwriter-exit-callback?
>>> IMO we can
>>> just send the stats only when ShutdownRequestPending is true in the
>>> walwriter
>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>> If we do this, we cannot send the stats when walwriter throws FATAL
>>> error.
>>> But that's ok because FATAL error on walwriter causes the server to
>>> crash.
>>> Thought?
>>
>> Thanks for your comments!
>> Yes, I agree.
>>
>>
>>> Also ISTM that we don't need to use the callback for that purpose in
>>> checkpointer because of the same reason. That is, we can send the
>>> stats
>>> just after calling ShutdownXLOG(0, 0) in
>>> HandleCheckpointerInterrupts().
>>> Thought?
>>
>> Yes, I think so too.
>>
>> Since ShutdownXLOG() may create restartpoint or checkpoint,
>> it might generate WAL records.
>>
>>
>>> I'm now not sure how much useful these changes are. As far as I read
>>> pgstat.c,
>>> when shutdown is requested, the stats collector seems to exit even
>>> when
>>> there are outstanding stats messages. So if checkpointer and
>>> walwriter send
>>> the stats in their last cycles, those stats might not be collected.
>>>
>>> On the other hand, I can think that sending the stats in the last
>>> cycles would
>>> improve the situation a bit than now. So I'm inclined to apply those
>>> changes...
>>
>> I didn't notice that. I agree this is an important aspect.
>> I understood there is a case that the stats collector exits before the
>> checkpointer
>> or the walwriter exits and some stats might not be collected.
>
> IIUC the stats collector basically exits after checkpointer and
> walwriter exit.
> But there seems no guarantee that the stats collector processes
> all the messages that other processes have sent during the shutdown of
> the server.

Thanks, I now understand the postmaster behavior described above.

PMState manages the status, and after the checkpointer has exited, the
postmaster sends a SIGQUIT signal to the stats collector if the shutdown
mode is smart or fast.
(IIUC, although the postmaster kills the walsender, the archiver and
the stats collector at the same time, it's ok because the walsender
and the archiver don't send stats to the stats collector now.)

But there might be a corner case in which stats sent by background
workers like the checkpointer before they exit are lost (although
sending them is not implemented yet).

For example,

1. The checkpointer sends the stats before it exits.
2. The stats collector receives the signal and breaks out of its loop
before processing the stats message from the checkpointer. In this
case, the message from step 1 is lost.
3. The stats collector writes the stats to the stats files and exits.

Why don't we recheck that there are no more incoming messages just
before step 2?
(v17-0004-guarantee-to-collect-last-stats-messages.patch)
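
The idea of rechecking for pending messages before the collector exits
can be sketched roughly as follows. This is a standalone simulation under
stated assumptions, not the code in v17-0004: the in-memory queue,
queue_push(), queue_pop() and collector_shutdown() are hypothetical
stand-ins for the collector's socket and its shutdown path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for the collector's message socket. */
#define QUEUE_CAP 16

typedef struct MsgQueue
{
    int         msgs[QUEUE_CAP];
    size_t      head;           /* index of the next message to read */
    size_t      tail;           /* index of the next free slot */
} MsgQueue;

/* Enqueue one message; returns false if the queue is full. */
static bool
queue_push(MsgQueue *q, int msg)
{
    if (q->tail - q->head >= QUEUE_CAP)
        return false;
    q->msgs[q->tail % QUEUE_CAP] = msg;
    q->tail++;
    return true;
}

/* Dequeue one message; returns false if nothing is pending. */
static bool
queue_pop(MsgQueue *q, int *msg)
{
    if (q->head == q->tail)
        return false;
    *msg = q->msgs[q->head % QUEUE_CAP];
    q->head++;
    return true;
}

/*
 * Simulated shutdown path of the collector, called once the shutdown
 * signal has been seen.  With drain_on_shutdown set, it keeps reading
 * until no message is pending; otherwise it exits immediately and any
 * queued messages are lost.  Returns the number of messages processed.
 */
static int
collector_shutdown(MsgQueue *q, bool drain_on_shutdown)
{
    int         processed = 0;
    int         msg;

    assert(q != NULL);

    if (drain_on_shutdown)
    {
        while (queue_pop(q, &msg))
            processed++;
    }
    return processed;
}
```

Without the drain, a message queued by the checkpointer just before the
signal arrives is dropped; with it, the same message is still counted.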

I measured the timing of the above on my Linux laptop using
v17-measure-timing.patch.
I don't have a strong opinion on handling this case, since the result
shows that receiving and processing the messages takes very little time
(less than 1 msec), while the stats collector receives the shutdown
signal 5 msec (099->104) after the checkpointer process exits.

```
1615421204.556 [checkpointer] DEBUG: received shutdown request signal
1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make # exit and send the messages
1615421208.099 [stats collector] DEBUG: process BGWRITER stats message # receive and process the messages
1615421208.099 [stats collector] DEBUG: process WAL stats message
1615421208.104 [postmaster] DEBUG: reaping dead processes
1615421208.104 [stats collector] DEBUG: received shutdown request signal # receive shutdown request from the postmaster
```

>>> Of course, there is another direction; we can improve the stats
>>> collector so
>>> that it guarantees to collect all the sent stats messages. But I'm
>>> afraid
>>> this change might be big.
>>
>> For example, implement to manage background process status in shared
>> memory and
>> the stats collector collects the stats until another background
>> process exits?
>>
>> In my understanding, the statistics are not required high accuracy,
>> it's ok to ignore them if the impact is not big.
>>
>> If we guarantee high accuracy, another background process like
>> autovacuum launcher
>> must send the WAL stats because it accesses the system catalog and
>> might generate
>> WAL records due to HOT update even though the possibility is low.
>>
>> I thought the impact is small because the time uncollected stats are
>> generated is
>> short compared to the time from startup. So, it's ok to ignore the
>> remaining stats
>> when the process exits.
>
> I agree that it's not worth changing lots of code to collect such
> stats.
> But if we can implement that very simply, isn't it more worth doing
> that than current situation because we may be able to collect more
> accurate stats.

Yes, I agree.
I attached the patch to send the stats before the wal writer and the
checkpointer exit.
(v17-0001-send-stats-for-walwriter-when-shutdown.patch,
v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
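
As a rough illustration of the approach discussed above (send the
remaining stats from the main loop once shutdown has been requested,
rather than from an exit callback), here is a simplified standalone
model. It is not the patch itself: shutdown_requested,
pending_wal_writes, reported_wal_writes, send_wal_stats() and
main_loop_iteration() are hypothetical stand-ins for
ShutdownRequestPending, the process-local WAL counters, the collector's
totals, pgstat_send_wal() and the walwriter loop body.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for process-local state. */
static bool shutdown_requested = false;
static long pending_wal_writes = 0;     /* counted but not yet reported */
static long reported_wal_writes = 0;    /* what the collector has seen */

/* Report the pending counters and reset them. */
static void
send_wal_stats(void)
{
    reported_wal_writes += pending_wal_writes;
    pending_wal_writes = 0;
}

/*
 * One iteration of the simulated main loop.  Returns false when the
 * process should exit.
 */
static bool
main_loop_iteration(int writes_this_cycle, bool report_interval_elapsed)
{
    assert(writes_this_cycle >= 0);

    /*
     * On shutdown, flush whatever is still pending before exiting, so
     * the counters from the last cycles are not silently dropped.
     */
    if (shutdown_requested)
    {
        send_wal_stats();
        return false;
    }

    /* Normal work: accumulate counters locally. */
    pending_wal_writes += writes_this_cycle;

    /* Periodic reporting, rate-limited by the caller. */
    if (report_interval_elapsed)
        send_wal_stats();

    return true;
}
```

This only covers a clean shutdown request; an exit caused by a FATAL
error would bypass the flush, which matches the trade-off accepted
earlier in the thread.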

>> BTW, I found BgWriterStats.m_timed_checkpoints is not counted in
>> ShutdownXLOG()
>> and we need to count it if we are to collect stats before it exits.
>
> Maybe m_requested_checkpoints should be incremented in that case?

I thought this should be incremented
because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.

```
/* called from ShutdownXLOG() */
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
```

I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.
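
For reference, the difference between the two counters can be shown with
a small standalone sketch. It assumes that a checkpoint counts as timed
only when it was started because the checkpoint timeout elapsed (marked
here with a CHECKPOINT_CAUSE_TIME flag) and as requested otherwise. That
rule, the flag values and the count_checkpoint() helper are assumptions
made for illustration, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values are illustrative only. */
#define CHECKPOINT_IS_SHUTDOWN  0x0001
#define CHECKPOINT_IMMEDIATE    0x0004
#define CHECKPOINT_CAUSE_TIME   0x0100  /* assumed marker for timed checkpoints */

/* Hypothetical mirror of the two counters discussed above. */
typedef struct CheckpointCounters
{
    long        timed_checkpoints;
    long        requested_checkpoints;
} CheckpointCounters;

/*
 * Classify one checkpoint by its flags: timed if it was triggered by
 * the timeout, requested in every other case.
 */
static void
count_checkpoint(CheckpointCounters *c, int flags)
{
    assert(c != NULL);

    if (flags & CHECKPOINT_CAUSE_TIME)
        c->timed_checkpoints++;
    else
        c->requested_checkpoints++;
}
```

Under that assumption, a shutdown checkpoint started with
CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE is counted as requested,
not timed.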

In addition, I rebased the patch for WAL receiver.
(v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v17-0001-send-stats-for-walwriter-when-shutdown.patch	text/x-diff	1.4 KB
v17-0002-send-stats-for-checkpointer-when-shutdown.patch	text/x-diff	2.5 KB
v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-diff	4.2 KB
v17-0004-guarantee-to-collect-last-stats-messages.patch	text/x-diff	9.6 KB
v17-measure-timing.patch	text/x-diff	18.3 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-11 02:52:07
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/11 9:38, Masahiro Ikeda wrote:
> On 2021-03-10 17:08, Fujii Masao wrote:
>> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>>> On 2021-03-09 17:51, Fujii Masao wrote:
>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>
>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>
>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>
>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>
>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>
>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>
>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>
>>>>>>>>>> OK, I fixed it.
>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>
>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>
>>>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>>>> +    if (enableFsync &&
>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>> +    {
>>>>>>>>>
>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>> to simplify the code more?
>>>>>>>>
>>>>>>>> Thanks for the comments.
>>>>>>>> I added the above code in v12 patch.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> +        /*
>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>> +         */
>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>
>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>
>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>
>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>
>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>
>>>>>>>>
>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>
>>>>>>>> +               /*
>>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>> +                * msec since we last sent one
>>>>>>>> +                */
>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>> +               {
>>>>>>>> +                       pgstat_send_wal();
>>>>>>>> +                       last_report = now;
>>>>>>>> +               }
>>>>>>>> +
>>>>>>>>
>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>
>>>>>> Agreed.
>>>>>>
>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>
>>>>>> I think that we can do that. What about the attached patch?
>>>>>
>>>>> Thanks, I thought it's better.
>>>>>
>>>>>
>>>>>>> I forgot to remove an unused variable.
>>>>>>> The attached v13 patch is fixed.
>>>>>>
>>>>>> Thanks for updating the patch!
>>>>>>
>>>>>> +        w.wal_write,
>>>>>> +        w.wal_write_time,
>>>>>> +        w.wal_sync,
>>>>>> +        w.wal_sync_time,
>>>>>>
>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>> each other? That is, what about the following order of columns?
>>>>>>
>>>>>> wal_write
>>>>>> wal_sync
>>>>>> wal_write_time
>>>>>> wal_sync_time
>>>>>
>>>>> Yes, I fixed it.
>>>>>
>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>> -            /* write synced it already */
>>>>>> -            break;
>>>>>>
>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>> here, as follows. Thought?
>>>>>>
>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>> +            /* not reachable */
>>>>>> +            Assert(false);
>>>>>
>>>>> I agree.
>>>>>
>>>>>
>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>> this issue in this patch?
>>>>>
>>>>> Thanks, I thought it's better to do so.
>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>
>>>> Thanks for 0003 patch!
>>>>
>>>> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
>>>> just send the stats only when ShutdownRequestPending is true in the walwriter
>>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>>> If we do this, we cannot send the stats when walwriter throws FATAL error.
>>>> But that's ok because FATAL error on walwriter causes the server to crash.
>>>> Thought?
>>>
>>> Thanks for your comments!
>>> Yes, I agree.
>>>
>>>
>>>> Also ISTM that we don't need to use the callback for that purpose in
>>>> checkpointer because of the same reason. That is, we can send the stats
>>>> just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
>>>> Thought?
>>>
>>> Yes, I think so too.
>>>
>>> Since ShutdownXLOG() may create restartpoint or checkpoint,
>>> it might generate WAL records.
>>>
>>>
>>>> I'm now not sure how much useful these changes are. As far as I read pgstat.c,
>>>> when shutdown is requested, the stats collector seems to exit even when
>>>> there are outstanding stats messages. So if checkpointer and walwriter send
>>>> the stats in their last cycles, those stats might not be collected.
>>>>
>>>> On the other hand, I can think that sending the stats in the last cycles would
>>>> improve the situation a bit than now. So I'm inclined to apply those changes...
>>>
>>> I didn't notice that. I agree this is an important aspect.
>>> I understood there is a case that the stats collector exits before the checkpointer
>>> or the walwriter exits and some stats might not be collected.
>>
>> IIUC the stats collector basically exits after checkpointer and walwriter exit.
>> But there seems no guarantee that the stats collector processes
>> all the messages that other processes have sent during the shutdown of
>> the server.
>
> Thanks, I understood the above postmaster behaviors.
>
> PMState manages the status and after checkpointer is exited, the postmaster sends
> SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
> (IIUC, although the postmaster kill the walsender, the archiver and
> the stats collector at the same time, it's ok because the walsender
> and the archiver doesn't send stats to the stats collector now.)
>
> But, there might be a corner case to lose stats sent by background workers like
> the checkpointer before they exit (although this is not implemented yet.)
>
> For example,
>
> 1. checkpointer send the stats before it exit
> 2. stats collector receive the signal and break before processing
>    the stats message from checkpointer. In this case, 1's message is lost.
> 3. stats collector writes the stats in the statsfiles and exit
>
> Why don't you recheck the coming message is zero just before the 2th procedure?
> (v17-0004-guarantee-to-collect-last-stats-messages.patch)

Yes, I was thinking the same. This is the straightforward fix for this issue.
The stats collector should process all the outstanding messages when
a normal shutdown is requested, as the patch does. On the other hand,
if an immediate shutdown or an emergency bailout (on postmaster death)
is requested, maybe the stats collector should skip that processing
and exit immediately.

But if we implement that, we would need to teach the stats collector
the shutdown type (i.e., normal or immediate), because
currently SIGQUIT is sent to the collector whichever shutdown is requested,
so the collector cannot distinguish the shutdown type. I'm afraid that
change is a bit overkill for now.
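
As a rough illustration of the idea above (not code from any of the attached patches), the following standalone sketch drains every message already queued on a socket before exiting on a normal shutdown, and skips the drain on an immediate one. The `ShutdownKind` enum and all function names are hypothetical; as noted above, the real collector receives only SIGQUIT and cannot currently tell the two cases apart.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical shutdown kinds; the real collector cannot distinguish them. */
typedef enum { SHUTDOWN_NORMAL, SHUTDOWN_IMMEDIATE } ShutdownKind;

/*
 * Read every datagram already queued on sock without blocking.
 * Returns the number of messages consumed, or -1 on an unexpected error.
 */
int
drain_pending_messages(int sock)
{
    char buf[256];
    int  processed = 0;

    assert(sock >= 0);
    for (;;)
    {
        ssize_t len = recv(sock, buf, sizeof(buf), MSG_DONTWAIT);

        if (len < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted, retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;          /* nothing left in the queue */
            return -1;
        }
        processed++;            /* a real collector would dispatch the message here */
    }
    return processed;
}

/* On a normal shutdown, drain first; on an immediate one, skip the drain. */
int
collector_shutdown(int sock, ShutdownKind kind)
{
    if (kind == SHUTDOWN_IMMEDIATE)
        return 0;
    return drain_pending_messages(sock);
}

/* Demo helper: queue two messages, then shut down and report how many were processed. */
int
demo_shutdown(ShutdownKind kind)
{
    int sv[2];
    int n;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;
    if (send(sv[0], "bgwriter", 8, 0) != 8 || send(sv[0], "wal", 3, 0) != 3)
        n = -1;
    else
        n = collector_shutdown(sv[1], kind);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

Calling `demo_shutdown(SHUTDOWN_NORMAL)` processes both queued messages, while `demo_shutdown(SHUTDOWN_IMMEDIATE)` processes none, which is the trade-off described above.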

BTW, I found that the collector calls pgstat_write_statsfiles() before exiting
even in the emergency bailout case. It's not necessary to save
the stats to the file in that case because the subsequent server startup does
crash recovery and clears that stats file. So wouldn't it be better to make
the collector exit immediately, without calling pgstat_write_statsfiles(),
in the emergency bailout case? This should probably be discussed in another
thread, though, because it's a different topic from the feature we're
discussing here.

>
>
> I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
> I don't have any strong opinion to handle this case since this result shows to receive and processes
> the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
> signal in 5msec(099->104) after the checkpointer process exits.

Agreed.

>
> ```
> 1615421204.556 [checkpointer] DEBUG: received shutdown request signal
> 1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make              # exit and send the messages
> 1615421208.099 [stats collector] DEBUG: process BGWRITER stats message              # receive and process the messages
> 1615421208.099 [stats collector] DEBUG: process WAL stats message
> 1615421208.104 [postmaster] DEBUG: reaping dead processes
> 1615421208.104 [stats collector] DEBUG: received shutdown request signal             # receive shutdown request from the postmaster
> ```
>
>>>> Of course, there is another direction; we can improve the stats collector so
>>>> that it guarantees to collect all the sent stats messages. But I'm afraid
>>>> this change might be big.
>>>
>>> For example, implement to manage background process status in shared memory and
>>> the stats collector collects the stats until another background process exits?
>>>
>>> In my understanding, the statistics are not required high accuracy,
>>> it's ok to ignore them if the impact is not big.
>>>
>>> If we guarantee high accuracy, another background process like autovacuum launcher
>>> must send the WAL stats because it accesses the system catalog and might generate
>>> WAL records due to HOT update even though the possibility is low.
>>>
>>> I thought the impact is small because the time uncollected stats are generated is
>>> short compared to the time from startup. So, it's ok to ignore the remaining stats
>>> when the process exists.
>>
>> I agree that it's not worth changing lots of code to collect such stats.
>> But if we can implement that very simply, isn't it more worth doing
>> that than current situation because we may be able to collect more
>> accurate stats.
>
> Yes, I agree.
> I attached the patch to send the stats before the wal writer and the checkpointer exit.
> (v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)

Thanks for making those patches! First, I'm reading the 0001 and 0002 patches.

Here are the review comments for the 0001 patch.

+/* Prototypes for private functions */
+static void HandleWalWriterInterrupts(void);

HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same,
so I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
if ShutdownRequestPending is true in the main loop. Attached is the patch
in which I implemented it that way. Thoughts?
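
To make the proposed ordering concrete, here is a minimal standalone simulation of the main loop. It is only a sketch: `send_wal_stats()` and `handle_main_loop_interrupts()` are stand-ins for pgstat_send_wal() and HandleMainLoopInterrupts(), and the shutdown "signal" is simulated by setting the flag at the top of a chosen cycle.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Stand-in for ShutdownRequestPending. */
volatile sig_atomic_t shutdown_request_pending = 0;

/* Counts how many times the stand-in stats sender was called. */
int stats_sent = 0;

/* Stand-in for pgstat_send_wal(force). */
void
send_wal_stats(bool force)
{
    (void) force;
    stats_sent++;
}

/* Stand-in for HandleMainLoopInterrupts(); returns true if the process would exit. */
bool
handle_main_loop_interrupts(void)
{
    return shutdown_request_pending != 0;
}

/*
 * Simulated walwriter main loop. The shutdown request arrives at the top of
 * cycle shutdown_at_cycle. Returns the cycle in which the loop exits.
 */
int
walwriter_loop(int shutdown_at_cycle)
{
    assert(shutdown_at_cycle >= 0);     /* otherwise the loop never ends */
    shutdown_request_pending = 0;
    stats_sent = 0;

    for (int cycle = 0;; cycle++)
    {
        if (cycle == shutdown_at_cycle)
            shutdown_request_pending = 1;   /* simulated signal */

        /* Send the remaining stats just before the interrupt handler exits. */
        if (shutdown_request_pending)
            send_wal_stats(true);

        if (handle_main_loop_interrupts())
            return cycle;

        /* ... the normal WAL-flushing work would go here ... */
    }
}
```

With `walwriter_loop(3)`, the loop exits in cycle 3 after exactly one forced send, and no new interrupt-handling function is needed.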

Here are the review comments for the 0002 patch.

+static void pgstat_send_checkpointer(void);

I'm inclined to avoid adding a function with the prefix "pgstat_" outside
pgstat.c. Instead, I'm OK with just calling both pgstat_send_bgwriter() and
pgstat_report_wal() directly after ShutdownXLOG(). Thoughts? Patch attached.

>
>
>>> BTW, I found BgWriterStats.m_timed_checkpoints is not counted in ShutdownLOG()
>>> and we need to count it if to collect stats before it exits.
>>
>> Maybe m_requested_checkpoints should be incremented in that case?
>
> I thought this should be incremented
> because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.

Yes.

>
> ```ShutdownXLOG()
> CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> ```
>
> I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.
>
>
> In addition, I rebased the patch for WAL receiver.
> (v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks! Will review this later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment	Content-Type	Size
v17-0001-send-stats-for-walwriter-when-shutdown_fujii.patch	text/plain	713 bytes
v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch	text/plain	818 bytes

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-11 12:29:38
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-11 11:52, Fujii Masao wrote:
> On 2021/03/11 9:38, Masahiro Ikeda wrote:
>> On 2021-03-10 17:08, Fujii Masao wrote:
>>> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>>>> On 2021-03-09 17:51, Fujii Masao wrote:
>>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked
>>>>>>>>>>>>>> during an
>>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is
>>>>>>>>>>>>>> also
>>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ("which normally called" should be "which is normally
>>>>>>>>>>>>>> called" or
>>>>>>>>>>>>>> "which normally is called" if you want to keep true to the
>>>>>>>>>>>>>> original)
>>>>>>>>>>>>>> You missed the adding the space before an opening
>>>>>>>>>>>>>> parenthesis here and
>>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly
>>>>>>>>>>>>>> query the
>>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> wal_write_time and the sync items also need the note:
>>>>>>>>>>>>>> "This is also
>>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>>
>>>>>>>>>>>>> I skipped changing it since I separated the stats for the
>>>>>>>>>>>>> WAL receiver
>>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of
>>>>>>>>>>>>>> this event is
>>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable
>>>>>>>>>>>>>> because ..."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require
>>>>>>>>>>>>>> explicitly
>>>>>>>>>>>>>> computing the sync statistics but does require computing
>>>>>>>>>>>>>> the write
>>>>>>>>>>>>>> statistics. This is because of the presence of
>>>>>>>>>>>>>> issue_xlog_fsync but
>>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I
>>>>>>>>>>>>>> observe that
>>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while
>>>>>>>>>>>>>> the WAL
>>>>>>>>>>>>>> receiver path does not. It seems technically
>>>>>>>>>>>>>> straight-forward to
>>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the
>>>>>>>>>>>>>> two places,
>>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding
>>>>>>>>>>>>>> another
>>>>>>>>>>>>>> function call to the stack given the importance of WAL
>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>> (though that seems marginalized compared to the cost of
>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way
>>>>>>>>>>>>>> and don't have
>>>>>>>>>>>>>> any shared code between the two but instead implement the
>>>>>>>>>>>>>> WAL receiver
>>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case,
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver
>>>>>>>>>>>>> stats.
>>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I added the infrastructure code to communicate the WAL
>>>>>>>>>>>>> receiver stats messages between the WAL receiver and the
>>>>>>>>>>>>> stats collector, and
>>>>>>>>>>>>> the stats for WAL receiver is counted in
>>>>>>>>>>>>> pg_stat_wal_receiver.
>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> On second thought, this idea seems not good. Because those
>>>>>>>>>>>> stats are
>>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver
>>>>>>>>>>>> process running
>>>>>>>>>>>> at that moment. IOW, it seems strange that some values show
>>>>>>>>>>>> dynamic
>>>>>>>>>>>> stats and the others show collected stats, even though they
>>>>>>>>>>>> are in
>>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>>
>>>>>>>>>>> OK, I fixed it.
>>>>>>>>>>> The stats collected in the WAL receiver is exposed in
>>>>>>>>>>> pg_stat_wal view in v11 patch.
>>>>>>>>>>
>>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>>
>>>>>>>>>> +    /* Check whether the WAL file was synced to disk right
>>>>>>>>>> now */
>>>>>>>>>> +    if (enableFsync &&
>>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>>> +    {
>>>>>>>>>>
>>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>>> if enableFsync is off, sync_method is open_sync or
>>>>>>>>>> open_data_sync,
>>>>>>>>>> to simplify the code more?
>>>>>>>>>
>>>>>>>>> Thanks for the comments.
>>>>>>>>> I added the above code in v12 patch.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +        /*
>>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has
>>>>>>>>>> elapsed to minimize
>>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>>> +         */
>>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>>
>>>>>>>>>> On second thought, this change means that it always takes
>>>>>>>>>> wal_writer_delay
>>>>>>>>>> before walwriter's WAL stats is sent after
>>>>>>>>>> XLogBackgroundFlush() is called.
>>>>>>>>>> For example, if wal_writer_delay is set to several seconds,
>>>>>>>>>> some values in
>>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those
>>>>>>>>>> seconds.
>>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to
>>>>>>>>>> send
>>>>>>>>>> the stats every after XLogBackgroundFlush() is called.
>>>>>>>>>> Thought?
>>>>>>>>>
>>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>>
>>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>>
>>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a
>>>>>>> risk
>>>>>>> that the WAL stats are sent too frequently. I agree that's a
>>>>>>> problem.
>>>>>>>
>>>>>>>>>
>>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>>
>>>>>>>>> +               /*
>>>>>>>>> +                * Don't send a message unless it's been at
>>>>>>>>> least
>>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>>> +                * msec since we last sent one
>>>>>>>>> +                */
>>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>>> +               if (TimestampDifferenceExceeds(last_report,
>>>>>>>>> now,
>>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>>> +               {
>>>>>>>>> +                       pgstat_send_wal();
>>>>>>>>> +                       last_report = now;
>>>>>>>>> +               }
>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> Although I worried that it's better to add the check code in
>>>>>>>>> pgstat_send_wal(),
>>>>>>>
>>>>>>> Agreed.
>>>>>>>
>>>>>>>>> I didn't do so because to avoid to double check
>>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already
>>>>>>>>> checks the
>>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>>
>>>>>>> I think that we can do that. What about the attached patch?
>>>>>>
>>>>>> Thanks, I thought it's better.
>>>>>>
>>>>>>
>>>>>>>> I forgot to remove an unused variable.
>>>>>>>> The attached v13 patch is fixed.
>>>>>>>
>>>>>>> Thanks for updating the patch!
>>>>>>>
>>>>>>> +        w.wal_write,
>>>>>>> +        w.wal_write_time,
>>>>>>> +        w.wal_sync,
>>>>>>> +        w.wal_sync_time,
>>>>>>>
>>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>>> each other? That is, what about the following order of columns?
>>>>>>>
>>>>>>> wal_write
>>>>>>> wal_sync
>>>>>>> wal_write_time
>>>>>>> wal_sync_time
>>>>>>
>>>>>> Yes, I fixed it.
>>>>>>
>>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>> -            /* write synced it already */
>>>>>>> -            break;
>>>>>>>
>>>>>>> IMO it's better to add Assert(false) here to ensure that we never
>>>>>>> reach
>>>>>>> here, as follows. Thought?
>>>>>>>
>>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>> +            /* not reachable */
>>>>>>> +            Assert(false);
>>>>>>
>>>>>> I agree.
>>>>>>
>>>>>>
>>>>>>> Even when a backend exits, it sends the stats via
>>>>>>> pgstat_beshutdown_hook().
>>>>>>> On the other hand, walwriter doesn't do that. Walwriter also
>>>>>>> should send
>>>>>>> the stats even at its exit? Otherwise some stats can fail to be
>>>>>>> collected.
>>>>>>> But ISTM that this issue existed from before, for example
>>>>>>> checkpointer
>>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill
>>>>>>> to fix
>>>>>>> this issue in this patch?
>>>>>>
>>>>>> Thanks, I thought it's better to do so.
>>>>>> I added the shutdown hook for the walwriter and the checkpointer
>>>>>> in v14-0003 patch.
>>>>>
>>>>> Thanks for 0003 patch!
>>>>>
>>>>> Isn't it overkill to send the stats in the walwriter-exit-callback?
>>>>> IMO we can
>>>>> just send the stats only when ShutdownRequestPending is true in the
>>>>> walwriter
>>>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>>>> If we do this, we cannot send the stats when walwriter throws FATAL
>>>>> error.
>>>>> But that's ok because FATAL error on walwriter causes the server to
>>>>> crash.
>>>>> Thought?
>>>>
>>>> Thanks for your comments!
>>>> Yes, I agree.
>>>>
>>>>
>>>>> Also ISTM that we don't need to use the callback for that purpose
>>>>> in
>>>>> checkpointer because of the same reason. That is, we can send the
>>>>> stats
>>>>> just after calling ShutdownXLOG(0, 0) in
>>>>> HandleCheckpointerInterrupts().
>>>>> Thought?
>>>>
>>>> Yes, I think so too.
>>>>
>>>> Since ShutdownXLOG() may create restartpoint or checkpoint,
>>>> it might generate WAL records.
>>>>
>>>>
>>>>> I'm now not sure how much useful these changes are. As far as I
>>>>> read pgstat.c,
>>>>> when shutdown is requested, the stats collector seems to exit even
>>>>> when
>>>>> there are outstanding stats messages. So if checkpointer and
>>>>> walwriter send
>>>>> the stats in their last cycles, those stats might not be collected.
>>>>>
>>>>> On the other hand, I can think that sending the stats in the last
>>>>> cycles would
>>>>> improve the situation a bit than now. So I'm inclined to apply
>>>>> those changes...
>>>>
>>>> I didn't notice that. I agree this is an important aspect.
>>>> I understood there is a case that the stats collector exits before
>>>> the checkpointer
>>>> or the walwriter exits and some stats might not be collected.
>>>
>>> IIUC the stats collector basically exits after checkpointer and
>>> walwriter exit.
>>> But there seems no guarantee that the stats collector processes
>>> all the messages that other processes have sent during the shutdown
>>> of
>>> the server.
>>
>> Thanks, I understood the above postmaster behaviors.
>>
>> PMState manages the status and after checkpointer is exited, the
>> postmaster sends
>> SIGQUIT signal to the stats collector if the shutdown mode is smart or
>> fast.
>> (IIUC, although the postmaster kill the walsender, the archiver and
>> the stats collector at the same time, it's ok because the walsender
>> and the archiver doesn't send stats to the stats collector now.)
>>
>> But, there might be a corner case to lose stats sent by background
>> workers like
>> the checkpointer before they exit (although this is not implemented
>> yet.)
>>
>> For example,
>>
>> 1. checkpointer send the stats before it exit
>> 2. stats collector receive the signal and break before processing
>>    the stats message from checkpointer. In this case, 1's message is
>> lost.
>> 3. stats collector writes the stats in the statsfiles and exit
>>
>> Why don't you recheck the coming message is zero just before the 2th
>> procedure?
>> (v17-0004-guarantee-to-collect-last-stats-messages.patch)
>
> Yes, I was thinking the same. This is the straight-forward fix for this
> issue.
> The stats collector should process all the outstanding messages when
> normal shutdown is requested, as the patch does. On the other hand,
> if immediate shutdown is requested or emergency bailout (by postmaster
> death)
> is requested, maybe the stats collector should skip those processings
> and exit immediately.
>
> But if we implement that, we would need to teach the stats collector
> the shutdown type (i.e., normal shutdown or immediate one). Because
> currently SIGQUIT is sent to the collector whichever shutdown is
> requested,
> and so the collector cannot distinguish the shutdown type. I'm afraid
> that
> change is a bit overkill for now.
>
> BTW, I found that the collector calls pgstat_write_statsfiles() even at
> emergency bailout case, before exiting. It's not necessary to save
> the stats to the file in that case because subsequent server startup
> does
> crash recovery and clears that stats file. So it's better to make
> the collector exit immediately without calling
> pgstat_write_statsfiles()
> at emergency bailout case? Probably this should be discussed in other
> thread because it's different topic from the feature we're discussing
> here,
> though.

IIUC, only the stats collector has its own handler for SIGQUIT, while the
other background processes share a common handler for it that just calls
_exit(2).
My idea was to guarantee that, when TerminateChildren(SIGTERM) is invoked, the
stats collector does not shut down before the other background processes have
shut down.

I will start another thread to discuss whether the stats collector should
know the shutdown type.
If it should, it's better to make the stats collector exit as soon as
possible when the shutdown type is immediate, and to avoid losing the
remaining stats when it's normal.

>> I measured the timing of the above in my linux laptop using
>> v17-measure-timing.patch.
>> I don't have any strong opinion to handle this case since this result
>> shows to receive and processes
>> the messages takes too short time (less than 1ms) although the stats
>> collector receives the shutdown
>> signal in 5msec(099->104) after the checkpointer process exits.
>
> Agreed.
>
>>
>> ```
>> 1615421204.556 [checkpointer] DEBUG: received shutdown request signal
>> 1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to
>> make              # exit and send the messages
>> 1615421208.099 [stats collector] DEBUG: process BGWRITER stats
>> message              # receive and process the messages
>> 1615421208.099 [stats collector] DEBUG: process WAL stats message
>> 1615421208.104 [postmaster] DEBUG: reaping dead processes
>> 1615421208.104 [stats collector] DEBUG: received shutdown request
>> signal             # receive shutdown request from the postmaster
>> ```
>>
>>>>> Of course, there is another direction; we can improve the stats
>>>>> collector so
>>>>> that it guarantees to collect all the sent stats messages. But I'm
>>>>> afraid
>>>>> this change might be big.
>>>>
>>>> For example, implement to manage background process status in shared
>>>> memory and
>>>> the stats collector collects the stats until another background
>>>> process exits?
>>>>
>>>> In my understanding, the statistics are not required high accuracy,
>>>> it's ok to ignore them if the impact is not big.
>>>>
>>>> If we guarantee high accuracy, another background process like
>>>> autovacuum launcher
>>>> must send the WAL stats because it accesses the system catalog and
>>>> might generate
>>>> WAL records due to HOT update even though the possibility is low.
>>>>
>>>> I thought the impact is small because the time uncollected stats are
>>>> generated is
>>>> short compared to the time from startup. So, it's ok to ignore the
>>>> remaining stats
>>>> when the process exists.
>>>
>>> I agree that it's not worth changing lots of code to collect such
>>> stats.
>>> But if we can implement that very simply, isn't it more worth doing
>>> that than current situation because we may be able to collect more
>>> accurate stats.
>>
>> Yes, I agree.
>> I attached the patch to send the stats before the wal writer and the
>> checkpointer exit.
>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch,
>> v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
>
> Thanks for making those patches! Firstly I'm reading 0001 and 0002
> patches.

Thanks for your comments and for making patches.

> Here is the review comments for 0001 patch.
>
> +/* Prototypes for private functions */
> +static void HandleWalWriterInterrupts(void);
>
> HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost
> the same.
> So I don't think that we need to introduce HandleWalWriterInterrupts().
> Instead,
> we can just call pgstat_send_wal(true) before
> HandleMainLoopInterrupts()
> if ShutdownRequestPending is true in the main loop. Attached is the
> patch
> I implemented that way. Thought?

I think there is a corner case where the stats are never sent:

```
// First, ShutdownRequestPending = false

if (ShutdownRequestPending) // not taken, so the stats are not sent
    pgstat_send_wal(true);

// A signal arrives here and ShutdownRequestPending becomes true

HandleMainLoopInterrupts(); // calls proc_exit() without sending the stats
```

Is that acceptable because it almost never occurs?
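
For illustration only, here is a minimal self-contained sketch of that race and of the usual remedy, which is to test the flag once and do both the send and the exit under that single test. Every name in it (shutdown_pending, sim_send_wal_stats, sim_proc_exit, and so on) is a hypothetical stand-in and not PostgreSQL code; whether the v17-0001 patch is structured exactly this way is an assumption.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for ShutdownRequestPending, pgstat_send_wal()
 * and proc_exit(); none of this is PostgreSQL code. */
static volatile sig_atomic_t shutdown_pending = 0;
static int  stats_sent = 0;
static bool exited = false;

static void sim_send_wal_stats(void) { stats_sent++; }

static void sim_proc_exit(void)
{
    assert(!exited);            /* a process exits only once */
    exited = true;
}

/* Models a signal handler firing at a chosen point in the loop. */
static void sim_signal_arrives(void) { shutdown_pending = 1; }

static void sim_reset(void)
{
    shutdown_pending = 0;
    stats_sent = 0;
    exited = false;
}

/* Racy: the flag is tested twice and the "signal" may land in between. */
static void racy_iteration(void (*between)(void))
{
    if (shutdown_pending)
        sim_send_wal_stats();
    if (between != NULL)
        between();
    if (shutdown_pending)       /* what the generic interrupt handler does */
        sim_proc_exit();
}

/* Race-free: one test guards both the send and the exit. */
static void safe_iteration(void (*after)(void))
{
    if (shutdown_pending)
    {
        sim_send_wal_stats();
        sim_proc_exit();
        return;
    }
    if (after != NULL)
        after();                /* late signal: picked up on the next pass */
}
```

Calling racy_iteration(sim_signal_arrives) ends with exited set and stats_sent still 0, which is the lost-stats case. With safe_iteration() a late signal is simply handled on the next call, and the stats are sent before the simulated exit.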

> Here is the review comments for 0002 patch.
>
> +static void pgstat_send_checkpointer(void);
>
> I'm inclined to avoid adding the function with the prefix "pgstat_"
> outside
> pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
> pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch
> attached.

Thanks. I agree.

>>>> BTW, I found BgWriterStats.m_timed_checkpoints is not counted in
>>>> ShutdownXLOG()
>>>> and we need to count it if we are to collect stats before it exits.
>>>
>>> Maybe m_requested_checkpoints should be incremented in that case?
>>
>> I thought this should be incremented
>> because it invokes the methods with CHECKPOINT_IS_SHUTDOWN.
>
> Yes.

OK, thanks.

>>
>> ```ShutdownXLOG()
>> CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
>> CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
>> ```
>>
>> I fixed in v17-0002-send-stats-for-checkpointer-when-shutdown.patch.
>>
>>
>> In addition, I rebased the patch for WAL receiver.
>> (v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
>
> Thanks! Will review this later.

Thanks a lot!

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-11 14:33:40
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/11 21:29, Masahiro Ikeda wrote:
> On 2021-03-11 11:52, Fujii Masao wrote:
>> On 2021/03/11 9:38, Masahiro Ikeda wrote:
>>> On 2021-03-10 17:08, Fujii Masao wrote:
>>>> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>>>>> On 2021-03-09 17:51, Fujii Masao wrote:
>>>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>>>
>>>>>>>>>>>> OK, I fixed it.
>>>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>>>
>>>>>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>>>>>> +    if (enableFsync &&
>>>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>>>> +    {
>>>>>>>>>>>
>>>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>>>> to simplify the code more?
>>>>>>>>>>
>>>>>>>>>> Thanks for the comments.
>>>>>>>>>> I added the above code in v12 patch.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +        /*
>>>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>>>> +         */
>>>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>>>
>>>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>>>
>>>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>>>
>>>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>>>
>>>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>>>
>>>>>>>>>> +               /*
>>>>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>>>> +                * msec since we last sent one
>>>>>>>>>> +                */
>>>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>>>> +               {
>>>>>>>>>> +                       pgstat_send_wal();
>>>>>>>>>> +                       last_report = now;
>>>>>>>>>> +               }
>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>>>
>>>>>>>> Agreed.
>>>>>>>>
>>>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>>>
>>>>>>>> I think that we can do that. What about the attached patch?
>>>>>>>
>>>>>>> Thanks, I thought it's better.
>>>>>>>
>>>>>>>
>>>>>>>>> I forgot to remove an unused variable.
>>>>>>>>> The attached v13 patch is fixed.
>>>>>>>>
>>>>>>>> Thanks for updating the patch!
>>>>>>>>
>>>>>>>> +        w.wal_write,
>>>>>>>> +        w.wal_write_time,
>>>>>>>> +        w.wal_sync,
>>>>>>>> +        w.wal_sync_time,
>>>>>>>>
>>>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>>>> each other? That is, what about the following order of columns?
>>>>>>>>
>>>>>>>> wal_write
>>>>>>>> wal_sync
>>>>>>>> wal_write_time
>>>>>>>> wal_sync_time
>>>>>>>
>>>>>>> Yes, I fixed it.
>>>>>>>
>>>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>> -            /* write synced it already */
>>>>>>>> -            break;
>>>>>>>>
>>>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>>>> here, as follows. Thought?
>>>>>>>>
>>>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>> +            /* not reachable */
>>>>>>>> +            Assert(false);
>>>>>>>
>>>>>>> I agree.
>>>>>>>
>>>>>>>
>>>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>>>> this issue in this patch?
>>>>>>>
>>>>>>> Thanks, I thought it's better to do so.
>>>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>>>
>>>>>> Thanks for 0003 patch!
>>>>>>
>>>>>> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
>>>>>> just send the stats only when ShutdownRequestPending is true in the walwriter
>>>>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>>>>> If we do this, we cannot send the stats when walwriter throws FATAL error.
>>>>>> But that's ok because FATAL error on walwriter causes the server to crash.
>>>>>> Thought?
>>>>>
>>>>> Thanks for your comments!
>>>>> Yes, I agree.
>>>>>
>>>>>
>>>>>> Also ISTM that we don't need to use the callback for that purpose in
>>>>>> checkpointer because of the same reason. That is, we can send the stats
>>>>>> just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
>>>>>> Thought?
>>>>>
>>>>> Yes, I think so too.
>>>>>
>>>>> Since ShutdownXLOG() may create restartpoint or checkpoint,
>>>>> it might generate WAL records.
>>>>>
>>>>>
>>>>>> I'm now not sure how much useful these changes are. As far as I read pgstat.c,
>>>>>> when shutdown is requested, the stats collector seems to exit even when
>>>>>> there are outstanding stats messages. So if checkpointer and walwriter send
>>>>>> the stats in their last cycles, those stats might not be collected.
>>>>>>
>>>>>> On the other hand, I can think that sending the stats in the last cycles would
>>>>>> improve the situation a bit than now. So I'm inclined to apply those changes...
>>>>>
>>>>> I didn't notice that. I agree this is an important aspect.
>>>>> I understood there is a case that the stats collector exits before the checkpointer
>>>>> or the walwriter exits and some stats might not be collected.
>>>>
>>>> IIUC the stats collector basically exits after checkpointer and walwriter exit.
>>>> But there seems no guarantee that the stats collector processes
>>>> all the messages that other processes have sent during the shutdown of
>>>> the server.
>>>
>>> Thanks, I understood the above postmaster behaviors.
>>>
>>> PMState manages the status and after checkpointer is exited, the postmaster sends
>>> SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
>>> (IIUC, although the postmaster kill the walsender, the archiver and
>>> the stats collector at the same time, it's ok because the walsender
>>> and the archiver doesn't send stats to the stats collector now.)
>>>
>>> But, there might be a corner case to lose stats sent by background workers like
>>> the checkpointer before they exit (although this is not implemented yet.)
>>>
>>> For example,
>>>
>>> 1. checkpointer send the stats before it exit
>>> 2. stats collector receive the signal and break before processing
>>>     the stats message from checkpointer. In this case, 1's message is lost.
>>> 3. stats collector writes the stats in the statsfiles and exit
>>>
>>> Why not recheck that no incoming messages remain just before step 2?
>>> (v17-0004-guarantee-to-collect-last-stats-messages.patch)
>>
>> Yes, I was thinking the same. This is the straight-forward fix for this issue.
>> The stats collector should process all the outstanding messages when
>> normal shutdown is requested, as the patch does. On the other hand,
>> if immediate shutdown is requested or emergency bailout (by postmaster death)
>> is requested, maybe the stats collector should skip those processings
>> and exit immediately.
>>
>> But if we implement that, we would need to teach the stats collector
>> the shutdown type (i.e., normal shutdown or immediate one). Because
>> currently SIGQUIT is sent to the collector whichever shutdown is requested,
>> and so the collector cannot distinguish the shutdown type. I'm afraid that
>> change is a bit overkill for now.
>>
>> BTW, I found that the collector calls pgstat_write_statsfiles() even at
>> emergency bailout case, before exiting. It's not necessary to save
>> the stats to the file in that case because subsequent server startup does
>> crash recovery and clears that stats file. So it's better to make
>> the collector exit immediately without calling pgstat_write_statsfiles()
>> at emergency bailout case? Probably this should be discussed in other
>> thread because it's different topic from the feature we're discussing here,
>> though.
>
> IIUC, only the stats collector has another handler for SIGQUIT although
> other background processes have a common handler for it and just call _exit(2).
> I thought to guarantee when TerminateChildren(SIGTERM) is invoked, don't make stats
> collector shutdown before other background processes are shutdown.
>
> I will make another thread to discuss that the stats collector should know the shutdown type or not.
> If it should be, it's better to make the stats collector exit as soon as possible if the shutdown type
> is an immediate, and avoid losing the remaining stats if it's normal.

>
>
>
>>> I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
>>> I don't have any strong opinion to handle this case since this result shows to receive and processes
>>> the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
>>> signal in 5msec(099->104) after the checkpointer process exits.
>>
>> Agreed.
>>
>>>
>>> ```
>>> 1615421204.556 [checkpointer] DEBUG: received shutdown request signal
>>> 1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make              # exit and send the messages
>>> 1615421208.099 [stats collector] DEBUG: process BGWRITER stats message              # receive and process the messages
>>> 1615421208.099 [stats collector] DEBUG: process WAL stats message
>>> 1615421208.104 [postmaster] DEBUG: reaping dead processes
>>> 1615421208.104 [stats collector] DEBUG: received shutdown request signal             # receive shutdown request from the postmaster
>>> ```
>>>
>>>>>> Of course, there is another direction; we can improve the stats collector so
>>>>>> that it guarantees to collect all the sent stats messages. But I'm afraid
>>>>>> this change might be big.
>>>>>
>>>>> For example, implement to manage background process status in shared memory and
>>>>> the stats collector collects the stats until another background process exits?
>>>>>
>>>>> In my understanding, the statistics are not required high accuracy,
>>>>> it's ok to ignore them if the impact is not big.
>>>>>
>>>>> If we guarantee high accuracy, another background process like autovacuum launcher
>>>>> must send the WAL stats because it accesses the system catalog and might generate
>>>>> WAL records due to HOT update even though the possibility is low.
>>>>>
>>>>> I thought the impact is small because the time uncollected stats are generated is
>>>>> short compared to the time from startup. So, it's ok to ignore the remaining stats
>>>>> when the process exits.
>>>>
>>>> I agree that it's not worth changing lots of code to collect such stats.
>>>> But if we can implement that very simply, isn't it more worth doing
>>>> that than current situation because we may be able to collect more
>>>> accurate stats.
>>>
>>> Yes, I agree.
>>> I attached the patch to send the stats before the wal writer and the checkpointer exit.
>>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
>>
>> Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.
>
> Thanks for your comments and for making patches.
>
>
>> Here is the review comments for 0001 patch.
>>
>> +/* Prototypes for private functions */
>> +static void HandleWalWriterInterrupts(void);
>>
>> HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
>> So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
>> we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
>> if ShutdownRequestPending is true in the main loop. Attached is the patch
>> I implemented that way. Thought?
>
> I thought there is a corner case that can't send the stats like

You're right! So IMO your patch (v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.

>
> ```
> // First, ShutdownRequestPending = false
>
>     if (ShutdownRequestPending)    // don't send the stats
>         pgstat_send_wal(true);
>
> // receive signal and ShutdownRequestPending became true
>
>     HandleMainLoopInterrupts();   // proc exit without sending the stats
>
> ```
>
> Is it ok because it almost never occurs?
>
>
>> Here is the review comments for 0002 patch.
>>
>> +static void pgstat_send_checkpointer(void);
>>
>> I'm inclined to avoid adding the function with the prefix "pgstat_" outside
>> pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
>> pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.
>
> Thanks. I agree.

Thanks for the review!

So, barring any objection, I will commit the changes for
walwriter and checkpointer. That is,

v17-0001-send-stats-for-walwriter-when-shutdown.patch
v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-12 03:39:22
Message-ID:	[email protected]
Lists:	pgsql-hackers

I read through the 0003 patch. Here are some comments for that.

With the patch, walreceiver's stats are counted as wal_write, wal_sync, wal_write_time and wal_sync_time in pg_stat_wal. But shouldn't they be counted in different columns, because WAL IO differs between walreceiver and other processes such as backends? For example, when open_sync or open_datasync is chosen as wal_sync_method, those other processes use the O_DIRECT flag to open WAL files, but walreceiver does not. Also, those other processes write WAL data in block units, but walreceiver does not. So I'm concerned that mixing different WAL IO stats in the same columns would confuse users. Thought? I'd like to hear more opinions about how to expose walreceiver's stats to users.

+int
+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)

This common function writes WAL data and measures IO timing. IMO we can refactor the code further by making this function also handle the case where pg_pwrite() reports an error. In other words, I think that the function should do what the do-while loop block in XLogWrite() does. Thought?

BTW, currently XLogWrite() increments IO timing even when pg_pwrite() reports an error. But this is useless. Probably IO timing should be incremented after the return code of pg_pwrite() is checked, instead?
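
To make the two suggestions above concrete, here is a rough sketch, not the patch itself: one helper owns the retry loop, and the elapsed time is added only after the return value of the write call has been checked. It uses plain POSIX pwrite() and clock_gettime() rather than PostgreSQL's wrappers, and WalIoStats, wal_write_timed() and demo_count_writes() are hypothetical names. Treating a zero-byte write as "out of disk space" is likewise an assumption of the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical counters, loosely modelled on wal_write / wal_write_time. */
typedef struct WalIoStats
{
    uint64_t writes;            /* successful write calls */
    uint64_t write_time_us;     /* time spent in successful calls only */
} WalIoStats;

static uint64_t elapsed_us(const struct timespec *s, const struct timespec *e)
{
    int64_t sec = (int64_t) e->tv_sec - (int64_t) s->tv_sec;
    int64_t nsec = (int64_t) e->tv_nsec - (int64_t) s->tv_nsec;

    return (uint64_t) (sec * 1000000 + nsec / 1000);
}

/* Write nbyte bytes at offset, retrying on EINTR and on short writes.
 * Returns 0 on success, -1 with errno set on failure. */
static int wal_write_timed(int fd, const void *buf, size_t nbyte,
                           off_t offset, WalIoStats *stats)
{
    const char *p = buf;
    size_t left = nbyte;

    assert(stats != NULL);
    while (left > 0)
    {
        struct timespec start, end;
        ssize_t written;

        clock_gettime(CLOCK_MONOTONIC, &start);
        written = pwrite(fd, p, left, offset);
        clock_gettime(CLOCK_MONOTONIC, &end);

        if (written < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted: retry, count nothing */
            return -1;          /* real error: timing is not recorded */
        }
        if (written == 0)
        {
            errno = ENOSPC;     /* assumption: no progress means disk full */
            return -1;
        }

        /* Only now, after the result check, update the statistics. */
        stats->writes++;
        stats->write_time_us += elapsed_us(&start, &end);
        p += written;
        left -= (size_t) written;
        offset += written;
    }
    return 0;
}

/* Usage example: write 8 bytes to a scratch file, return the write count. */
static uint64_t demo_count_writes(void)
{
    char path[] = "/tmp/wal_sketch_XXXXXX";
    WalIoStats stats = {0, 0};
    int fd = mkstemp(path);

    if (fd < 0)
        return 0;
    unlink(path);
    if (wal_write_timed(fd, "walbytes", 8, 0, &stats) != 0)
        stats.writes = 0;
    close(fd);
    return stats.writes;
}
```

With this shape a failed call leaves both counters untouched, so callers such as XLogWrite() or the walreceiver would only need to decide how to report the error.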

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-12 05:25:27
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/11 23:33, Fujii Masao wrote:
>
>
> On 2021/03/11 21:29, Masahiro Ikeda wrote:
>> On 2021-03-11 11:52, Fujii Masao wrote:
>>> On 2021/03/11 9:38, Masahiro Ikeda wrote:
>>>> On 2021-03-10 17:08, Fujii Masao wrote:
>>>>> On 2021/03/10 14:11, Masahiro Ikeda wrote:
>>>>>> On 2021-03-09 17:51, Fujii Masao wrote:
>>>>>>> On 2021/03/05 8:38, Masahiro Ikeda wrote:
>>>>>>>> On 2021-03-05 01:02, Fujii Masao wrote:
>>>>>>>>> On 2021/03/04 16:14, Masahiro Ikeda wrote:
>>>>>>>>>> On 2021-03-03 20:27, Masahiro Ikeda wrote:
>>>>>>>>>>> On 2021-03-03 16:30, Fujii Masao wrote:
>>>>>>>>>>>> On 2021/03/03 14:33, Masahiro Ikeda wrote:
>>>>>>>>>>>>> On 2021-02-24 16:14, Fujii Masao wrote:
>>>>>>>>>>>>>> On 2021/02/15 11:59, Masahiro Ikeda wrote:
>>>>>>>>>>>>>>> On 2021-02-10 00:51, David G. Johnston wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
>>>>>>>>>>>>>>>> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I pgindented the patches.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ... <function>XLogWrite</function>, which is invoked during an
>>>>>>>>>>>>>>>> <function>XLogFlush</function> request (see ...). This is also
>>>>>>>>>>>>>>>> incremented by the WAL receiver during replication.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ("which normally called" should be "which is normally called" or
>>>>>>>>>>>>>>>> "which normally is called" if you want to keep true to the original)
>>>>>>>>>>>>>>>> You missed the adding the space before an opening parenthesis here and
>>>>>>>>>>>>>>>> elsewhere (probably copy-paste)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> is ether -> is either
>>>>>>>>>>>>>>>> "This parameter is off by default as it will repeatedly query the
>>>>>>>>>>>>>>>> operating system..."
>>>>>>>>>>>>>>>> ", because" -> "as"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, I fixed them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wal_write_time and the sync items also need the note: "This is also
>>>>>>>>>>>>>>>> incremented by the WAL receiver during replication."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I skipped changing it since I separated the stats for the WAL receiver
>>>>>>>>>>>>>>> in pg_stat_wal_receiver.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> "The number of times it happened..." -> " (the tally of this event is
>>>>>>>>>>>>>>>> reported in wal_buffers_full in....) This is undesirable because ..."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, I fixed it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I notice that the patch for WAL receiver doesn't require explicitly
>>>>>>>>>>>>>>>> computing the sync statistics but does require computing the write
>>>>>>>>>>>>>>>> statistics. This is because of the presence of issue_xlog_fsync but
>>>>>>>>>>>>>>>> absence of an equivalent pg_xlog_pwrite. Additionally, I observe that
>>>>>>>>>>>>>>>> the XLogWrite code path calls pgstat_report_wait_*() while the WAL
>>>>>>>>>>>>>>>> receiver path does not. It seems technically straight-forward to
>>>>>>>>>>>>>>>> refactor here to avoid the almost-duplicated logic in the two places,
>>>>>>>>>>>>>>>> though I suspect there may be a trade-off for not adding another
>>>>>>>>>>>>>>>> function call to the stack given the importance of WAL processing
>>>>>>>>>>>>>>>> (though that seems marginalized compared to the cost of actually
>>>>>>>>>>>>>>>> writing the WAL). Or, as Fujii noted, go the other way and don't have
>>>>>>>>>>>>>>>> any shared code between the two but instead implement the WAL receiver
>>>>>>>>>>>>>>>> one to use pg_stat_wal_receiver instead. In either case, this
>>>>>>>>>>>>>>>> half-and-half implementation seems undesirable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OK, as Fujii-san mentioned, I separated the WAL receiver stats.
>>>>>>>>>>>>>>> (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for updating the patches!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and
>>>>>>>>>>>>>>> the stats for WAL receiver is counted in pg_stat_wal_receiver.
>>>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On second thought, this idea seems not good. Because those stats are
>>>>>>>>>>>>>> collected between multiple walreceivers, but other values in
>>>>>>>>>>>>>> pg_stat_wal_receiver is only related to the walreceiver process running
>>>>>>>>>>>>>> at that moment. IOW, it seems strange that some values show dynamic
>>>>>>>>>>>>>> stats and the others show collected stats, even though they are in
>>>>>>>>>>>>>> the same view pg_stat_wal_receiver. Thought?
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, I fixed it.
>>>>>>>>>>>>> The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for updating the patches! I'm now reading 001 patch.
>>>>>>>>>>>>
>>>>>>>>>>>> +    /* Check whether the WAL file was synced to disk right now */
>>>>>>>>>>>> +    if (enableFsync &&
>>>>>>>>>>>> +        (sync_method == SYNC_METHOD_FSYNC ||
>>>>>>>>>>>> +         sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH ||
>>>>>>>>>>>> +         sync_method == SYNC_METHOD_FDATASYNC))
>>>>>>>>>>>> +    {
>>>>>>>>>>>>
>>>>>>>>>>>> Isn't it better to make issue_xlog_fsync() return immediately
>>>>>>>>>>>> if enableFsync is off, sync_method is open_sync or open_data_sync,
>>>>>>>>>>>> to simplify the code more?
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the comments.
>>>>>>>>>>> I added the above code in v12 patch.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> +        /*
>>>>>>>>>>>> +         * Send WAL statistics only if WalWriterDelay has elapsed to minimize
>>>>>>>>>>>> +         * the overhead in WAL-writing.
>>>>>>>>>>>> +         */
>>>>>>>>>>>> +        if (rc & WL_TIMEOUT)
>>>>>>>>>>>> +            pgstat_send_wal();
>>>>>>>>>>>>
>>>>>>>>>>>> On second thought, this change means that it always takes wal_writer_delay
>>>>>>>>>>>> before walwriter's WAL stats is sent after XLogBackgroundFlush() is called.
>>>>>>>>>>>> For example, if wal_writer_delay is set to several seconds, some values in
>>>>>>>>>>>> pg_stat_wal would be not up-to-date meaninglessly for those seconds.
>>>>>>>>>>>> So I'm thinking to withdraw my previous comment and it's ok to send
>>>>>>>>>>>> the stats every after XLogBackgroundFlush() is called. Thought?
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I didn't notice that.
>>>>>>>>>>>
>>>>>>>>>>> Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's
>>>>>>>>>>> default value is 200msec and it may be set shorter time.
>>>>>>>>>
>>>>>>>>> Yeah, if wal_writer_delay is set to very small value, there is a risk
>>>>>>>>> that the WAL stats are sent too frequently. I agree that's a problem.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Why don't to make another way to check the timestamp?
>>>>>>>>>>>
>>>>>>>>>>> +               /*
>>>>>>>>>>> +                * Don't send a message unless it's been at least
>>>>>>>>>>> PGSTAT_STAT_INTERVAL
>>>>>>>>>>> +                * msec since we last sent one
>>>>>>>>>>> +                */
>>>>>>>>>>> +               now = GetCurrentTimestamp();
>>>>>>>>>>> +               if (TimestampDifferenceExceeds(last_report, now,
>>>>>>>>>>> PGSTAT_STAT_INTERVAL))
>>>>>>>>>>> +               {
>>>>>>>>>>> +                       pgstat_send_wal();
>>>>>>>>>>> +                       last_report = now;
>>>>>>>>>>> +               }
>>>>>>>>>>> +
>>>>>>>>>>>
>>>>>>>>>>> Although I worried that it's better to add the check code in pgstat_send_wal(),
>>>>>>>>>
>>>>>>>>> Agreed.
>>>>>>>>>
>>>>>>>>>>> I didn't do so because to avoid to double check PGSTAT_STAT_INTERVAL.
>>>>>>>>>>> pgstat_send_wal() is invoked pg_report_stat() and it already checks the
>>>>>>>>>>> PGSTAT_STAT_INTERVAL.
>>>>>>>>>
>>>>>>>>> I think that we can do that. What about the attached patch?
>>>>>>>>
>>>>>>>> Thanks, I thought it's better.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> I forgot to remove an unused variable.
>>>>>>>>>> The attached v13 patch is fixed.
>>>>>>>>>
>>>>>>>>> Thanks for updating the patch!
>>>>>>>>>
>>>>>>>>> +        w.wal_write,
>>>>>>>>> +        w.wal_write_time,
>>>>>>>>> +        w.wal_sync,
>>>>>>>>> +        w.wal_sync_time,
>>>>>>>>>
>>>>>>>>> It's more natural to put wal_write_time and wal_sync_time next to
>>>>>>>>> each other? That is, what about the following order of columns?
>>>>>>>>>
>>>>>>>>> wal_write
>>>>>>>>> wal_sync
>>>>>>>>> wal_write_time
>>>>>>>>> wal_sync_time
>>>>>>>>
>>>>>>>> Yes, I fixed it.
>>>>>>>>
>>>>>>>>> -        case SYNC_METHOD_OPEN:
>>>>>>>>> -        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>>> -            /* write synced it already */
>>>>>>>>> -            break;
>>>>>>>>>
>>>>>>>>> IMO it's better to add Assert(false) here to ensure that we never reach
>>>>>>>>> here, as follows. Thought?
>>>>>>>>>
>>>>>>>>> +        case SYNC_METHOD_OPEN:
>>>>>>>>> +        case SYNC_METHOD_OPEN_DSYNC:
>>>>>>>>> +            /* not reachable */
>>>>>>>>> +            Assert(false);
>>>>>>>>
>>>>>>>> I agree.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Even when a backend exits, it sends the stats via pgstat_beshutdown_hook().
>>>>>>>>> On the other hand, walwriter doesn't do that. Walwriter also should send
>>>>>>>>> the stats even at its exit? Otherwise some stats can fail to be collected.
>>>>>>>>> But ISTM that this issue existed from before, for example checkpointer
>>>>>>>>> doesn't call pgstat_send_bgwriter() at its exit, so it's overkill to fix
>>>>>>>>> this issue in this patch?
>>>>>>>>
>>>>>>>> Thanks, I thought it's better to do so.
>>>>>>>> I added the shutdown hook for the walwriter and the checkpointer in v14-0003 patch.
>>>>>>>
>>>>>>> Thanks for 0003 patch!
>>>>>>>
>>>>>>> Isn't it overkill to send the stats in the walwriter-exit-callback? IMO we can
>>>>>>> just send the stats only when ShutdownRequestPending is true in the walwriter
>>>>>>> main loop (maybe just before calling HandleMainLoopInterrupts()).
>>>>>>> If we do this, we cannot send the stats when walwriter throws FATAL error.
>>>>>>> But that's ok because FATAL error on walwriter causes the server to crash.
>>>>>>> Thought?
>>>>>>
>>>>>> Thanks for your comments!
>>>>>> Yes, I agree.
>>>>>>
>>>>>>
>>>>>>> Also ISTM that we don't need to use the callback for that purpose in
>>>>>>> checkpointer because of the same reason. That is, we can send the stats
>>>>>>> just after calling ShutdownXLOG(0, 0) in HandleCheckpointerInterrupts().
>>>>>>> Thought?
>>>>>>
>>>>>> Yes, I think so too.
>>>>>>
>>>>>> Since ShutdownXLOG() may create restartpoint or checkpoint,
>>>>>> it might generate WAL records.
>>>>>>
>>>>>>
>>>>>>> I'm now not sure how much useful these changes are. As far as I read pgstat.c,
>>>>>>> when shutdown is requested, the stats collector seems to exit even when
>>>>>>> there are outstanding stats messages. So if checkpointer and walwriter send
>>>>>>> the stats in their last cycles, those stats might not be collected.
>>>>>>>
>>>>>>> On the other hand, I can think that sending the stats in the last cycles would
>>>>>>> improve the situation a bit than now. So I'm inclined to apply those changes...
>>>>>>
>>>>>> I didn't notice that. I agree this is an important aspect.
>>>>>> I understood there is a case that the stats collector exits before the checkpointer
>>>>>> or the walwriter exits and some stats might not be collected.
>>>>>
>>>>> IIUC the stats collector basically exits after checkpointer and walwriter exit.
>>>>> But there seems no guarantee that the stats collector processes
>>>>> all the messages that other processes have sent during the shutdown of
>>>>> the server.
>>>>
>>>> Thanks, I understood the above postmaster behaviors.
>>>>
>>>> PMState manages the status and after checkpointer is exited, the postmaster sends
>>>> SIGQUIT signal to the stats collector if the shutdown mode is smart or fast.
>>>> (IIUC, although the postmaster kill the walsender, the archiver and
>>>> the stats collector at the same time, it's ok because the walsender
>>>> and the archiver doesn't send stats to the stats collector now.)
>>>>
>>>> But, there might be a corner case to lose stats sent by background workers like
>>>> the checkpointer before they exit (although this is not implemented yet.)
>>>>
>>>> For example,
>>>>
>>>> 1. checkpointer send the stats before it exit
>>>> 2. stats collector receive the signal and break before processing
>>>>     the stats message from checkpointer. In this case, 1's message is lost.
>>>> 3. stats collector writes the stats in the statsfiles and exit
>>>>
>>>> Why don't we recheck that no messages are still incoming just before step 2?
>>>> (v17-0004-guarantee-to-collect-last-stats-messages.patch)
>>>
>>> Yes, I was thinking the same. This is the straight-forward fix for this issue.
>>> The stats collector should process all the outstanding messages when
>>> normal shutdown is requested, as the patch does. On the other hand,
>>> if immediate shutdown is requested or emergency bailout (by postmaster death)
>>> is requested, maybe the stats collector should skip those processings
>>> and exit immediately.
>>>
>>> But if we implement that, we would need to teach the stats collector
>>> the shutdown type (i.e., normal shutdown or immediate one). Because
>>> currently SIGQUIT is sent to the collector whichever shutdown is requested,
>>> and so the collector cannot distinguish the shutdown type. I'm afraid that
>>> change is a bit overkill for now.
>>>
>>> BTW, I found that the collector calls pgstat_write_statsfiles() even at
>>> emergency bailout case, before exiting. It's not necessary to save
>>> the stats to the file in that case because subsequent server startup does
>>> crash recovery and clears that stats file. So it's better to make
>>> the collector exit immediately without calling pgstat_write_statsfiles()
>>> at emergency bailout case? Probably this should be discussed in other
>>> thread because it's different topic from the feature we're discussing here,
>>> though.
>>
>> IIUC, only the stats collector has its own handler for SIGQUIT, whereas
>> other background processes share a common handler that just calls _exit(2).
>> I thought we should guarantee that, when TerminateChildren(SIGTERM) is invoked, the stats
>> collector doesn't shut down before the other background processes do.
>>
>> I will make another thread to discuss that the stats collector should know the shutdown type or not.
>> If it should be, it's better to make the stats collector exit as soon as possible if the shutdown type
>> is an immediate, and avoid losing the remaining stats if it's normal.
>
> +1
>
>
>>
>>
>>
>>>> I measured the timing of the above in my linux laptop using v17-measure-timing.patch.
>>>> I don't have any strong opinion to handle this case since this result shows to receive and processes
>>>> the messages takes too short time (less than 1ms) although the stats collector receives the shutdown
>>>> signal in 5msec(099->104) after the checkpointer process exits.
>>>
>>> Agreed.
>>>
>>>>
>>>> ```
>>>> 1615421204.556 [checkpointer] DEBUG: received shutdown request signal
>>>> 1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make              # exit and send the messages
>>>> 1615421208.099 [stats collector] DEBUG: process BGWRITER stats message              # receive and process the messages
>>>> 1615421208.099 [stats collector] DEBUG: process WAL stats message
>>>> 1615421208.104 [postmaster] DEBUG: reaping dead processes
>>>> 1615421208.104 [stats collector] DEBUG: received shutdown request signal             # receive shutdown request from the postmaster
>>>> ```
>>>>
>>>>>>> Of course, there is another direction; we can improve the stats collector so
>>>>>>> that it guarantees to collect all the sent stats messages. But I'm afraid
>>>>>>> this change might be big.
>>>>>>
>>>>>> For example, implement to manage background process status in shared memory and
>>>>>> the stats collector collects the stats until another background process exits?
>>>>>>
>>>>>> In my understanding, the statistics are not required high accuracy,
>>>>>> it's ok to ignore them if the impact is not big.
>>>>>>
>>>>>> If we guarantee high accuracy, another background process like autovacuum launcher
>>>>>> must send the WAL stats because it accesses the system catalog and might generate
>>>>>> WAL records due to HOT update even though the possibility is low.
>>>>>>
>>>>>> I thought the impact is small because the time uncollected stats are generated is
>>>>>> short compared to the time from startup. So, it's ok to ignore the remaining stats
>>>>>> when the process exists.
>>>>>
>>>>> I agree that it's not worth changing lots of code to collect such stats.
>>>>> But if we can implement that very simply, isn't it more worth doing
>>>>> that than current situation because we may be able to collect more
>>>>> accurate stats.
>>>>
>>>> Yes, I agree.
>>>> I attached the patch to send the stats before the wal writer and the checkpointer exit.
>>>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch, v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
>>>
>>> Thanks for making those patches! Firstly I'm reading 0001 and 0002 patches.
>>
>> Thanks for your comments and for making patches.
>>
>>
>>> Here is the review comments for 0001 patch.
>>>
>>> +/* Prototypes for private functions */
>>> +static void HandleWalWriterInterrupts(void);
>>>
>>> HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are almost the same.
>>> So I don't think that we need to introduce HandleWalWriterInterrupts(). Instead,
>>> we can just call pgstat_send_wal(true) before HandleMainLoopInterrupts()
>>> if ShutdownRequestPending is true in the main loop. Attached is the patch
>>> I implemented that way. Thought?
>>
>> I thought there is a corner case that can't send the stats like
>
> You're right! So IMO your patch (v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.
>
>
>>
>> ```
>> // First, ShutdownRequestPending = false
>>
>>      if (ShutdownRequestPending)    // don't send the stats
>>          pgstat_send_wal(true);
>>
>> // receive signal and ShutdownRequestPending became true
>>
>>      HandleMainLoopInterrupts();   // proc exit without sending the stats
>>
>> ```
>>
>> Is it ok because it almost never occurs?
>>
>>
>>> Here is the review comments for 0002 patch.
>>>
>>> +static void pgstat_send_checkpointer(void);
>>>
>>> I'm inclined to avoid adding the function with the prefix "pgstat_" outside
>>> pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter() and
>>> pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch attached.
>>
>> Thanks. I agree.
>
> Thanks for the review!
>
>
> So, barring any objection, I will commit the changes for
> walwriter and checkpointer. That is,
>
> v17-0001-send-stats-for-walwriter-when-shutdown.patch
> v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch

I pushed these two patches.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-15 01:39:06
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-12 12:39, Fujii Masao wrote:
> On 2021/03/11 21:29, Masahiro Ikeda wrote:
>>>> In addition, I rebased the patch for WAL receiver.
>>>> (v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>>
>>> Thanks! Will review this later.
>>
>> Thanks a lot!
>
> I read through the 0003 patch. Here are some comments for that.
>
> With the patch, walreceiver's stats are counted as wal_write,
> wal_sync, wal_write_time and wal_sync_time in pg_stat_wal. But they
> should be counted as different columns because WAL IO is different
> between walreceiver and other processes like a backend? For example, when
> open_sync or open_datasync is chosen as wal_sync_method, those other
> processes use O_DIRECT flag to open WAL files, but walreceiver does
> not. Also, those other processes write WAL data in block units,
> but walreceiver does not. So I'm concerned that mixing different WAL
> IO stats in the same columns would confuse the users. Thought? I'd
> like to hear more opinions about how to expose walreceiver's stats to
> users.

Thanks. I understand now that get_sync_bit() checks the sync flags, and that
the write unit differs between generated WAL data and replicated WAL
data.
(Whether or not to use the kernel cache is an interesting optimization.)

OK. Although I agree with separating the stats for the walreceiver,
I want to hear opinions from other people too, so I didn't change the
patch.

Please feel free to comment.

> +int
> +XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset)
>
> This common function writes WAL data and measures IO timing. IMO we
> can refactor the code furthermore by making this function handle the
> case where pg_write() reports an error. In other words, I think that
> the function should do what do-while loop block in XLogWrite() does.
> Thought?

OK. I agree.

I wonder whether the error checks should differ depending on who calls
this function.
Currently, only the walreceiver checks (1) errno==0 and doesn't check
(2) errno==EINTR.
Other processes do the opposite.

IIUC, it's appropriate for every process to check both (1) and (2).
Please let me know if my understanding is wrong.
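
To make the two checks concrete, here is a minimal standalone sketch of a
write loop that applies both (1) and (2). It is not the XLogWriteFile() from
the patch: write_all_at() is a hypothetical name, it calls plain POSIX
pwrite() instead of pg_pwrite(), and it returns an error code where the real
code would report via ereport().

```c
#define _XOPEN_SOURCE 700       /* for pwrite() */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration only).
 * Writes exactly nbytes from buf at the given offset.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
write_all_at(int fd, const char *buf, size_t nbytes, off_t offset)
{
    size_t      nleft = nbytes;

    while (nleft > 0)
    {
        ssize_t     written;

        errno = 0;
        written = pwrite(fd, buf, nleft, offset);
        if (written <= 0)
        {
            /* check (2): an interrupted call is simply retried */
            if (written < 0 && errno == EINTR)
                continue;
            /* check (1): no errno was reported, so assume out of disk space */
            if (errno == 0)
                errno = ENOSPC;
            return -1;
        }
        /* partial write: advance past what was written and loop again */
        nleft -= (size_t) written;
        buf += written;
        offset += written;
    }
    return 0;
}
```

With a single loop like this, neither caller needs its own nleft/from
bookkeeping.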

> BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
> reports an error. But this is useless. Probably IO timing should be
> incremented after the return code of pg_pwrite() is checked, instead?

Yes, I agree. I fixed it.
(v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

BTW, thanks for pointing out in person that the bgwriter process also
generates WAL data.
I confirmed that it generates WAL when it takes a snapshot via
LogStandbySnapshot().
I attached a patch that makes the bgwriter send the WAL stats.
(v18-0005-send-stats-for-bgwriter-when-shutdown.patch)

This patch includes the following changes.

(1) Add to pgstat_send_bgwriter() the same mechanism that pgstat_send_wal()
has: send the stats only if PGSTAT_STAT_INTERVAL msec has passed.
This avoids overloading the stats collector, because "bgwriter_delay"
can be set as low as 10 msec.

(2) Remove pgstat_report_wal() and integrate it into pgstat_send_wal(),
because the bgwriter sends WalStats.m_wal_records and, again, to avoid
overloading the collector (see (1)).
IIUC, the benefit of keeping them separate is to reduce the
calculation cost of WalUsageAccumDiff(), but the impact is limited.

(3) Add a new signal handler for the bgwriter that forces sending the
remaining stats during shutdown (needed because of (1)), and remove
HandleMainLoopInterrupts() because no process uses it anymore.
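
The interplay between (1) and (3) can be sketched as follows. This is a
simplified standalone model, not code from the patch: StatsReporter,
report_stats() and STAT_INTERVAL_MS are hypothetical names, the interval
value is only illustrative, and "sending" is modeled as adding to a counter.

```c
#include <stdbool.h>
#include <stdint.h>

#define STAT_INTERVAL_MS 500    /* stands in for PGSTAT_STAT_INTERVAL */

typedef struct StatsReporter
{
    int64_t     last_report_ms; /* time of the last send */
    uint64_t    pending;        /* counters accumulated since then */
    uint64_t    sent_total;     /* what the "collector" has received */
} StatsReporter;

/*
 * Send the pending counters if the interval has elapsed, or
 * unconditionally when force is true (e.g. during shutdown).
 * Returns true if something was sent.
 */
bool
report_stats(StatsReporter *r, int64_t now_ms, bool force)
{
    /* nothing accumulated, so there is nothing to send */
    if (r->pending == 0)
        return false;

    /* throttle: skip unless forced or the interval has passed */
    if (!force && now_ms - r->last_report_ms < STAT_INTERVAL_MS)
        return false;

    r->sent_total += r->pending;
    r->pending = 0;
    r->last_report_ms = now_ms;
    return true;
}
```

Because of the throttle, a process with a short loop delay can be holding
unsent counters when it exits, which is why the shutdown path calls with
force = true.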

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-diff	6.7 KB
v18-0005-send-stats-for-bgwriter-when-shutdown.patch	text/x-diff	10.4 KB

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-15 01:54:01
Message-ID:	[email protected]
Lists:	pgsql-hackers

>> On 2021/03/11 21:29, Masahiro Ikeda wrote:
>>> On 2021-03-11 11:52, Fujii Masao wrote:
>>>> On 2021/03/11 9:38, Masahiro Ikeda wrote:
>>>>> On 2021-03-10 17:08, Fujii Masao wrote:
>>>>>> IIUC the stats collector basically exits after checkpointer and
>>>>>> walwriter exit.
>>>>>> But there seems no guarantee that the stats collector processes
>>>>>> all the messages that other processes have sent during the
>>>>>> shutdown of
>>>>>> the server.
>>>>>
>>>>> Thanks, I understood the above postmaster behaviors.
>>>>>
>>>>> PMState manages the status and after checkpointer is exited, the
>>>>> postmaster sends
>>>>> SIGQUIT signal to the stats collector if the shutdown mode is smart
>>>>> or fast.
>>>>> (IIUC, although the postmaster kill the walsender, the archiver and
>>>>> the stats collector at the same time, it's ok because the walsender
>>>>> and the archiver doesn't send stats to the stats collector now.)
>>>>>
>>>>> But, there might be a corner case to lose stats sent by background
>>>>> workers like
>>>>> the checkpointer before they exit (although this is not implemented
>>>>> yet.)
>>>>>
>>>>> For example,
>>>>>
>>>>> 1. checkpointer send the stats before it exit
>>>>> 2. stats collector receive the signal and break before processing
>>>>> the stats message from checkpointer. In this case, 1's message
>>>>> is lost.
>>>>> 3. stats collector writes the stats in the statsfiles and exit
>>>>>
>>>>> Why don't we recheck that no messages are still incoming just before
>>>>> step 2?
>>>>> (v17-0004-guarantee-to-collect-last-stats-messages.patch)
>>>>
>>>> Yes, I was thinking the same. This is the straight-forward fix for
>>>> this issue.
>>>> The stats collector should process all the outstanding messages when
>>>> normal shutdown is requested, as the patch does. On the other hand,
>>>> if immediate shutdown is requested or emergency bailout (by
>>>> postmaster death)
>>>> is requested, maybe the stats collector should skip those
>>>> processings
>>>> and exit immediately.
>>>>
>>>> But if we implement that, we would need to teach the stats collector
>>>> the shutdown type (i.e., normal shutdown or immediate one). Because
>>>> currently SIGQUIT is sent to the collector whichever shutdown is
>>>> requested,
>>>> and so the collector cannot distinguish the shutdown type. I'm
>>>> afraid that
>>>> change is a bit overkill for now.
>>>>
>>>> BTW, I found that the collector calls pgstat_write_statsfiles() even
>>>> at
>>>> emergency bailout case, before exiting. It's not necessary to save
>>>> the stats to the file in that case because subsequent server startup
>>>> does
>>>> crash recovery and clears that stats file. So it's better to make
>>>> the collector exit immediately without calling
>>>> pgstat_write_statsfiles()
>>>> at emergency bailout case? Probably this should be discussed in
>>>> other
>>>> thread because it's different topic from the feature we're
>>>> discussing here,
>>>> though.
>>>
>>> IIUC, only the stats collector has its own handler for SIGQUIT,
>>> whereas
>>> other background processes share a common handler that just calls
>>> _exit(2).
>>> I thought we should guarantee that, when TerminateChildren(SIGTERM) is
>>> invoked, the stats
>>> collector doesn't shut down before the other background processes do.
>>>
>>> I will make another thread to discuss that the stats collector should
>>> know the shutdown type or not.
>>> If it should be, it's better to make the stats collector exit as soon
>>> as possible if the shutdown type
>>> is an immediate, and avoid losing the remaining stats if it's normal.
>>
>> +1

I looked into past discussions about writing the stats files when an
immediate shutdown is requested, and I found the following thread ([1]),
although the discussion stopped on 12/1/2016.

IIUC, the points of consensus in that thread are:

(1) In some cases the stats collector needs to be killed promptly, before
it writes the stats file, because otherwise it may take a long time until
failover happens. Possible reasons are that disk writes are slow, the
stats files are big, and so on.

(2) The behavior of removing all stats files when the startup process does
crash recovery needs to change, because autovacuum uses the stats.

(3) It's ok for the stats collector to exit without calling
pgstat_write_statsfiles() if the stats file is written every X minutes
(using WAL or another mechanism) and the startup process can restore
slightly stale stats.

(4) We need to find a way to handle the stats file from (2) when it is
deleted on PITR rewind or when the stats collector crashes.

So I need to ping that thread, but I don't have any idea how to handle (4)
yet...

[1]
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F5EF25A%40G01JPEXMBYT05

>>
>>>
>>>
>>>
>>>>> I measured the timing of the above in my linux laptop using
>>>>> v17-measure-timing.patch.
>>>>> I don't have any strong opinion to handle this case since this
>>>>> result shows to receive and processes
>>>>> the messages takes too short time (less than 1ms) although the
>>>>> stats collector receives the shutdown
>>>>> signal in 5msec(099->104) after the checkpointer process exits.
>>>>
>>>> Agreed.
>>>>
>>>>>
>>>>> ```
>>>>> 1615421204.556 [checkpointer] DEBUG: received shutdown request
>>>>> signal
>>>>> 1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to
>>>>> make              # exit and send the messages
>>>>> 1615421208.099 [stats collector] DEBUG: process BGWRITER stats
>>>>> message              # receive and process the messages
>>>>> 1615421208.099 [stats collector] DEBUG: process WAL stats message
>>>>> 1615421208.104 [postmaster] DEBUG: reaping dead processes
>>>>> 1615421208.104 [stats collector] DEBUG: received shutdown request
>>>>> signal             # receive shutdown request from the postmaster
>>>>> ```
>>>>>
>>>>>>>> Of course, there is another direction; we can improve the stats
>>>>>>>> collector so
>>>>>>>> that it guarantees to collect all the sent stats messages. But
>>>>>>>> I'm afraid
>>>>>>>> this change might be big.
>>>>>>>
>>>>>>> For example, implement to manage background process status in
>>>>>>> shared memory and
>>>>>>> the stats collector collects the stats until another background
>>>>>>> process exits?
>>>>>>>
>>>>>>> In my understanding, the statistics are not required high
>>>>>>> accuracy,
>>>>>>> it's ok to ignore them if the impact is not big.
>>>>>>>
>>>>>>> If we guarantee high accuracy, another background process like
>>>>>>> autovacuum launcher
>>>>>>> must send the WAL stats because it accesses the system catalog
>>>>>>> and might generate
>>>>>>> WAL records due to HOT update even though the possibility is low.
>>>>>>>
>>>>>>> I thought the impact is small because the time uncollected stats
>>>>>>> are generated is
>>>>>>> short compared to the time from startup. So, it's ok to ignore
>>>>>>> the remaining stats
>>>>>>> when the process exists.
>>>>>>
>>>>>> I agree that it's not worth changing lots of code to collect such
>>>>>> stats.
>>>>>> But if we can implement that very simply, isn't it more worth
>>>>>> doing
>>>>>> that than current situation because we may be able to collect more
>>>>>> accurate stats.
>>>>>
>>>>> Yes, I agree.
>>>>> I attached the patch to send the stats before the wal writer and
>>>>> the checkpointer exit.
>>>>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch,
>>>>> v17-0002-send-stats-for-checkpointer-when-shutdown.patch)
>>>>
>>>> Thanks for making those patches! Firstly I'm reading 0001 and 0002
>>>> patches.
>>>
>>> Thanks for your comments and for making patches.
>>>
>>>
>>>> Here is the review comments for 0001 patch.
>>>>
>>>> +/* Prototypes for private functions */
>>>> +static void HandleWalWriterInterrupts(void);
>>>>
>>>> HandleWalWriterInterrupts() and HandleMainLoopInterrupts() are
>>>> almost the same.
>>>> So I don't think that we need to introduce
>>>> HandleWalWriterInterrupts(). Instead,
>>>> we can just call pgstat_send_wal(true) before
>>>> HandleMainLoopInterrupts()
>>>> if ShutdownRequestPending is true in the main loop. Attached is the
>>>> patch
>>>> I implemented that way. Thought?
>>>
>>> I thought there is a corner case that can't send the stats like
>>
>> You're right! So IMO your patch
>> (v17-0001-send-stats-for-walwriter-when-shutdown.patch) is better.
>>>
>>> ```
>>> // First, ShutdownRequestPending = false
>>>
>>>      if (ShutdownRequestPending)    // don't send the stats
>>>          pgstat_send_wal(true);
>>>
>>> // receive signal and ShutdownRequestPending became true
>>>
>>>      HandleMainLoopInterrupts();   // proc exit without sending the
>>> stats
>>>
>>> ```
>>>
>>> Is it ok because it almost never occurs?
>>>
>>>
>>>> Here is the review comments for 0002 patch.
>>>>
>>>> +static void pgstat_send_checkpointer(void);
>>>>
>>>> I'm inclined to avoid adding the function with the prefix "pgstat_"
>>>> outside
>>>> pgstat.c. Instead, I'm ok to just call both pgstat_send_bgwriter()
>>>> and
>>>> pgstat_report_wal() directly after ShutdownXLOG(). Thought? Patch
>>>> attached.
>>>
>>> Thanks. I agree.
>>
>> Thanks for the review!
>>
>>
>> So, barring any objection, I will commit the changes for
>> walwriter and checkpointer. That is,
>>
>> v17-0001-send-stats-for-walwriter-when-shutdown.patch
>> v17-0002-send-stats-for-checkpointer-when-shutdown_fujii.patch
>
> I pushed these two patches.

Thanks a lot!

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-19 07:30:04
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/15 10:39, Masahiro Ikeda wrote:
> Thanks, I understood get_sync_bit() checks the sync flags and
> the write unit of generated wal data and replicated wal data is different.
> (It's interesting optimization whether to use kernel cache or not.)
>
> OK. Although I agree with separating the stats for the walreceiver,
> I want to hear opinions from other people too, so I didn't change the patch.
>
> Please feel free to comment.

What about applying the patch for the common WAL write function,
XLogWriteFile(), separately from the patch for walreceiver's stats?
The former seems to have reached consensus, so we can commit it first.
Also, the former change is useful even on its own because it allows
walreceiver to report the WALWrite wait event.

> OK. I agree.
>
> I wonder whether the error checks should differ depending on who calls this function.
> Currently, only the walreceiver checks (1) errno==0 and doesn't check (2) errno==EINTR.
> Other processes do the opposite.
>
> IIUC, it's appropriate for every process to check both (1) and (2).
> Please let me know if my understanding is wrong.

I'm thinking the same. Regarding (2), commit 79ce29c734 introduced
that code. According to the following commit log, it seems harmless
to retry on EINTR even in walreceiver.

Also retry on EINTR. All signals used in the backend are flagged SA_RESTART
nowadays, so it shouldn't happen, but better to be defensive.

>> BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
>> reports an error. But this is useless. Probably IO timing should be
>> incremented after the return code of pg_pwrite() is checked, instead?
>
> Yes, I agree. I fixed it.
> (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for the patch!

nleft = nbytes;
do
{
- errno = 0;
+ written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset,
+ ThisTimeLineID, openLogSegNo, wal_segment_size);

Can we merge this do-while loop in XLogWrite() into the loop
in XLogWriteFile()?

If we do that, ISTM that the following code is not necessary in XLogWrite().

nleft -= written;
from += written;

+ * 'segsize' is a segment size of WAL segment file.

Since segsize is always wal_segment_size, segsize argument seems
not necessary in XLogWriteFile().

+XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset,
+ TimeLineID timelineid, XLogSegNo segno, int segsize)

Why did you use "const void *" instead of "char *" for *buf?

Regarding 0005 patch, I will review it later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	ikedamsh <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-22 00:50:45
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021-03-19 16:30, Fujii Masao wrote:
> On 2021/03/15 10:39, Masahiro Ikeda wrote:
>> Thanks, I understood get_sync_bit() checks the sync flags and
>> the write unit of generated wal data and replicated wal data is
>> different.
>> (It's interesting optimization whether to use kernel cache or not.)
>>
>> OK. Although I agree with separating the stats for the walreceiver,
>> I want to hear opinions from other people too, so I didn't change the
>> patch.
>>
>> Please feel free to comment.
>
> What about applying the patch for common WAL write function like
> XLogWriteFile(), separately from the patch for walreceiver's stats?
> Seems the former reaches the consensus, so we can commit it firstly.
> Also even only the former change is useful because which allows
> walreceiver to report WALWrite wait event.

Agreed. I separated the patches.

If only the former is committed, my minor concern is that the standby server
gets a disadvantage with no advantage. Calling INSTR_TIME_SET_CURRENT() may
degrade the performance of the wal receiver, yet the stats won't be visible
to users until the latter patch is committed.

I think it's ok, because this doesn't happen when "track_wal_io_timing" is
disabled on the standby server. Some users may start the standby server from
a backup of a primary server on which "track_wal_io_timing" is enabled, but
they will probably consider it acceptable, since they have already accepted
the performance degradation on the primary server.
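
As a rough illustration of why the overhead only exists when the parameter
is on, here is a standalone sketch. It is not the patch's code: WalIoStats,
now_us() and timed_pwrite() are hypothetical names, clock_gettime() stands
in for INSTR_TIME_SET_CURRENT(), and the track_timing argument stands in for
"track_wal_io_timing". Counting only successful writes is my assumption,
following the earlier review comment that timing should be added after the
return code is checked.

```c
#define _XOPEN_SOURCE 700       /* for pwrite() and clock_gettime() */
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

typedef struct WalIoStats
{
    uint64_t    writes;         /* successful write calls */
    uint64_t    write_time_us;  /* time spent in them; stays 0 if tracking is off */
} WalIoStats;

/* Current monotonic time in microseconds. */
static int64_t
now_us(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

ssize_t
timed_pwrite(WalIoStats *st, bool track_timing,
             int fd, const void *buf, size_t n, off_t off)
{
    int64_t     start = 0;
    ssize_t     rc;

    /* the clock is read only when tracking is enabled */
    if (track_timing)
        start = now_us();

    rc = pwrite(fd, buf, n, off);
    if (rc < 0)
        return rc;              /* failed: leave the stats untouched */

    if (track_timing)
        st->write_time_us += (uint64_t) (now_us() - start);
    st->writes++;
    return rc;
}
```

With track_timing false, the only extra cost per write is the counter
increment.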

>> OK. I agree.
>>
>> I wonder whether the error checks should differ depending on who calls
>> this function.
>> Currently, only the walreceiver checks (1) errno==0 and doesn't check
>> (2) errno==EINTR.
>> Other processes do the opposite.
>>
>> IIUC, it's appropriate for every process to check both (1) and (2).
>> Please let me know if my understanding is wrong.
>
> I'm thinking the same. Regarding (2), commit 79ce29c734 introduced
> that code. According to the following commit log, it seems harmless
> to retry on EINTR even walreceiver.
>
> Also retry on EINTR. All signals used in the backend are flagged
> SA_RESTART
> nowadays, so it shouldn't happen, but better to be defensive.

Thanks, I understood.

>>> BTW, currently XLogWrite() increments IO timing even when pg_pwrite()
>>> reports an error. But this is useless. Probably IO timing should be
>>> incremented after the return code of pg_pwrite() is checked, instead?
>>
>> Yes, I agree. I fixed it.
>> (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
>
> Thanks for the patch!
>
> nleft = nbytes;
> do
> {
> - errno = 0;
> + written = XLogWriteFile(openLogFile, from, nleft, (off_t)
> startoffset,
> + ThisTimeLineID, openLogSegNo, wal_segment_size);
>
> Can we merge this do-while loop in XLogWrite() into the loop
> in XLogWriteFile()?
> If we do that, ISTM that the following codes are not necessary in
> XLogWrite().
>
> nleft -= written;
> from += written;

OK, I fixed it.

> + * 'segsize' is a segment size of WAL segment file.
>
> Since segsize is always wal_segment_size, segsize argument seems
> not necessary in XLogWriteFile().

Right. I fixed it.

> +XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset,
> + TimeLineID timelineid, XLogSegNo segno, int segsize)
>
> Why did you use "const void *" instead of "char *" for *buf?

I followed the argument type of pg_pwrite().
But I think "char *" is better, so I fixed it.

> Regarding 0005 patch, I will review it later.

Thanks.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v19-0003-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-diff	962 bytes
v19-0006-merge-wal-write-function.patch	text/x-diff	6.3 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	ikedamsh <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-22 07:50:52
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/22 9:50, ikedamsh wrote:
> Agreed. I separated the patches.
>
> If only the former is committed, my trivial concern is that there may be
> a disadvantage, but no advantage for the standby server. It may lead to
> performance degradation to the wal receiver by calling
> INSTR_TIME_SET_CURRENT(), but the stats won't be visible to users until the
> latter patch is committed.

Your concern is valid, so let's also polish and commit the 0003 patch for v14.
I still think that it's better to separate the wal_xxx columns into
walreceiver's and the others. But if we count even walreceiver activity in
the existing columns, do we need to update the documentation for the 0003 patch?
For example, "Number of times WAL buffers were written out to disk via
XLogWrite request." should be "Number of times WAL buffers were written
out to disk via XLogWrite request and by WAL receiver process."? Maybe
we need to add some description of this to the "WAL configuration"
section?

> I followed the argument of pg_pwrite().
> But, I think "char *" is better, so fixed it.

Thanks for updating the patch!

+extern int XLogWriteFile(int fd, char *buf,
+ size_t nbyte, off_t offset,
+ TimeLineID timelineid, XLogSegNo segno,
+ bool write_all);

write_all seems not to be necessary. You added this flag for walreceiver,
I guess. But even without the argument, walreceiver seems to work as expected.
So, what about the attached patch? I applied some cosmetic changes to the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment	Content-Type	Size
v20-0006-merge-wal-write-function.patch	text/plain	6.4 KB

From:	ikedamsh <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-22 11:25:45
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/22 16:50, Fujii Masao wrote:
>
>
> On 2021/03/22 9:50, ikedamsh wrote:
>> Agreed. I separated the patches.
>>
>> If only the former is committed, my trivial concern is that there may be
>> a disadvantage, but no advantage for the standby server. It may lead to
>> performance degradation to the wal receiver by calling
>> INSTR_TIME_SET_CURRENT(), but the stats won't be visible to users until the
>> latter patch is committed.
>
> Your concern is valid, so let's also polish and commit the 0003 patch for v14.
> I still think that it's better to separate the wal_xxx columns into
> walreceiver's and the others. But if we count even walreceiver activity in
> the existing columns, do we need to update the documentation for the 0003 patch?
> For example, "Number of times WAL buffers were written out to disk via
> XLogWrite request." should be "Number of times WAL buffers were written
> out to disk via XLogWrite request and by WAL receiver process."? Maybe
> we need to add some description of this to the "WAL configuration"
> section?

Agreed. Users can tell whether the stats are for walreceiver or not: the
pg_stat_wal view on a standby server shows the stats for the walreceiver, and
on the primary server it shows the stats for the other processes. So, I updated
the document.
(v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

>> I followed the argument of pg_pwrite().
>> But, I think "char *" is better, so fixed it.
>
> Thanks for updating the patch!
>
> +extern int    XLogWriteFile(int fd, char *buf,
> +                          size_t nbyte, off_t offset,
> +                          TimeLineID timelineid, XLogSegNo segno,
> +                          bool write_all);
>
> write_all seems not to be necessary. You added this flag for walreceiver,
> I guess. But even without the argument, walreceiver seems to work as expected.
> So, what about the attached patch? I applied some cosmetic changes to the patch.

Thanks a lot. Yes, "write_all" is unnecessary.
Your patch looks good to me.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment	Content-Type	Size
v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch	text/x-patch	3.8 KB

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	ikedamsh <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-23 07:10:58
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/22 20:25, ikedamsh wrote:
> Agreed. Users can tell whether the stats are for walreceiver or not: the
> pg_stat_wal view on a standby server shows the stats for the walreceiver, and
> on the primary server it shows the stats for the other processes. So, I updated
> the document.
> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

Thanks for updating the docs!

There was a discussion about when the stats collector is invoked, at [1].
Currently, during archive recovery or standby, the stats collector is
invoked when the startup process reaches the consistent state and sends
PMSIGNAL_BEGIN_HOT_STANDBY, after which the system starts accepting
read-only connections. But walreceiver can be invoked at an earlier stage.
This can cause walreceiver to generate and send statistics about WAL
writing even though the stats collector is not running yet. This might
be problematic? If so, maybe we need to ensure that the stats collector is
invoked before walreceiver?

During recovery, the stats collector is not invoked if hot standby mode is
disabled. But walreceiver can be running in this case. So probably we should
change walreceiver so that it's invoked even when hot standby is disabled?
Otherwise we cannot collect the statistics about WAL writing by walreceiver
in that case.

[1]
https://postgr.es/m/[email protected]

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

From:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-25 02:50:12
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/23 16:10, Fujii Masao wrote:
>
>
> On 2021/03/22 20:25, ikedamsh wrote:
>> Agreed. Users can tell whether the stats are for walreceiver or not: the
>> pg_stat_wal view on a standby server shows the stats for the walreceiver, and
>> on the primary server it shows the stats for the other processes. So, I updated
>> the document.
>> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
>
> Thanks for updating the docs!
>
> There was a discussion about when the stats collector is invoked, at [1].
> Currently, during archive recovery or standby, the stats collector is
> invoked when the startup process reaches the consistent state and sends
> PMSIGNAL_BEGIN_HOT_STANDBY, after which the system starts accepting
> read-only connections. But walreceiver can be invoked at an earlier stage.
> This can cause walreceiver to generate and send statistics about WAL
> writing even though the stats collector is not running yet. This might
> be problematic? If so, maybe we need to ensure that the stats collector is
> invoked before walreceiver?
>
> During recovery, the stats collector is not invoked if hot standby mode is
> disabled. But walreceiver can be running in this case. So probably we should
> change walreceiver so that it's invoked even when hot standby is disabled?
> Otherwise we cannot collect the statistics about WAL writing by walreceiver
> in that case.
>
> [1]
> https://postgr.es/m/[email protected]

Thanks for the comments! I didn't notice that.
As I mentioned in [1], if my understanding is right, this issue is not
specific to the wal receiver.

Since the shared memory stats thread already handles these issues, should this
patch, which collects the stats for the wal receiver and makes a common function
for writing WAL files, be committed after the shared memory stats patches are
committed? Or should we handle the issues in this thread, because we don't know
when the shared memory stats patches will be committed?

I think the former is better, because collecting stats in shared memory is a
very useful feature for users and it makes a big change in the design. So, I
think it's beneficial to make an effort to move the shared memory stats thread
forward (by reviewing or testing) instead of handling the issues in this thread.

[1]
https://www.postgresql.org/message-id/9f4e19ad-518d-b91a-e500-25a666471c42%40oss.nttdata.com

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From:	Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To:	Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc:	"David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Li Japin <japinli(at)hotmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com
Subject:	Re: About to add WAL write/fsync statistics to pg_stat_wal view
Date:	2021-03-25 13:06:32
Message-ID:	[email protected]
Lists:	pgsql-hackers

On 2021/03/25 11:50, Masahiro Ikeda wrote:
>
>
> On 2021/03/23 16:10, Fujii Masao wrote:
>>
>>
>> On 2021/03/22 20:25, ikedamsh wrote:
>>> Agreed. Users can tell whether the stats are for walreceiver or not: the
>>> pg_stat_wal view on a standby server shows the stats for the walreceiver, and
>>> on the primary server it shows the stats for the other processes. So, I updated
>>> the document.
>>> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
>>
>> Thanks for updating the docs!
>>
>> There was a discussion about when the stats collector is invoked, at [1].
>> Currently, during archive recovery or standby, the stats collector is
>> invoked when the startup process reaches the consistent state and sends
>> PMSIGNAL_BEGIN_HOT_STANDBY, after which the system starts accepting
>> read-only connections. But walreceiver can be invoked at an earlier stage.
>> This can cause walreceiver to generate and send statistics about WAL
>> writing even though the stats collector is not running yet. This might
>> be problematic? If so, maybe we need to ensure that the stats collector is
>> invoked before walreceiver?
>>
>> During recovery, the stats collector is not invoked if hot standby mode is
>> disabled. But walreceiver can be running in this case. So probably we should
>> change walreceiver so that it's invoked even when hot standby is disabled?
>> Otherwise we cannot collect the statistics about WAL writing by walreceiver
>> in that case.
>>
>> [1]
>> https://postgr.es/m/[email protected]
>
> Thanks for the comments! I didn't notice that.
> As I mentioned in [1], if my understanding is right, this issue is not
> specific to the wal receiver.
>
> Since the shared memory stats thread already handles these issues, should this
> patch, which collects the stats for the wal receiver and makes a common function
> for writing WAL files, be committed after the shared memory stats patches are
> committed? Or should we handle the issues in this thread, because we don't know
> when the shared memory stats patches will be committed?
>
> I think the former is better, because collecting stats in shared memory is a
> very useful feature for users and it makes a big change in the design. So, I
> think it's beneficial to make an effort to move the shared memory stats thread
> forward (by reviewing or testing) instead of handling the issues in this thread.

Sounds reasonable. Agreed.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION