From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Enhancing Memory Context Statistics Reporting
Date: 2024-10-21 18:24:21
Message-ID: CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
Lists: pgsql-hackers
Hi,
PostgreSQL provides the following capabilities for reporting memory context
statistics:
1. pg_get_backend_memory_contexts(); [1]
2. pg_log_backend_memory_contexts(pid); [2]
[1] provides a view of memory context statistics for a local backend, while
[2] prints the memory context statistics of any backend or auxiliary process
to the PostgreSQL logs. Although [1] offers detailed statistics, it is limited
to the local backend, restricting its use to PostgreSQL client backends only.
On the other hand, [2] provides the statistics for all backends but logs them
in a file, which may not be convenient for quick access.
I propose enhancing memory context statistics reporting by combining these
capabilities and offering a view of memory statistics for all PostgreSQL
backends and auxiliary processes.
Attached is a patch that implements this functionality. It introduces a SQL
function that takes the PID of a backend as an argument, returning a set of
records, each containing statistics for a single memory context. The
underlying C function sends a signal to the backend and waits for it to
publish its memory context statistics before returning them to the user. The
publishing backend copies these statistics during the next
CHECK_FOR_INTERRUPTS call.
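For illustration, here is a minimal sketch of that request/publish flow. The
ProcSignalReason, handler name and pending flag used below
(PROCSIG_GET_MEMORY_CONTEXT, HandleGetMemoryContextInterrupt,
PublishMemoryContextPending) are placeholders modelled on the existing
pg_log_backend_memory_contexts() machinery, not necessarily what the patch
uses:

```c
/* Requesting backend: ask the target process to publish its statistics.
 * PROCSIG_GET_MEMORY_CONTEXT is a hypothetical ProcSignalReason. */
SendProcSignal(target_pid, PROCSIG_GET_MEMORY_CONTEXT, procNumber);

/* Target process, signal handler side: only set a pending flag; the latch
 * is set by the ProcSignal machinery. */
void
HandleGetMemoryContextInterrupt(void)       /* hypothetical name */
{
    InterruptPending = true;
    PublishMemoryContextPending = true;     /* hypothetical flag */
}

/* Target process, inside ProcessInterrupts(), i.e. at the next
 * CHECK_FOR_INTERRUPTS(): walk the context tree and copy the stats out. */
if (PublishMemoryContextPending)
{
    PublishMemoryContextPending = false;
    ProcessGetMemoryContextInterrupt();     /* hypothetical publish routine */
}
```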
This approach facilitates on-demand publication of memory statistics for a
specific backend, rather than collecting them at regular intervals. Since past
memory context statistics may no longer be relevant, there is little value in
retaining historical data. Any collected statistics can be discarded once read
by the client backend.
A fixed-size shared memory block, currently accommodating 30 records, is used
to store the statistics. This number was chosen arbitrarily, as it covers all
parent contexts at level 1 (i.e., direct children of the top memory context)
based on my tests. Further experiments are needed to determine the optimal
number for summarizing memory statistics.
Any additional statistics that exceed the shared memory capacity are written
to a file per backend in the PG_TEMP_FILES_DIR. The client backend first reads
from the shared memory, and if necessary, retrieves the remaining data from
the file, combining everything into a unified view. The files are cleaned up
automatically if a backend crashes or during server restarts.
The statistics are reported in a breadth-first search order of the memory
context tree, with parent contexts reported before their children. This
provides a cumulative summary before diving into the details of each child
context's consumption.
The rationale behind the shared memory chunk is to ensure that the majority of
contexts that are direct children of TopMemoryContext fit into memory. This
allows a client to request a summary of memory statistics, which can be served
from memory without the overhead of file access, unless necessary.
A publishing backend signals waiting client backends using a condition
variable when it has finished writing its statistics to memory. The client
backend checks whether the statistics belong to the requested backend. If not,
it continues waiting on the condition variable, timing out after 2 minutes.
This timeout is an arbitrary choice, and further work is required to determine
a more practical value.
All backends use the same memory space to publish their statistics. Before
publishing, a backend checks whether the previous statistics have been
successfully read by a client using a shared flag, "in_use". This flag is set
by the publishing backend and cleared by the client backend once the data is
read. If a backend cannot publish due to shared memory being occupied, it
exits the interrupt processing code, and the client backend times out with a
warning.
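A condensed sketch of both sides of this protocol is shown below, assuming the
shared state exposes in_use, proc_id and a condition variable; the lock name,
wait event and field names are illustrative rather than taken from the patch:

```c
/* Publishing backend (interrupt processing): skip if the previous report has
 * not been consumed yet, otherwise copy the stats and wake any waiters. */
LWLockAcquire(MemCtxLock, LW_EXCLUSIVE);            /* illustrative lock */
if (memCtxState->in_use)
{
    LWLockRelease(MemCtxLock);
    return;                     /* the client will time out with a warning */
}
memCtxState->in_use = true;
memCtxState->proc_id = MyProcPid;
/* ... copy MemoryContextInfo entries into memCtxState->memctx_infos ... */
LWLockRelease(MemCtxLock);
ConditionVariableBroadcast(&memCtxState->memctx_cv);

/* Client backend: wait until the published stats belong to the requested PID,
 * giving up after the (currently arbitrary) two-minute timeout. */
ConditionVariablePrepareToSleep(&memCtxState->memctx_cv);
while (memCtxState->proc_id != target_pid)
{
    if (ConditionVariableTimedSleep(&memCtxState->memctx_cv,
                                    2 * 60 * 1000L,             /* ms */
                                    WAIT_EVENT_MEM_CXT_PUBLISH  /* hypothetical */ ))
    {
        ereport(WARNING,
                (errmsg("timed out waiting for memory statistics from PID %d",
                        target_pid)));
        break;
    }
}
ConditionVariableCancelSleep();
/* ... read the statistics, then clear the flag so others can publish ... */
memCtxState->in_use = false;
```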
Please find below an example query to fetch memory contexts from the backend
with PID 116292. The second argument, 'get_summary', is 'false', indicating a
request for statistics of all the contexts.
postgres=# select * FROM pg_get_remote_backend_memory_contexts('116292', false) LIMIT 2;
-[ RECORD 1 ]-+----------------------
name | TopMemoryContext
ident |
type | AllocSet
path | {0}
total_bytes | 97696
total_nblocks | 5
free_bytes | 15376
free_chunks | 11
used_bytes | 82320
pid | 116292
-[ RECORD 2 ]-+----------------------
name | RowDescriptionContext
ident |
type | AllocSet
path | {0,1}
total_bytes | 8192
total_nblocks | 1
free_bytes | 6912
free_chunks | 0
used_bytes | 1280
pid | 116292
TODO:
1. Determine the behaviour when the statistics don't fit in one file.
[1] PostgreSQL: Re: Creating a function for exposing memory usage of backend process
<https://www.postgresql.org/message-id/0a768ae1-1703-59c7-86cc-7068ff5e318c%40oss.nttdata.com>
[2] PostgreSQL: Re: Get memory contexts of an arbitrary backend process
<https://www.postgresql.org/message-id/bea016ad-d1a7-f01d-a7e8-01106a1de77f%40oss.nttdata.com>
Thank you,
Rahila Syed
Attachment: 0001-Function-to-report-memory-context-stats-of-any-backe.patch (application/octet-stream, 33.8 KB)
From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-22 06:48:39
Message-ID: [email protected]
Lists: pgsql-hackers
On Mon, Oct 21, 2024 at 11:54:21PM +0530, Rahila Syed wrote:
> On the other hand, [2] provides the statistics for all backends but logs
> them in a file, which may not be convenient for quick access.
To be precise, pg_log_backend_memory_contexts() pushes the memory
context stats to LOG_SERVER_ONLY or stderr, hence this is appended to
the server logs.
> A fixed-size shared memory block, currently accommodating 30 records,
> is used to store the statistics. This number was chosen arbitrarily,
> as it covers all parent contexts at level 1 (i.e., direct children of the
> top memory context)
> based on my tests.
> Further experiments are needed to determine the optimal number
> for summarizing memory statistics.
+ * Statistics are shared via fixed shared memory which
+ * can hold statistics for 29 contexts. The rest of the
[...]
+ MemoryContextInfo memctx_infos[30];
[...]
+ memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
[...]
+ size = add_size(size, mul_size(30, sizeof(MemoryContextInfo)));
[...]
+ memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
[...]
+ memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
This number is tied to MemoryContextState added by the patch. Sounds
like this would be better as a constant properly defined rather than
hardcoded in all these places. This would make the upper-bound more
easily switchable in the patch.
+ Datum path[128];
+ char type[128];
[...]
+ char name[1024];
+ char ident[1024];
+ char type[128];
+ Datum path[128];
Again, constants. Why these values? You may want to use more
#defines here.
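For illustration, the suggested cleanup could look roughly like this;
MAX_MEMCTX_SLOTS, MEM_CONTEXT_MAX_LEVELS and MEM_CONTEXT_TYPE_LEN are made-up
names, while MEMORY_CONTEXT_IDENT_DISPLAY_SIZE is the existing define
mentioned later in this thread:

```c
#define MAX_MEMCTX_SLOTS        30      /* entries kept in fixed shared memory */
#define MEM_CONTEXT_MAX_LEVELS  128     /* max tree depth recorded in path[] */
#define MEM_CONTEXT_TYPE_LEN    128

typedef struct MemoryContextInfo
{
    char        name[MEMORY_CONTEXT_IDENT_DISPLAY_SIZE];
    char        ident[MEMORY_CONTEXT_IDENT_DISPLAY_SIZE];
    char        type[MEM_CONTEXT_TYPE_LEN];
    Datum       path[MEM_CONTEXT_MAX_LEVELS];
    /* ... byte counters, block counts, pid, etc. ... */
} MemoryContextInfo;

typedef struct MemoryContextState
{
    int         proc_id;
    bool        in_use;
    MemoryContextInfo memctx_infos[MAX_MEMCTX_SLOTS];
} MemoryContextState;

/* Sizing and initialization then reuse the same constant: */
size = add_size(size, mul_size(MAX_MEMCTX_SLOTS, sizeof(MemoryContextInfo)));
memset(&memCtxState->memctx_infos, 0,
       MAX_MEMCTX_SLOTS * sizeof(MemoryContextInfo));
```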
> Any additional statistics that exceed the shared memory capacity
> are written to a file per backend in the PG_TEMP_FILES_DIR. The client
> backend
> first reads from the shared memory, and if necessary, retrieves the
> remaining data from the file,
> combining everything into a unified view. The files are cleaned up
> automatically
> if a backend crashes or during server restarts.
Is the addition of the file to write any remaining stats really that
useful? This comes with a heavy cost in the patch with the "in_use"
flag, the various tweaks around the LWLock release/acquire protecting
the shmem area and the extra cleanup steps required after even a clean
restart. That's a lot of facility for this kind of information.
Another thing that may be worth considering is to put this information
in a DSM per the variable-size nature of the information, perhaps cap
it to a max to make the memory footprint cheaper, and avoid all
on-disk footprint because we don't need it to begin with as this is
information that makes sense only while the server is running.
Also, why the single-backend limitation? One could imagine a shared
memory area indexed similarly to pgproc entries, that includes
auxiliary processes as much as backends, so as it can be possible to
get more memory footprints through SQL for more than one single
process at one moment in time. If each backend has its own area of
shmem to deal with, they could use a shared LWLock on the shmem area
with an extra spinlock while the context data is dumped into memory as
the copy is short-lived. Each one of them could save the information
in a DSM created only when a dump of the shmem is requested for a
given PID, for example.
--
Michael
From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-23 03:58:12
Message-ID: [email protected]
Lists: pgsql-hackers
On 2024-10-22 03:24, Rahila Syed wrote:
> Hi,
>
> PostgreSQL provides following capabilities for reporting memory
> contexts statistics.
> 1. pg_get_backend_memory_contexts(); [1]
> 2. pg_log_backend_memory_contexts(pid); [2]
>
> [1] provides a view of memory context statistics for a local backend,
> while [2] prints the memory context statistics of any backend or
> auxiliary
> process to the PostgreSQL logs. Although [1] offers detailed
> statistics,
> it is limited to the local backend, restricting its use to PostgreSQL
> client backends only.
> On the other hand, [2] provides the statistics for all backends but
> logs them in a file,
> which may not be convenient for quick access.
>
> I propose enhancing memory context statistics reporting by combining
> these
> capabilities and offering a view of memory statistics for all
> PostgreSQL backends
> and auxiliary processes.
Thanks for working on this!
I originally tried to develop something like your proposal in [2], but
there were some difficulties and settled down to implement
pg_log_backend_memory_contexts().
> Attached is a patch that implements this functionality. It introduces
> a SQL function
> that takes the PID of a backend as an argument, returning a set of
> records,
> each containing statistics for a single memory context. The underlying
> C function
> sends a signal to the backend and waits for it to publish its memory
> context statistics
> before returning them to the user. The publishing backend copies
> these statistics
> during the next CHECK_FOR_INTERRUPTS call.
I remember that waiting for the memory context stats dump could cause trouble
in some erroneous cases.
For example, just after the target process finished dumping stats,
pg_get_remote_backend_memory_contexts() caller is terminated before
reading the stats, calling pg_get_remote_backend_memory_contexts() has
no response any more:
[session1]$ psql
(40699)=#
$ kill -s SIGSTOP 40699
[session2] psql
(40866)=# select * FROM
pg_get_remote_backend_memory_contexts('40699', false); -- waiting
$ kill -s SIGSTOP 40866
$ kill -s SIGCONT 40699
[session3] psql
(47656) $ select pg_terminate_backend(40866);
$ kill -s SIGCONT 40866 -- session2 terminated
[session3] (47656)=# select * FROM
pg_get_remote_backend_memory_contexts('47656', false); -- no response
It seems the reason is that memCtxState->in_use is still true and
memCtxState->proc_id is 40699.
We can continue to use pg_get_remote_backend_memory_contexts() after
specifying 40699, but it'd be hard to understand for users.
> This approach facilitates on-demand publication of memory statistics
> for a specific backend, rather than collecting them at regular
> intervals.
> Since past memory context statistics may no longer be relevant,
> there is little value in retaining historical data. Any collected
> statistics
> can be discarded once read by the client backend.
>
> A fixed-size shared memory block, currently accommodating 30 records,
> is used to store the statistics. This number was chosen arbitrarily,
> as it covers all parent contexts at level 1 (i.e., direct children of
> the top memory context)
> based on my tests.
> Further experiments are needed to determine the optimal number
> for summarizing memory statistics.
>
> Any additional statistics that exceed the shared memory capacity
> are written to a file per backend in the PG_TEMP_FILES_DIR. The client
> backend
> first reads from the shared memory, and if necessary, retrieves the
> remaining data from the file,
> combining everything into a unified view. The files are cleaned up
> automatically
> if a backend crashes or during server restarts.
>
> The statistics are reported in a breadth-first search order of the
> memory context tree,
> with parent contexts reported before their children. This provides a
> cumulative summary
> before diving into the details of each child context's consumption.
>
> The rationale behind the shared memory chunk is to ensure that the
> majority of contexts which are the direct children of
> TopMemoryContext,
> fit into memory
> This allows a client to request a summary of memory statistics,
> which can be served from memory without the overhead of file access,
> unless necessary.
>
> A publishing backend signals waiting client backends using a condition
>
> variable when it has finished writing its statistics to memory.
> The client backend checks whether the statistics belong to the
> requested backend.
> If not, it continues waiting on the condition variable, timing out
> after 2 minutes.
> This timeout is an arbitrary choice, and further work is required to
> determine
> a more practical value.
>
> All backends use the same memory space to publish their statistics.
> Before publishing, a backend checks whether the previous statistics
> have been
> successfully read by a client using a shared flag, "in_use."
> This flag is set by the publishing backend and cleared by the client
> backend once the data is read. If a backend cannot publish due to
> shared
> memory being occupied, it exits the interrupt processing code,
> and the client backend times out with a warning.
>
> Please find below an example query to fetch memory contexts from the
> backend
> with id '106114'. Second argument -'get_summary' is 'false',
> indicating a request for statistics of all the contexts.
>
> postgres=#
> select * FROM pg_get_remote_backend_memory_contexts('116292', false)
> LIMIT 2;
> -[ RECORD 1 ]-+----------------------
> name | TopMemoryContext
> ident |
> type | AllocSet
> path | {0}
> total_bytes | 97696
> total_nblocks | 5
> free_bytes | 15376
> free_chunks | 11
> used_bytes | 82320
> pid | 116292
> -[ RECORD 2 ]-+----------------------
> name | RowDescriptionContext
> ident |
> type | AllocSet
> path | {0,1}
> total_bytes | 8192
> total_nblocks | 1
> free_bytes | 6912
> free_chunks | 0
> used_bytes | 1280
> pid | 116292
32d3ed8165f821f introduced a 1-based path to pg_backend_memory_contexts,
but pg_get_remote_backend_memory_contexts() seems to have a 0-based path.
pg_backend_memory_contexts has "level" column, but
pg_get_remote_backend_memory_contexts doesn't.
Are there any reasons for these?
> TODO:
> 1. Determine the behaviour when the statistics don't fit in one file.
>
> [1] PostgreSQL: Re: Creating a function for exposing memory usage of
> backend process [1]
>
> [2] PostgreSQL: Re: Get memory contexts of an arbitrary backend
> process [2]
>
> Thank you,
> Rahila Syed
>
>
>
> Links:
> ------
> [1]
> https://www.postgresql.org/message-id/0a768ae1-1703-59c7-86cc-7068ff5e318c%40oss.nttdata.com
> [2]
> https://www.postgresql.org/message-id/bea016ad-d1a7-f01d-a7e8-01106a1de77f%40oss.nttdata.com
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-23 04:50:16
Message-ID: CAH2L28vHyd+21UEfNst3eRBH4oQ0wPpX7WZ9Hf1zTpr8CLfdjw@mail.gmail.com
Lists: pgsql-hackers
Hi Michael,
Thank you for the review.
On Tue, Oct 22, 2024 at 12:18 PM Michael Paquier <michael(at)paquier(dot)xyz>
wrote:
> On Mon, Oct 21, 2024 at 11:54:21PM +0530, Rahila Syed wrote:
> > On the other hand, [2] provides the statistics for all backends but logs
> > them in a file, which may not be convenient for quick access.
>
> To be precise, pg_log_backend_memory_contexts() pushes the memory
> context stats to LOG_SERVER_ONLY or stderr, hence this is appended to
> the server logs.
>
> > A fixed-size shared memory block, currently accommodating 30 records,
> > is used to store the statistics. This number was chosen arbitrarily,
> > as it covers all parent contexts at level 1 (i.e., direct children of
> the
> > top memory context)
> > based on my tests.
> > Further experiments are needed to determine the optimal number
> > for summarizing memory statistics.
>
> + * Statistics are shared via fixed shared memory which
> + * can hold statistics for 29 contexts. The rest of the
> [...]
> + MemoryContextInfo memctx_infos[30];
> [...]
> + memset(&memCtxState->memctx_infos, 0, 30 *
> sizeof(MemoryContextInfo));
> [...]
> + size = add_size(size, mul_size(30, sizeof(MemoryContextInfo)));
> [...]
> + memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
> [...]
> + memset(&memCtxState->memctx_infos, 0, 30 *
> sizeof(MemoryContextInfo));
>
> This number is tied to MemoryContextState added by the patch. Sounds
> like this would be better as a constant properly defined rather than
> hardcoded in all these places. This would make the upper-bound more
> easily switchable in the patch.
>
>
Makes sense. Fixed in the attached patch.
> + Datum path[128];
> + char type[128];
> [...]
> + char name[1024];
> + char ident[1024];
> + char type[128];
> + Datum path[128];
>
> Again, constants. Why these values? You may want to use more
> #defines here.
>
I added the #defines for these in the attached patch.
Size of the path array should match the number of levels in the memory
context tree and type is a constant string.
For the name and ident, I have used the existing #define
MEMORY_CONTEXT_IDENT_DISPLAY_SIZE as the size limit.
> > Any additional statistics that exceed the shared memory capacity
> > are written to a file per backend in the PG_TEMP_FILES_DIR. The client
> > backend
> > first reads from the shared memory, and if necessary, retrieves the
> > remaining data from the file,
> > combining everything into a unified view. The files are cleaned up
> > automatically
> > if a backend crashes or during server restarts.
>
> Is the addition of the file to write any remaining stats really that
> useful? This comes with a heavy cost in the patch with the "in_use"
> flag, the various tweaks around the LWLock release/acquire protecting
> the shmem area and the extra cleanup steps required after even a clean
> restart. That's a lot of facility for this kind of information.
>
The rationale behind using the file is to cater to the unbounded
number of memory contexts.
The "in_use" flag is used to govern the access to shared memory
as I am reserving enough memory for only one backend.
It ensures that another backend does not overwrite the statistics
in the shared memory, before it is read by a client backend.
> Another thing that may be worth considering is to put this information
> in a DSM per the variable-size nature of the information, perhaps cap
> it to a max to make the memory footprint cheaper, and avoid all
> on-disk footprint because we don't need it to begin with as this is
> information that makes sense only while the server is running.
>
Thank you for the suggestion. I will look into using DSMs, especially
if there is a way to limit the statistics dump, while still providing a user
with enough information to debug memory consumption.
In this draft, I preferred using a file over DSMs, as a file can provide
ample space for dumping a large number of memory context statistics
without the risk of DSM creation failure due to insufficient memory.
> Also, why the single-backend limitation?
To reduce the memory footprint, the shared memory is
created for only one backend.
Each backend has to wait for previous operation
to finish before it can write.
I think a good use case for this would be a background process
periodically running the monitoring function on each of the
backends sequentially to fetch the statistics.
This way there will be little contention for shared memory.
In case a shared memory is not available, a backend immediately
returns from the interrupt handler without blocking its normal
operations.
> One could imagine a shared
> memory area indexed similarly to pgproc entries, that includes
> auxiliary processes as much as backends, so as it can be possible to
> get more memory footprints through SQL for more than one single
> process at one moment in time. If each backend has its own area of
> shmem to deal with, they could use a shared LWLock on the shmem area
> with an extra spinlock while the context data is dumped into memory as
> the copy is short-lived. Each one of them could save the information
> in a DSM created only when a dump of the shmem is requested for a
> given PID, for example.
>
I agree that such an infrastructure would be useful for fetching memory
statistics concurrently without significant synchronization overhead.
However, a drawback of this approach is reserving shared
memory slots up to MAX_BACKENDS without utilizing them
when no concurrent monitoring is happening.
As you mentioned, creating a DSM on the fly when a dump
request is received could help avoid over-allocating shared memory.
I will look into this suggestion.
Thank you for your feedback!
Rahila Syed
Attachment: 0002-Function-to-report-memory-context-stats-of-any-backe.patch (application/octet-stream, 34.2 KB)
From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-24 05:59:46
Message-ID: CAH2L28uPdS+74QCQMeVzoE5+rpPs4Eh7geB77AicDZMJ6mkiAg@mail.gmail.com
Lists: pgsql-hackers
Hi Torikoshia,
Thank you for reviewing the patch!
On Wed, Oct 23, 2024 at 9:28 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
wrote:
> On 2024-10-22 03:24, Rahila Syed wrote:
> > Hi,
> >
> > PostgreSQL provides following capabilities for reporting memory
> > contexts statistics.
> > 1. pg_get_backend_memory_contexts(); [1]
> > 2. pg_log_backend_memory_contexts(pid); [2]
> >
> > [1] provides a view of memory context statistics for a local backend,
> > while [2] prints the memory context statistics of any backend or
> > auxiliary
> > process to the PostgreSQL logs. Although [1] offers detailed
> > statistics,
> > it is limited to the local backend, restricting its use to PostgreSQL
> > client backends only.
> > On the other hand, [2] provides the statistics for all backends but
> > logs them in a file,
> > which may not be convenient for quick access.
> >
> > I propose enhancing memory context statistics reporting by combining
> > these
> > capabilities and offering a view of memory statistics for all
> > PostgreSQL backends
> > and auxiliary processes.
>
> Thanks for working on this!
>
> I originally tried to develop something like your proposal in [2], but
> there were some difficulties and settled down to implement
> pg_log_backend_memory_contexts().
>
Yes. I am revisiting this problem :)
> > Attached is a patch that implements this functionality. It introduces
> > a SQL function
> > that takes the PID of a backend as an argument, returning a set of
> > records,
> > each containing statistics for a single memory context. The underlying
> > C function
> > sends a signal to the backend and waits for it to publish its memory
> > context statistics
> > before returning them to the user. The publishing backend copies
> > these statistics
> > during the next CHECK_FOR_INTERRUPTS call.
>
> I remember waiting for dumping memory contexts stats could cause trouble
> considering some erroneous cases.
>
> For example, just after the target process finished dumping stats,
> pg_get_remote_backend_memory_contexts() caller is terminated before
> reading the stats, calling pg_get_remote_backend_memory_contexts() has
> no response any more:
>
> [session1]$ psql
> (40699)=#
>
> $ kill -s SIGSTOP 40699
>
> [session2] psql
> (40866)=# select * FROM
> pg_get_remote_backend_memory_contexts('40699', false); -- waiting
>
> $ kill -s SIGSTOP 40866
>
> $ kill -s SIGCONT 40699
>
> [session3] psql
> (47656) $ select pg_terminate_backend(40866);
>
> $ kill -s SIGCONT 40866 -- session2 terminated
>
> [session3] (47656)=# select * FROM
> pg_get_remote_backend_memory_contexts('47656', false); -- no response
>
> It seems the reason is memCtxState->in_use is now and
> memCtxState->proc_id is 40699.
> We can continue to use pg_get_remote_backend_memory_contexts() after
> specifying 40699, but it'd be hard to understand for users.
>
Thanks for testing and reporting. While I am not able to reproduce this
problem,
I think this may be happening because the requesting backend/caller is
terminated
before it gets a chance to mark memCtxState->in_use as false.
In this case memCtxState->in_use should be marked as
'false' possibly during the processing of ProcDiePending in
ProcessInterrupts().
> > This approach facilitates on-demand publication of memory statistics
> > for a specific backend, rather than collecting them at regular
> > intervals.
> > Since past memory context statistics may no longer be relevant,
> > there is little value in retaining historical data. Any collected
> > statistics
> > can be discarded once read by the client backend.
> >
> > A fixed-size shared memory block, currently accommodating 30 records,
> > is used to store the statistics. This number was chosen arbitrarily,
> > as it covers all parent contexts at level 1 (i.e., direct children of
> > the top memory context)
> > based on my tests.
> > Further experiments are needed to determine the optimal number
> > for summarizing memory statistics.
> >
> > Any additional statistics that exceed the shared memory capacity
> > are written to a file per backend in the PG_TEMP_FILES_DIR. The client
> > backend
> > first reads from the shared memory, and if necessary, retrieves the
> > remaining data from the file,
> > combining everything into a unified view. The files are cleaned up
> > automatically
> > if a backend crashes or during server restarts.
> >
> > The statistics are reported in a breadth-first search order of the
> > memory context tree,
> > with parent contexts reported before their children. This provides a
> > cumulative summary
> > before diving into the details of each child context's consumption.
> >
> > The rationale behind the shared memory chunk is to ensure that the
> > majority of contexts which are the direct children of
> > TopMemoryContext,
> > fit into memory
> > This allows a client to request a summary of memory statistics,
> > which can be served from memory without the overhead of file access,
> > unless necessary.
> >
> > A publishing backend signals waiting client backends using a condition
> >
> > variable when it has finished writing its statistics to memory.
> > The client backend checks whether the statistics belong to the
> > requested backend.
> > If not, it continues waiting on the condition variable, timing out
> > after 2 minutes.
> > This timeout is an arbitrary choice, and further work is required to
> > determine
> > a more practical value.
> >
> > All backends use the same memory space to publish their statistics.
> > Before publishing, a backend checks whether the previous statistics
> > have been
> > successfully read by a client using a shared flag, "in_use."
> > This flag is set by the publishing backend and cleared by the client
> > backend once the data is read. If a backend cannot publish due to
> > shared
> > memory being occupied, it exits the interrupt processing code,
> > and the client backend times out with a warning.
> >
> > Please find below an example query to fetch memory contexts from the
> > backend
> > with id '106114'. Second argument -'get_summary' is 'false',
> > indicating a request for statistics of all the contexts.
> >
> > postgres=#
> > select * FROM pg_get_remote_backend_memory_contexts('116292', false)
> > LIMIT 2;
> > -[ RECORD 1 ]-+----------------------
> > name | TopMemoryContext
> > ident |
> > type | AllocSet
> > path | {0}
> > total_bytes | 97696
> > total_nblocks | 5
> > free_bytes | 15376
> > free_chunks | 11
> > used_bytes | 82320
> > pid | 116292
> > -[ RECORD 2 ]-+----------------------
> > name | RowDescriptionContext
> > ident |
> > type | AllocSet
> > path | {0,1}
> > total_bytes | 8192
> > total_nblocks | 1
> > free_bytes | 6912
> > free_chunks | 0
> > used_bytes | 1280
> > pid | 116292
>
> 32d3ed8165f821f introduced 1-based path to pg_backend_memory_contexts,
> but pg_get_remote_backend_memory_contexts() seems to have 0-base path.
>
Right. I will change it to match this commit.
> pg_backend_memory_contexts has "level" column, but
> pg_get_remote_backend_memory_contexts doesn't.
>
> Are there any reasons for these?
>
No particular reason, I can add this column as well.
Thank you,
Rahila Syed
From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-25 00:21:52
Message-ID: [email protected]
Lists: pgsql-hackers
On 2024-10-24 14:59, Rahila Syed wrote:
> Hi Torikoshia,
>
> Thank you for reviewing the patch!
>
> On Wed, Oct 23, 2024 at 9:28 AM torikoshia
> <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
>
>> On 2024-10-22 03:24, Rahila Syed wrote:
>>> Hi,
>>>
>>> PostgreSQL provides following capabilities for reporting memory
>>> contexts statistics.
>>> 1. pg_get_backend_memory_contexts(); [1]
>>> 2. pg_log_backend_memory_contexts(pid); [2]
>>>
>>> [1] provides a view of memory context statistics for a local
>> backend,
>>> while [2] prints the memory context statistics of any backend or
>>> auxiliary
>>> process to the PostgreSQL logs. Although [1] offers detailed
>>> statistics,
>>> it is limited to the local backend, restricting its use to
>> PostgreSQL
>>> client backends only.
>>> On the other hand, [2] provides the statistics for all backends
>> but
>>> logs them in a file,
>>> which may not be convenient for quick access.
>>>
>>> I propose enhancing memory context statistics reporting by
>> combining
>>> these
>>> capabilities and offering a view of memory statistics for all
>>> PostgreSQL backends
>>> and auxiliary processes.
>>
>> Thanks for working on this!
>>
>> I originally tried to develop something like your proposal in [2],
>> but
>> there were some difficulties and settled down to implement
>> pg_log_backend_memory_contexts().
>
> Yes. I am revisiting this problem :)
>
>>> Attached is a patch that implements this functionality. It
>> introduces
>>> a SQL function
>>> that takes the PID of a backend as an argument, returning a set of
>>> records,
>>> each containing statistics for a single memory context. The
>> underlying
>>> C function
>>> sends a signal to the backend and waits for it to publish its
>> memory
>>> context statistics
>>> before returning them to the user. The publishing backend copies
>>> these statistics
>>> during the next CHECK_FOR_INTERRUPTS call.
>>
>> I remember waiting for dumping memory contexts stats could cause
>> trouble
>> considering some erroneous cases.
>>
>> For example, just after the target process finished dumping stats,
>> pg_get_remote_backend_memory_contexts() caller is terminated before
>> reading the stats, calling pg_get_remote_backend_memory_contexts()
>> has
>> no response any more:
>>
>> [session1]$ psql
>> (40699)=#
>>
>> $ kill -s SIGSTOP 40699
>>
>> [session2] psql
>> (40866)=# select * FROM
>> pg_get_remote_backend_memory_contexts('40699', false); -- waiting
>>
>> $ kill -s SIGSTOP 40866
>>
>> $ kill -s SIGCONT 40699
>>
>> [session3] psql
>> (47656) $ select pg_terminate_backend(40866);
>>
>> $ kill -s SIGCONT 40866 -- session2 terminated
>>
>> [session3] (47656)=# select * FROM
>> pg_get_remote_backend_memory_contexts('47656', false); -- no
>> response
>>
>> It seems the reason is memCtxState->in_use is now and
>> memCtxState->proc_id is 40699.
>> We can continue to use pg_get_remote_backend_memory_contexts() after
>>
>> specifying 40699, but it'd be hard to understand for users.
>
> Thanks for testing and reporting. While I am not able to reproduce
> this problem,
> I think this may be happening because the requesting backend/caller is
> terminated
> before it gets a chance to mark memCtxState->in_use as false.
Yeah, when I attached a debugger to 47656 while it was waiting on
pg_get_remote_backend_memory_contexts('47656', false),
memCtxState->in_use was true as you suspected:
(lldb) p memCtxState->in_use
(bool) $1 = true
(lldb) p memCtxState->proc_id
(int) $2 = 40699
(lldb) p pid
(int) $3 = 47656
> In this case memCtxState->in_use should be marked as
> 'false' possibly during the processing of ProcDiePending in
> ProcessInterrupts().
>
>>> This approach facilitates on-demand publication of memory
>> statistics
>>> for a specific backend, rather than collecting them at regular
>>> intervals.
>>> Since past memory context statistics may no longer be relevant,
>>> there is little value in retaining historical data. Any collected
>>> statistics
>>> can be discarded once read by the client backend.
>>>
>>> A fixed-size shared memory block, currently accommodating 30
>> records,
>>> is used to store the statistics. This number was chosen
>> arbitrarily,
>>> as it covers all parent contexts at level 1 (i.e., direct
>> children of
>>> the top memory context)
>>> based on my tests.
>>> Further experiments are needed to determine the optimal number
>>> for summarizing memory statistics.
>>>
>>> Any additional statistics that exceed the shared memory capacity
>>> are written to a file per backend in the PG_TEMP_FILES_DIR. The
>> client
>>> backend
>>> first reads from the shared memory, and if necessary, retrieves
>> the
>>> remaining data from the file,
>>> combining everything into a unified view. The files are cleaned up
>>> automatically
>>> if a backend crashes or during server restarts.
>>>
>>> The statistics are reported in a breadth-first search order of the
>>> memory context tree,
>>> with parent contexts reported before their children. This
>> provides a
>>> cumulative summary
>>> before diving into the details of each child context's
>> consumption.
>>>
>>> The rationale behind the shared memory chunk is to ensure that the
>>> majority of contexts which are the direct children of
>>> TopMemoryContext,
>>> fit into memory
>>> This allows a client to request a summary of memory statistics,
>>> which can be served from memory without the overhead of file
>> access,
>>> unless necessary.
>>>
>>> A publishing backend signals waiting client backends using a
>> condition
>>>
>>> variable when it has finished writing its statistics to memory.
>>> The client backend checks whether the statistics belong to the
>>> requested backend.
>>> If not, it continues waiting on the condition variable, timing out
>>> after 2 minutes.
>>> This timeout is an arbitrary choice, and further work is required
>> to
>>> determine
>>> a more practical value.
>>>
>>> All backends use the same memory space to publish their
>> statistics.
>>> Before publishing, a backend checks whether the previous
>> statistics
>>> have been
>>> successfully read by a client using a shared flag, "in_use."
>>> This flag is set by the publishing backend and cleared by the
>> client
>>> backend once the data is read. If a backend cannot publish due to
>>> shared
>>> memory being occupied, it exits the interrupt processing code,
>>> and the client backend times out with a warning.
>>>
>>> Please find below an example query to fetch memory contexts from
>> the
>>> backend
>>> with id '106114'. Second argument -'get_summary' is 'false',
>>> indicating a request for statistics of all the contexts.
>>>
>>> postgres=#
>>> select * FROM pg_get_remote_backend_memory_contexts('116292',
>> false)
>>> LIMIT 2;
>>> -[ RECORD 1 ]-+----------------------
>>> name | TopMemoryContext
>>> ident |
>>> type | AllocSet
>>> path | {0}
>>> total_bytes | 97696
>>> total_nblocks | 5
>>> free_bytes | 15376
>>> free_chunks | 11
>>> used_bytes | 82320
>>> pid | 116292
>>> -[ RECORD 2 ]-+----------------------
>>> name | RowDescriptionContext
>>> ident |
>>> type | AllocSet
>>> path | {0,1}
>>> total_bytes | 8192
>>> total_nblocks | 1
>>> free_bytes | 6912
>>> free_chunks | 0
>>> used_bytes | 1280
>>> pid | 116292
>>
>> 32d3ed8165f821f introduced 1-based path to
>> pg_backend_memory_contexts,
>> but pg_get_remote_backend_memory_contexts() seems to have 0-base
>> path.
>
> Right. I will change it to match this commit.
>
>> pg_backend_memory_contexts has "level" column, but
>> pg_get_remote_backend_memory_contexts doesn't.
>>
>> Are there any reasons for these?
>
> No particular reason, I can add this column as well.
>
> Thank you,
> Rahila Syed
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
From: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-26 14:03:32
Message-ID: CAMm1aWYVz1A6LsX4yL555Y18bNMVRQeUW=coTJwiEUgGrNoPKA@mail.gmail.com
Lists: pgsql-hackers
Hi Rahila,
I’ve spent some time reviewing the patch, and the review is still
ongoing. Here are the comments I’ve found so far.
1.
The tests are currently missing. Could you please add them?
2.
I have some concerns regarding the function name
‘pg_get_remote_backend_memory_contexts’. Specifically, the term
‘remote’ doesn’t seem appropriate to me. The function retrieves data
from other processes running on the same machine, yet the term gives the
impression that it deals with processes on different machines, which
could be misleading or unclear in this context. The argument ‘pid’
already indicates that it can get data from different processes.
Additionally, the term ‘backend’ also seems inappropriate since we are
obtaining data from processes that are different from backend
processes.
3.
> + Datum values[10];
> + bool nulls[10];
Please consider #defining the column count, or you could reuse the
existing one ‘PG_GET_BACKEND_MEMORY_CONTEXTS_COLS’.
4.
> if (context_id <= 28)
> if (context_id == 29)
> if (context_id < 29)
#define these
5.
> for (MemoryContext cur_context = cur; cur_context != NULL; cur_context = cur_context->parent)
> {
> MemoryContextId *cur_entry;
>
> cur_entry = hash_search(context_id_lookup, &cur_context, HASH_FIND, &found);
>
> if (!found)
> {
> elog(LOG, "hash table corrupted, can't construct path value");
> break;
> }
> path = lcons_int(cur_entry->context_id, path);
> }
Similar code already exists in PutMemoryContextsStatsTupleStore().
Could you create a separate function to handle this?
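A possible shape for such a helper, lifted almost verbatim from the quoted
loop (the function name is illustrative; MemoryContextId comes from the patch,
not from core):

```c
/*
 * Build the path (list of context IDs from the topmost parent down to 'cur')
 * by walking up the parent chain and looking each context up in the ID hash.
 */
static List *
get_memory_context_path(MemoryContext cur, HTAB *context_id_lookup)
{
    List       *path = NIL;

    for (MemoryContext c = cur; c != NULL; c = c->parent)
    {
        bool        found;
        MemoryContextId *entry;

        entry = hash_search(context_id_lookup, &c, HASH_FIND, &found);
        if (!found)
        {
            elog(LOG, "hash table corrupted, can't construct path value");
            break;
        }
        path = lcons_int(entry->context_id, path);
    }

    return path;
}
```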
6.
> /*
> * Shared memory is full, release lock and write to file from next
> * iteration
> */
> context_id++;
> if (context_id == 29)
> {
What if there are exactly 29 entries in the memory context tree? In that
case, creating the file would be unnecessary.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Wed, Oct 23, 2024 at 10:20 AM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
> Hi Michael,
>
> Thank you for the review.
>
> On Tue, Oct 22, 2024 at 12:18 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>
>> On Mon, Oct 21, 2024 at 11:54:21PM +0530, Rahila Syed wrote:
>> > On the other hand, [2] provides the statistics for all backends but logs
>> > them in a file, which may not be convenient for quick access.
>>
>> To be precise, pg_log_backend_memory_contexts() pushes the memory
>> context stats to LOG_SERVER_ONLY or stderr, hence this is appended to
>> the server logs.
>>
>> > A fixed-size shared memory block, currently accommodating 30 records,
>> > is used to store the statistics. This number was chosen arbitrarily,
>> > as it covers all parent contexts at level 1 (i.e., direct children of the
>> > top memory context)
>> > based on my tests.
>> > Further experiments are needed to determine the optimal number
>> > for summarizing memory statistics.
>>
>> + * Statistics are shared via fixed shared memory which
>> + * can hold statistics for 29 contexts. The rest of the
>> [...]
>> + MemoryContextInfo memctx_infos[30];
>> [...]
>> + memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
>> [...]
>> + size = add_size(size, mul_size(30, sizeof(MemoryContextInfo)));
>> [...]
>> + memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
>> [...]
>> + memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
>>
>> This number is tied to MemoryContextState added by the patch. Sounds
>> like this would be better as a constant properly defined rather than
>> hardcoded in all these places. This would make the upper-bound more
>> easily switchable in the patch.
>>
>
> Makes sense. Fixed in the attached patch.
>
>>
>> + Datum path[128];
>> + char type[128];
>> [...]
>> + char name[1024];
>> + char ident[1024];
>> + char type[128];
>> + Datum path[128];
>>
>> Again, constants. Why these values? You may want to use more
>> #defines here.
>>
> I added the #defines for these in the attached patch.
> Size of the path array should match the number of levels in the memory
> context tree and type is a constant string.
>
> For the name and ident, I have used the existing #define
> MEMORY_CONTEXT_IDENT_DISPLAY_SIZE as the size limit.
>
>>
>> > Any additional statistics that exceed the shared memory capacity
>> > are written to a file per backend in the PG_TEMP_FILES_DIR. The client
>> > backend
>> > first reads from the shared memory, and if necessary, retrieves the
>> > remaining data from the file,
>> > combining everything into a unified view. The files are cleaned up
>> > automatically
>> > if a backend crashes or during server restarts.
>>
>> Is the addition of the file to write any remaining stats really that
>> useful? This comes with a heavy cost in the patch with the "in_use"
>> flag, the various tweaks around the LWLock release/acquire protecting
>> the shmem area and the extra cleanup steps required after even a clean
>> restart. That's a lot of facility for this kind of information.
>
>
> The rationale behind using the file is to cater to the unbounded
> number of memory contexts.
> The "in_use" flag is used to govern the access to shared memory
> as I am reserving enough memory for only one backend.
> It ensures that another backend does not overwrite the statistics
> in the shared memory, before it is read by a client backend.
>
>>
>> Another thing that may be worth considering is to put this information
>> in a DSM per the variable-size nature of the information, perhaps cap
>> it to a max to make the memory footprint cheaper, and avoid all
>> on-disk footprint because we don't need it to begin with as this is
>> information that makes sense only while the server is running.
>>
> Thank you for the suggestion. I will look into using DSMs especially
> if there is a way to limit the statistics dump, while still providing a user
> with enough information to debug memory consumption.
>
> In this draft, I preferred using a file over DSMs, as a file can provide
> ample space for dumping a large number of memory context statistics
> without the risk of DSM creation failure due to insufficient memory.
>
>> Also, why the single-backend limitation?
>
>
> To reduce the memory footprint, the shared memory is
> created for only one backend.
> Each backend has to wait for previous operation
> to finish before it can write.
>
> I think a good use case for this would be a background process
> periodically running the monitoring function on each of the
> backends sequentially to fetch the statistics.
> This way there will be little contention for shared memory.
>
> In case a shared memory is not available, a backend immediately
> returns from the interrupt handler without blocking its normal
> operations.
>
>> One could imagine a shared
>> memory area indexed similarly to pgproc entries, that includes
>> auxiliary processes as much as backends, so as it can be possible to
>> get more memory footprints through SQL for more than one single
>> process at one moment in time. If each backend has its own area of
>> shmem to deal with, they could use a shared LWLock on the shmem area
>> with an extra spinlock while the context data is dumped into memory as
>> the copy is short-lived. Each one of them could save the information
>> in a DSM created only when a dump of the shmem is requested for a
>> given PID, for example.
>
>
> I agree that such an infrastructure would be useful for fetching memory
> statistics concurrently without significant synchronization overhead.
> However, a drawback of this approach is reserving shared
> memory slots up to MAX_BACKENDS without utilizing them
> when no concurrent monitoring is happening.
> As you mentioned, creating a DSM on the fly when a dump
> request is received could help avoid over-allocating shared memory.
> I will look into this suggestion
>
> Thank you for your feedback!
>
> Rahila Syed
From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-26 14:14:25
Message-ID: [email protected]
Lists: pgsql-hackers
On 2024-Oct-21, Rahila Syed wrote:
> I propose enhancing memory context statistics reporting by combining
> these capabilities and offering a view of memory statistics for all
> PostgreSQL backends and auxiliary processes.
Sounds good.
> A fixed-size shared memory block, currently accommodating 30 records,
> is used to store the statistics.
Hmm, would it make sense to use dynamic shared memory for this? The
publishing backend could dsm_create one DSM chunk of the exact size that
it needs, pass the dsm_handle to the consumer, and then have it be
destroy once it's been read. That way you don't have to define an
arbitrary limit of any size. (Maybe you could keep a limit to how much
is published in shared memory and spill the rest to disk, but I think
such a limit should be very high[1], so that it's unlikely to take
effect in normal cases.)
[1] This is very arbitrary of course, but 1 MB gives enough room for
some 7000 contexts, which should cover normal cases.
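For reference, a bare-bones sketch of the DSM handoff described here; the
struct layout and the way the handle is passed to the consumer are
assumptions, not part of any posted patch:

```c
/* Publishing backend: size the segment for exactly the contexts it has. */
Size        size = mul_size(num_contexts, sizeof(MemoryContextInfo));
dsm_segment *seg = dsm_create(size, 0);
MemoryContextInfo *infos = (MemoryContextInfo *) dsm_segment_address(seg);

/* ... fill infos[0 .. num_contexts - 1] from the context tree ... */

/* Hand the handle to the consumer, e.g. via a small fixed shmem slot. */
memCtxSlot->handle = dsm_segment_handle(seg);       /* slot is illustrative */

/* Consuming backend: attach, read, detach; the segment can then be destroyed. */
dsm_segment *peer = dsm_attach(memCtxSlot->handle);
MemoryContextInfo *stats = (MemoryContextInfo *) dsm_segment_address(peer);
/* ... build the result tuples from stats ... */
dsm_detach(peer);
```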
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Find a bug in a program, and fix it, and the program will work today.
Show the program how to find and fix a bug, and the program
will work forever" (Oliver Silfridge)
From: Andres Freund <andres(at)anarazel(dot)de>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-10-29 14:51:57
Message-ID: hi23wbergcrdxzvoibpmiu3vpgkn7pop5mn4zqepfoah3h3w4j@hiltn5pw4f3r
Lists: pgsql-hackers
Hi,
On 2024-10-26 16:14:25 +0200, Alvaro Herrera wrote:
> > A fixed-size shared memory block, currently accommodating 30 records,
> > is used to store the statistics.
>
> Hmm, would it make sene to use dynamic shared memory for this?
+1
> The publishing backend could dsm_create one DSM chunk of the exact size that
> it needs, pass the dsm_handle to the consumer, and then have it be destroy
> once it's been read.
I'd probably just make it a dshash table or such, keyed by the pid, pointing
to a dsa allocation with the stats.
> That way you don't have to define an arbitrary limit
> of any size. (Maybe you could keep a limit to how much is published in
> shared memory and spill the rest to disk, but I think such a limit should be
> very high[1], so that it's unlikely to take effect in normal cases.)
>
> [1] This is very arbitrary of course, but 1 MB gives enough room for
> some 7000 contexts, which should cover normal cases.
Agreed. I can see a point in a limit for extreme cases, but spilling to disk
doesn't seem particularly useful.
Greetings,
Andres Freund
From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-11-13 07:30:52
Message-ID: CAH2L28vtosqeTDuyY0v-2WUGcbhcX80fpAqiVwvBzk=+yn5awg@mail.gmail.com
Lists: pgsql-hackers
Hi,
Thank you for the review.
>
> Hmm, would it make sene to use dynamic shared memory for this? The
> publishing backend could dsm_create one DSM chunk of the exact size that
> it needs, pass the dsm_handle to the consumer, and then have it be
> destroy once it's been read. That way you don't have to define an
> arbitrary limit of any size. (Maybe you could keep a limit to how much
> is published in shared memory and spill the rest to disk, but I think
> such a limit should be very high[1], so that it's unlikely to take
> effect in normal cases.)
> [1] This is very arbitrary of course, but 1 MB gives enough room for
> some 7000 contexts, which should cover normal cases.
>
I used one DSA area per process to share statistics. Currently,
the size limit for each DSA is 16 MB, which can accommodate
approximately 6,700 MemoryContextInfo structs. Any additional
statistics will spill over to a file. I opted for DSAs over DSMs to
enable memory reuse by freeing segments for subsequent
statistics copies of the same backend, without needing to
recreate DSMs for each request.
The dsa_handle for each process is stored in an array,
indexed by the procNumber, within the shared memory.
The maximum size of this array is defined as the sum of
MaxBackends and the number of auxiliary processes.
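A rough sketch of this bookkeeping, assuming a plain array of dsa_handle
values indexed by ProcNumber (all names below are illustrative):

```c
/* One published-statistics DSA handle per backend or auxiliary process,
 * kept as a plain array in the main shared memory segment. */
static dsa_handle *memCtxHandles;

Size
MemCtxReportingShmemSize(void)
{
    /* ProcNumbers run from 0 to MaxBackends + NUM_AUXILIARY_PROCS - 1 */
    return mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(dsa_handle));
}

/* Publishing side: create the per-process area on first use, remember it. */
dsa_area   *area = dsa_create(memctx_tranche_id);   /* tranche id illustrative */
memCtxHandles[MyProcNumber] = dsa_get_handle(area);

/* Consuming side: attach using the target's ProcNumber and read the stats. */
dsa_area   *peer = dsa_attach(memCtxHandles[target_procno]);
```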
As requested earlier, I have renamed the function to
pg_get_process_memory_contexts(pid, get_summary).
Suggestions for a better name are welcome.
When the get_summary argument is set to true, the function provides
statistics for memory contexts up to level 2—that is, the
top memory context and all its children.
Please find attached a rebased patch that includes these changes.
I will work on adding a test for the function and some code refactoring
suggestions.
Thank you,
Rahila Syed
Attachment: v2-0001-Function-to-report-memory-context-stats-of-any-backe.patch (application/octet-stream, 37.9 KB)
From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-11-14 04:39:14
Message-ID: [email protected]
Lists: pgsql-hackers
On Wed, Nov 13, 2024 at 01:00:52PM +0530, Rahila Syed wrote:
> I used one DSA area per process to share statistics. Currently,
> the size limit for each DSA is 16 MB, which can accommodate
> approximately 6,700 MemoryContextInfo structs. Any additional
> statistics will spill over to a file. I opted for DSAs over DSMs to
> enable memory reuse by freeing segments for subsequent
> statistics copies of the same backend, without needing to
> recreate DSMs for each request.
Already mentioned previously at [1] and echoing with some surrounding
arguments, but I'd suggest to keep it simple and just remove entirely
the part of the patch where the stats information gets spilled into
disk. With more than 6000-ish context information available with a
hard limit in place, there should be plenty enough to know what's
going on anyway.
[1]: https://postgr.es/m/[email protected]
--
Michael
From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enhancing Memory Context Statistics Reporting
Date: 2024-11-14 11:48:47
Message-ID: [email protected]
Lists: pgsql-hackers
On 2024-Nov-14, Michael Paquier wrote:
> Already mentioned previously at [1] and echoing with some surrounding
> arguments, but I'd suggest to keep it simple and just remove entirely
> the part of the patch where the stats information gets spilled into
> disk. With more than 6000-ish context information available with a
> hard limit in place, there should be plenty enough to know what's
> going on anyway.
Functionally-wise I don't necessarily agree with _removing_ the spill
code, considering that production systems with thousands of tables would
easily reach that number of contexts (each index gets its own index info
context, each regexp gets its own memcxt); and I don't think silently
omitting a fraction of people's memory situation (or erroring out if the
case is hit) is going to make us any friends.
That said, it worries me that we choose a shared memory size so large
that it becomes impractical to hit the spill-to-disk code in regression
testing. Maybe we can choose a much smaller limit size when
USE_ASSERT_CHECKING is enabled, and use a test that hits that number?
That way, we know the code is being hit and tested, without imposing a
huge memory consumption on test machines.
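A sketch of what that could look like; the constant name and the numbers below are made up for illustration, not taken from the patch:
```c
/*
 * Hypothetical sketch: use a deliberately tiny limit in assert-enabled
 * builds so the regression tests exercise the overflow/spill path.
 */
#ifdef USE_ASSERT_CHECKING
#define MAX_SHMEM_CONTEXT_STATS    16      /* tiny on purpose for testing */
#else
#define MAX_SHMEM_CONTEXT_STATS    6700    /* roughly what fits in a 16 MB DSA */
#endif
```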
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-15 13:58:51 |
Message-ID: | CAH2L28vHnw36jpp9TvnarvqtkNBEjshJiO_-_ZzvL4uNBtvu6Q@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Thu, Nov 14, 2024 at 5:18 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
wrote:
> On 2024-Nov-14, Michael Paquier wrote:
>
> > Already mentioned previously at [1] and echoing with some surrounding
> > arguments, but I'd suggest to keep it simple and just remove entirely
> > the part of the patch where the stats information gets spilled into
> > disk. With more than 6000-ish context information available with a
> > hard limit in place, there should be plenty enough to know what's
> > going on anyway.
>
> Functionally-wise I don't necessarily agree with _removing_ the spill
> code, considering that production systems with thousands of tables would
> easily reach that number of contexts (each index gets its own index info
> context, each regexp gets its own memcxt); and I don't think silently
> omitting a fraction of people's memory situation (or erroring out if the
> case is hit) is going to make us any friends.
>
>
While I agree that removing the spill-to-file logic will simplify the code,
I also understand the rationale for retaining it to ensure completeness.
To achieve both completeness and avoid writing to a file, I can consider
displaying the numbers for the remaining contexts as a cumulative total
at the end of the output.
Something like follows:
```
postgres=# select * from pg_get_process_memory_contexts('237244', false);
             name             | ident |   type   | path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
------------------------------+-------+----------+-------+-------------+---------------+------------+-------------+------------+--------
 TopMemoryContext             |       | AllocSet | {0}   |       97696 |             5 |      14288 |          11 |      83408 | 237244
 search_path processing cache |       | AllocSet | {0,1} |        8192 |             1 |       5328 |           7 |       2864 | 237244

 Remaining contexts total: 23456 bytes (total_bytes), 12345 (used_bytes), 11111 (free_bytes)
```
> That said, it worries me that we choose a shared memory size so large
> that it becomes impractical to hit the spill-to-disk code in regression
> testing. Maybe we can choose a much smaller limit size when
> USE_ASSERT_CHECKING is enabled, and use a test that hits that number?
> That way, we know the code is being hit and tested, without imposing a
> huge memory consumption on test machines.
>
Makes sense. I will look into writing such a test, if we finalize the
spill-to-file approach.
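As a rough sketch (hypothetical, not the test attached here), such a check could simply request statistics from a long-lived auxiliary process and assert that rows come back; with a reduced assert-build limit, the same query would also exercise the overflow path:
```sql
-- hypothetical regression-style check
SELECT count(*) > 0 AS has_stats
  FROM pg_get_process_memory_contexts(
         (SELECT pid FROM pg_stat_activity WHERE backend_type = 'checkpointer'),
         true);
```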
Please find attached a rebased and updated patch with a basic test
and some fixes. Kindly let me know your thoughts.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Function-to-report-memory-context-stats-of-any-backe.patch | application/octet-stream | 39.8 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-20 09:09:15 |
Message-ID: | CAH2L28sDxhx-vic=Gq3U5=OA_DYW_9-VMYjMggtKNw0ixULuBQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
> To achieve both completeness and avoid writing to a file, I can consider
> displaying the numbers for the remaining contexts as a cumulative total
> at the end of the output.
>
> Something like follows:
> ```
> postgres=# select * from pg_get_process_memory_contexts('237244', false);
>              name             | ident |   type   | path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
> ------------------------------+-------+----------+-------+-------------+---------------+------------+-------------+------------+--------
>  TopMemoryContext             |       | AllocSet | {0}   |       97696 |             5 |      14288 |          11 |      83408 | 237244
>  search_path processing cache |       | AllocSet | {0,1} |        8192 |             1 |       5328 |           7 |       2864 | 237244
>
>  Remaining contexts total: 23456 bytes (total_bytes), 12345 (used_bytes), 11111 (free_bytes)
> ```
>
Please find attached an updated patch with this change. The file previously
used to store spilled statistics has been removed. Instead, a cumulative
total of the remaining/spilled context statistics is now stored in the DSM
segment, which is displayed as follows.
postgres=# select * from pg_get_process_memory_contexts('352966', false);
             name              | ident |   type   |  path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
-------------------------------+-------+----------+--------+-------------+---------------+------------+-------------+------------+--------
 TopMemoryContext              |       | AllocSet | {0}    |       97696 |             5 |      14288 |          11 |      83408 | 352966
 .
 .
 .
 MdSmgr                        |       | AllocSet | {0,18} |        8192 |             1 |       7424 |           0 |        768 | 352966
 Remaining Totals              |       |          |        |     1756016 |           188 |     658584 |         132 |    1097432 | 352966
(7129 rows)
-----
I believe this serves as a good compromise between completeness
and avoiding the overhead of file handling. However, I am open to
reintroducing file handling if displaying the complete statistics of the
remaining contexts proves to be more important.
All the known bugs in the patch have been fixed.
In summary, one DSA per PostgreSQL process is used to share
the statistics of that process. A DSA is created by the first client
backend that requests memory context statistics, and it is pinned
for all future requests to that process.
A handle to this DSA is shared between the client and the publishing
process using fixed shared memory. The fixed shared memory consists
of an array of size MaxBackends + auxiliary processes, indexed
by procno. Each element in this array is less than 100 bytes in size.
A PostgreSQL process uses a condition variable to signal a waiting client
backend once it has finished publishing the statistics. If, for some reason,
the signal is not sent, the waiting client backend will time out.
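To make the layout concrete, a per-process slot in that array could look roughly like this (the type and field names are invented for illustration, not taken from the patch):
```c
#include "storage/condition_variable.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"

/*
 * Hypothetical per-process entry in the fixed shared-memory array,
 * one slot per procno (MaxBackends + auxiliary processes).
 */
typedef struct MemCtxReportingSlot
{
    LWLock            lock;              /* protects the fields below */
    int               pid;               /* PID of the publishing process */
    dsa_handle        stats_dsa_handle;  /* DSA holding the published statistics */
    dsa_pointer       stats;             /* offset of the stats array in that DSA */
    ConditionVariable published_cv;      /* signaled once publishing is complete */
} MemCtxReportingSlot;
```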
When statistics for a local backend are requested, this function returns the
following WARNING and exits, since that case can be handled by an existing
function which doesn't require a DSA.
WARNING: cannot return statistics for local backend
HINT: Use pg_get_backend_memory_contexts instead
Looking forward to your review.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v4-Function-to-report-memory-context-stats-of-a-process.patch | application/octet-stream | 34.2 KB |
From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-20 14:01:04 |
Message-ID: | CAExHW5t4ToxhNE62zNboW2Z-B_EmD31eMQrFunR8GGnudyR7HA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Nov 20, 2024 at 2:39 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
> Hi,
>
>> To achieve both completeness and avoid writing to a file, I can consider
>> displaying the numbers for the remaining contexts as a cumulative total
>> at the end of the output.
>>
>> Something like follows:
>> ```
>> postgres=# select * from pg_get_process_memory_contexts('237244', false);
>> name | ident | type | path | total_bytes | tot
>> al_nblocks | free_bytes | free_chunks | used_bytes | pid
>> ---------------------------------------+------------------------------------------------+----------+--------------+-------------+----
>> -----------+------------+-------------+------------+--------
>> TopMemoryContext | | AllocSet | {0} | 97696 |
>> 5 | 14288 | 11 | 83408 | 237244
>> search_path processing cache | | AllocSet | {0,1} | 8192 |
>> 1 | 5328 | 7 | 2864 | 237244
>> Remaining contexts total: 23456 bytes (total_bytes) , 12345(used_bytes), 11,111(free_bytes)
>>
>> ```
>
>
> Please find attached an updated patch with this change. The file previously used to
> store spilled statistics has been removed. Instead, a cumulative total of the
> remaining/spilled context statistics is now stored in the DSM segment, which is
> displayed as follows.
>
> postgres=# select * from pg_get_process_memory_contexts('352966', false);
> name | ident | type | path | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes | pi
> d
> ------------------------------+-------+----------+--------+-------------+---------------+------------+-------------+------------+----
> ----
> TopMemoryContext | | AllocSet | {0} | 97696 | 5 | 14288 | 11 | 83408 | 352
> 966
> .
> .
> .
> MdSmgr | | AllocSet | {0,18} | 8192 | 1 | 7424 | 0 | 768 | 352
> 966
> Remaining Totals | | | | 1756016 | 188 | 658584 | 132 | 1097432 | 352
> 966
> (7129 rows)
> -----
>
> I believe this serves as a good compromise between completeness
> and avoiding the overhead of file handling. However, I am open to
> reintroducing file handling if displaying the complete statistics of the
> remaining contexts prove to be more important.
>
> All the known bugs in the patch have been fixed.
>
> In summary, one DSA per PostgreSQL process is used to share
> the statistics of that process. A DSA is created by the first client
> backend that requests memory context statistics, and it is pinned
> for all future requests to that process.
> A handle to this DSA is shared between the client and the publishing
> process using fixed shared memory. The fixed shared memory consists
> of an array of size MaxBackends + auxiliary processes, indexed
> by procno. Each element in this array is less than 100 bytes in size.
>
> A PostgreSQL process uses a condition variable to signal a waiting client
> backend once it has finished publishing the statistics. If, for some reason,
> the signal is not sent, the waiting client backend will time out.
How does the process know that the client backend has finished reading
stats and it can be refreshed? What happens, if the next request for
memory context stats comes before first requester has consumed the
statistics it requested?
Does the shared memory get deallocated when the backend which
allocated it exits?
>
> When statistics of a local backend is requested, this function returns the following
> WARNING and exits, since this can be handled by an existing function which
> doesn't require a DSA.
>
> WARNING: cannot return statistics for local backend
> HINT: Use pg_get_backend_memory_contexts instead
How about using pg_get_backend_memory_contexts() for both - local as
well as other backend? Let PID argument default to NULL which would
indicate local backend, otherwise some other backend?
--
Best Wishes,
Ashutosh Bapat
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-22 13:03:36 |
Message-ID: | CAH2L28uUHAMHXV9gF_=4DaXUocYchWGq3_JyabTmqsSD3GOCbQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
> How does the process know that the client backend has finished reading
> stats and it can be refreshed? What happens, if the next request for
> memory context stats comes before first requester has consumed the
> statistics it requested?
>
A process that's copying its statistics does not need to know that.
Whenever it receives a signal to copy statistics, it goes ahead and
copies the latest statistics to the DSA after acquiring an exclusive
lwlock.
A requestor takes a lock before it starts consuming the statistics.
If the next request comes while the first requestor is consuming the
statistics, the publishing process will wait on lwlock to be released
by the consuming process before it can write the statistics.
If the next request arrives before the first requester begins consuming
the statistics, the publishing process will acquire the lock and overwrite
the earlier statistics with the most recent ones.
As a result, both the first and second requesters will consume the
updated statistics.
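A minimal sketch of that sequence on the publishing side, assuming the slot layout above (copy_stats_to_dsa() is an assumed helper, not a real function in the patch):
```c
/* Hypothetical publish path, run when the statistics request is serviced. */
LWLockAcquire(&slot->lock, LW_EXCLUSIVE);
copy_stats_to_dsa(TopMemoryContext, slot);   /* assumed helper: write latest stats */
slot->pid = MyProcPid;                       /* mark whose statistics these are */
LWLockRelease(&slot->lock);

/* wake any client backends waiting for this process's statistics */
ConditionVariableBroadcast(&slot->published_cv);
```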
> Does the shared memory get deallocated when the backend which
> allocated it exits?
>
Memory in the DSA is allocated by a postgres process and deallocated
by the client backend for each request. Both the publishing postgres process
and the client backend detach from the DSA at the end of each request.
However, the DSM segment(s) persist even after all the processes exit
and are only destroyed upon a server restart. Each DSA is associated
with the procNumber of a postgres process and
can be re-used by any future process with the same procNumber.
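A rough sketch of that lifecycle on the requesting side, using the stock DSA API (the slot fields and tranche id are the same invented names as above):
```c
/* Hypothetical sketch: create the per-procno DSA on first use, else reuse it. */
dsa_area   *area;

if (slot->stats_dsa_handle == DSA_HANDLE_INVALID)
{
    area = dsa_create(memcxt_tranche_id);         /* assumed tranche id */
    dsa_pin(area);                                /* keep it alive across detaches */
    slot->stats_dsa_handle = dsa_get_handle(area);
}
else
    area = dsa_attach(slot->stats_dsa_handle);

/* ... read the published statistics, then free and detach ... */
dsa_free(area, slot->stats);
dsa_detach(area);
```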
> >
> > When statistics of a local backend is requested, this function returns
> the following
> > WARNING and exits, since this can be handled by an existing function
> which
> > doesn't require a DSA.
> >
> > WARNING: cannot return statistics for local backend
> > HINT: Use pg_get_backend_memory_contexts instead
>
> How about using pg_get_backend_memory_contexts() for both - local as
> well as other backend? Let PID argument default to NULL which would
> indicate local backend, otherwise some other backend?
>
I don't see much value in combining the two, especially since with
pg_get_process_memory_contexts() we can query both the postgres
backend and a background process, the name pg_get_backend_memory_context()
would be inaccurate and I am not sure whether a change to rename the
existing function would be welcome.
Please find an updated patch which fixes an issue seen in CI runs.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v5-Function-to-report-memory-context-stats-of-a-process.patch | application/octet-stream | 33.8 KB |
From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-25 04:54:47 |
Message-ID: | CAExHW5uT10fQmf7pn9_W=uxFbiQ-3-cPRKXibb=g2Qzpt55XPw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, Nov 22, 2024 at 6:33 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
> Hi,
>
>> How does the process know that the client backend has finished reading
>> stats and it can be refreshed? What happens, if the next request for
>> memory context stats comes before first requester has consumed the
>> statistics it requested?
>>
> A process that's copying its statistics does not need to know that.
> Whenever it receives a signal to copy statistics, it goes ahead and
> copies the latest statistics to the DSA after acquiring an exclusive
> lwlock.
>
> A requestor takes a lock before it starts consuming the statistics.
> If the next request comes while the first requestor is consuming the
> statistics, the publishing process will wait on lwlock to be released
> by the consuming process before it can write the statistics.
> If the next request arrives before the first requester begins consuming
> the statistics, the publishing process will acquire the lock and overwrite
> the earlier statistics with the most recent ones.
> As a result, both the first and second requesters will consume the
> updated statistics.
IIUC, the publisher and the consumer processes, both, use the same
LWLock. Publisher acquires an exclusive lock. Does consumer acquire
SHARED lock?
The publisher process might be in a transaction, processing a query or
doing something else. If it has to wait for an LWLock, that may affect its
performance. This will become even more visible if the client backend
is trying to diagnose a slow running query. Have we tried to measure
how long the publisher might have to wait for an LWLock while the
consumer is consuming statistics OR what is the impact of this wait?
>> >
>> > When statistics of a local backend is requested, this function returns the following
>> > WARNING and exits, since this can be handled by an existing function which
>> > doesn't require a DSA.
>> >
>> > WARNING: cannot return statistics for local backend
>> > HINT: Use pg_get_backend_memory_contexts instead
>>
>> How about using pg_get_backend_memory_contexts() for both - local as
>> well as other backend? Let PID argument default to NULL which would
>> indicate local backend, otherwise some other backend?
>>
> I don't see much value in combining the two, specially since with
> pg_get_process_memory_contexts() we can query both the postgres
> backend and a background process, the name pg_get_backend_memory_context()
> would be inaccurate and I am not sure whether a change to rename the
> existing function would be welcome.
Having two separate functions for the same functionality isn't a
friendly user interface.
Playing a bit with pg_terminate_backend(), which is another function
dealing with backends, to understand (a) what it does to its own
backend and (b) which processes are considered backends.
1. pg_terminate_backend() allows to terminate the backend from which
it is fired.
#select pid, application_name, backend_type, pg_terminate_backend(pid)
from pg_stat_activity;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
2. It considers autovacuum launcher and logical replication launcher
as postgres backends but not checkpointer, background writer and
walwriter.
#select pid, application_name, backend_type, pg_terminate_backend(pid)
from pg_stat_activity where pid <> pg_backend_pid();
WARNING: PID 644887 is not a PostgreSQL backend process
WARNING: PID 644888 is not a PostgreSQL backend process
WARNING: PID 644890 is not a PostgreSQL backend process
pid | application_name | backend_type | pg_terminate_backend
--------+------------------+------------------------------+----------------------
645636 | | autovacuum launcher | t
645677 | | logical replication launcher | t
644887 | | checkpointer | f
644888 | | background writer | f
644890 | | walwriter | f
(5 rows)
In that sense you are correct that pg_get_backend_memory_context()
should not provide context information of the WAL writer process, for
example. But pg_get_process_memory_contexts() would be expected to
provide its own memory context information instead of redirecting to
another function through a WARNING. It could do that redirection
itself. That will also prevent the functions' output formats from going
out of sync.
--
Best Wishes,
Ashutosh Bapat
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-27 16:19:56 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
I took a quick look at the patch today. Overall, I think this would be
very useful, I've repeatedly needed to inspect why a backend uses so
much memory, and I ended up triggering MemoryContextStats() from gdb.
This would be more convenient / safer. So +1 to the patch intent.
A couple review comments:
1) I read through the thread, and in general I agree with the reasoning
for removing the file part - it seems perfectly fine to just dump as
much as we can fit into a buffer, and then summarize the rest. But do we
need to invent a "new" limit here? The other places logging memory
contexts do something like this:
MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);
Which means we only print the 100 memory contexts at the top, and that's
it. Wouldn't that give us a reasonable memory limit too?
2) I see the function got renamed to pg_get_process_memory_contexts(),
but the comment still says pg_get_remote_backend_memory_contexts().
3) I don't see any SGML docs for this new function. I was a bit unsure
what the "summary" argument is meant to do. The comment does not explain
that either.
4) I wonder if the function needs to return PID. I mean, the caller
knows which PID it is for, so it seems rather unnecessary.
5) In the "summary" mode, it might be useful to include info about how
many child contexts were aggregated. It's useful to know whether there
was 1 child or 10000 children. In the regular (non-summary) mode it'd
always be "1", probably, but maybe it'd interact with the limit in (1).
Not sure.
6) I feel a bit uneasy about the whole locking / communication scheme.
In particular, I'm worried about lockups, deadlocks, etc. So I decided
to do a trivial stress-test - just run the new function through pgbench
with many clients.
The memstats.sql script does just this:
SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY random() LIMIT 1)
, false);
where the inner query just picks a PID for some other backend, and asks
for memory context stats for that.
And just run it like this on a scale 1 pgbench database:
pgbench -n -f memstats.sql -c 10 test
And it gets stuck *immediately*. I've seen it to wait for other client
backends and auxiliary processes like autovacuum launcher.
This is absolutely idle system, there's no reason why a process would
not respond almost immediately. I wonder if e.g. autovacuum launcher may
not be handling these requests, or what if client backends can wait in a
cycle. IIRC condition variables are not covered by a deadlock detector,
so that would be an issue. But maybe I remember wrong?
7) I've also seen this error:
pgbench: error: client 6 script 0 aborted in command 0 query 0: \
ERROR: can't attach the same segment more than once
I haven't investigated it, but it seems like a problem handling errors,
where we fail to detach from a segment after a timeout. I may be wrong,
but it might be related to this:
> I opted for DSAs over DSMs to enable memory reuse by freeing
> segments for subsequent statistics copies of the same backend,
> without needing to recreate DSMs for each request.
I feel like this might be a premature optimization - I don't have a
clear idea how expensive it is to create DSM per request, but my
intuition is that it's cheaper than processing the contexts and
generating the info.
I'd just remove that, unless someone demonstrates it really matters. I
don't really worry about how expensive it is to process a request
(within reason, of course) - it will happen only very rarely. It's more
important to make sure there's no overhead when no one asks the backend
for memory context info, and simplicity.
Also, how expensive is it to just keep the DSA "just in case"? Imagine
someone asks for the memory context info once - isn't it a waste to still
keep the DSA? I don't recall how many resources that could be.
I don't have a clear opinion on that, I'm more asking for opinions.
8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
is necessary, I think it should not be hard-coded.
regards
--
Tomas Vondra
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-28 23:23:57 |
Message-ID: | CAH2L28uvYHX1CfmbcB9JPzR6+MzJ+kb9ob+NvFy3wUHM8HUkGA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Tomas,
Thank you for the review.
>
>
> 1) I read through the thread, and in general I agree with the reasoning
> for removing the file part - it seems perfectly fine to just dump as
> much as we can fit into a buffer, and then summarize the rest. But do we
> need to invent a "new" limit here? The other places logging memory
> contexts do something like this:
>
> MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);
>
> Which means we only print the 100 memory contexts at the top, and that's
> it. Wouldn't that give us a reasonable memory limit too?
>
I think this prints more than 100 memory contexts, since 100 denotes the
max_level and contexts at each level could have up to 100 children. This
limit seems much higher than what I am currently storing in the DSA, which
is approx. 7000 contexts. I will verify this again.
> 2) I see the function got renamed to pg_get_process_memory_contexts(),
> bu the comment still says pg_get_remote_backend_memory_contexts().
>
> Fixed
>
> 3) I don't see any SGML docs for this new function. I was a bit unsure
> what the "summary" argument is meant to do. The comment does not explain
> that either.
>
Added docs.
The intention behind adding a summary argument is to report statistics for
contexts at levels 0 and 1, i.e., TopMemoryContext and its immediate children.
> 4) I wonder if the function needs to return PID. I mean, the caller
> knows which PID it is for, so it seems rather unnecessary.
>
Perhaps it can be used to ascertain that the information indeed belongs to
the requested pid.
> 5) In the "summary" mode, it might be useful to include info about how
> many child contexts were aggregated. It's useful to know whether there
> was 1 child or 10000 children. In the regular (non-summary) mode it'd
> always be "1", probably, but maybe it'd interact with the limit in (1).
> Not sure.
>
Sure, I will add this in the next iteration.
>
> 6) I feel a bit uneasy about the whole locking / communication scheme.
> In particular, I'm worried about lockups, deadlocks, etc. So I decided
> to do a trivial stress-test - just run the new function through pgbench
> with many clients.
>
> The memstats.sql script does just this:
>
> SELECT * FROM pg_get_process_memory_contexts(
> (SELECT pid FROM pg_stat_activity
> WHERE pid != pg_backend_pid()
> ORDER BY random() LIMIT 1)
> , false);
>
> where the inner query just picks a PID for some other backend, and asks
> for memory context stats for that.
>
> And just run it like this on a scale 1 pgbench database:
>
> pgbench -n -f memstats.sql -c 10 test
>
> And it gets stuck *immediately*. I've seen it to wait for other client
> backends and auxiliary processes like autovacuum launcher.
>
> This is absolutely idle system, there's no reason why a process would
> not respond almost immediately.
In my reproduction, this issue occurred because the process was terminated
while the requesting backend was waiting on the condition variable to be
signaled by it. I don't see any solution other than having the waiting client
backend time out using ConditionVariableTimedSleep.
In the patch, since the timeout was set to a high value, pgbench ended up
stuck waiting for the timeout to occur. The failure happens less frequently
after I added an additional check for the process's existence, but it cannot
be entirely avoided. This is because a process can terminate after we check
for its existence but before it signals the client. In such cases, the client
will not receive any signal.
> I wonder if e.g. autovacuum launcher may
> not be handling these requests, or what if client backends can wait in a
> cycle.
Did not see a cyclic wait in client backends due to the pgbench stress test.
>
> 7) I've also seen this error:
>
> pgbench: error: client 6 script 0 aborted in command 0 query 0: \
> ERROR: can't attach the same segment more than once
>
> I haven't investigated it, but it seems like a problem handling errors,
> where we fail to detach from a segment after a timeout.
Thanks for the hint, fixed by adding a missing call to dsa_detach after
the timeout.
>
> > I opted for DSAs over DSMs to enable memory reuse by freeing
> > segments for subsequent statistics copies of the same backend,
> > without needing to recreate DSMs for each request.
>
> I feel like this might be a premature optimization - I don't have a
> clear idea how expensive it is to create DSM per request, but my
> intuition is that it's cheaper than processing the contexts and
> generating the info.
>
> I'd just remove that, unless someone demonstrates it really matters. I
> don't really worry about how expensive it is to process a request
> (within reason, of course) - it will happen only very rarely. It's more
> important to make sure there's no overhead when no one asks the backend
> for memory context info, and simplicity.
>
> Also, how expensive it is to just keep the DSA "just in case"? Imagine
> someone asks for the memory context info once - isn't it a was to still
> keep the DSA? I don't recall how much resources could that be.
>
> I don't have a clear opinion on that, I'm more asking for opinions.
Imagining a tool that periodically queries the backends for statistics,
it would be beneficial to avoid recreating the DSAs for each call.
Currently, DSAs of size 1MB per process
(i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
would be created and pinned for subsequent reporting. This size does
not seem excessively high, even for approx 100 backends and
auxiliary processes.
> 8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
> is necessary, I think it should not be hard-coded.
>
Not sure which is the ideal value. Changed it to 15 secs and added a
#define as of now.
Something that gives enough time for the process to respond but does not
hold up the client for too long would be ideal. 15 secs seems not to be
enough for the github CI tests, which fail with a timeout error with this
setting.
PFA an updated patch with the above changes.
Attachment | Content-Type | Size |
---|---|---|
v6-Function-to-report-memory-context-stats-of-any-backe.patch | application/octet-stream | 35.4 KB |
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-11-29 00:21:30 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 11/29/24 00:23, Rahila Syed wrote:
> Hi Tomas,
>
> Thank you for the review.
>
>
>
> 1) I read through the thread, and in general I agree with the reasoning
> for removing the file part - it seems perfectly fine to just dump as
> much as we can fit into a buffer, and then summarize the rest. But do we
> need to invent a "new" limit here? The other places logging memory
> contexts do something like this:
>
> MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);
>
> Which means we only print the 100 memory contexts at the top, and that's
> it. Wouldn't that give us a reasonable memory limit too?
>
> I think this prints more than 100 memory contexts, since 100 denotes the
> max_level
> and contexts at each level could have upto 100 children. This limit
> seems much higher than
> what I am currently storing in DSA which is approx. 7000 contexts. I
> will verify this again.
>
Yeah, you may be right. I don't remember what exactly that limit does.
>
> 2) I see the function got renamed to pg_get_process_memory_contexts(),
> bu the comment still says pg_get_remote_backend_memory_contexts().
>
> Fixed
>
>
> 3) I don't see any SGML docs for this new function. I was a bit unsure
> what the "summary" argument is meant to do. The comment does not explain
> that either.
>
> Added docs.
> Intention behind adding a summary argument is to report statistics of
> contexts at level 0
> and 1 i.e TopMemoryContext and its immediate children.
>
OK
> 4) I wonder if the function needs to return PID. I mean, the caller
> knows which PID it is for, so it seems rather unnecessary.
>
> Perhaps it can be used to ascertain that the information indeed belongs to
> the requested pid.
>
I find that a bit ... suspicious. By this logic we'd include the input
parameters in every result, but we don't. So why is this case different?
> 5) In the "summary" mode, it might be useful to include info about how
> many child contexts were aggregated. It's useful to know whether there
> was 1 child or 10000 children. In the regular (non-summary) mode it'd
> always be "1", probably, but maybe it'd interact with the limit in (1).
> Not sure.
>
> Sure, I will add this in the next iteration.
>
OK
>
> 6) I feel a bit uneasy about the whole locking / communication scheme.
> In particular, I'm worried about lockups, deadlocks, etc. So I decided
> to do a trivial stress-test - just run the new function through pgbench
> with many clients.
>
> The memstats.sql script does just this:
>
> SELECT * FROM pg_get_process_memory_contexts(
> (SELECT pid FROM pg_stat_activity
> WHERE pid != pg_backend_pid()
> ORDER BY random() LIMIT 1)
> , false);
>
> where the inner query just picks a PID for some other backend, and asks
> for memory context stats for that.
>
> And just run it like this on a scale 1 pgbench database:
>
> pgbench -n -f memstats.sql -c 10 test
>
> And it gets stuck *immediately*. I've seen it to wait for other client
> backends and auxiliary processes like autovacuum launcher.
>
> This is absolutely idle system, there's no reason why a process would
> not respond almost immediately.
>
>
> In my reproduction, this issue occurred because the process was terminated
> while the requesting backend was waiting on the condition variable to be
> signaled by it. I don’t see any solution other than having the waiting
> client
> backend timeout using ConditionVariableTimedSleep.
>
> In the patch, since the timeout was set to a high value, pgbench ended
> up stuck
> waiting for the timeout to occur. The failure happens less frequently
> after I added an
> additional check for the process's existence, but it cannot be entirely
> avoided. This is because a process can terminate after we check for its
> existence but
> before it signals the client. In such cases, the client will not receive
> any signal.
>
Hmmm, I see. I guess there's no way to know if a process responds to us,
but I guess it should be possible to wake up regularly and check if the
process still exists? Wouldn't that solve the case you mentioned?
> I wonder if e.g. autovacuum launcher may
> not be handling these requests, or what if client backends can wait in a
> cycle.
>
>
> Did not see a cyclic wait in client backends due to the pgbench stress test.
>
Not sure, but if I modify the query to only request memory contexts from
non-client processes, i.e.
SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
AND backend_type != 'client backend'
ORDER BY random() LIMIT 1)
, false);
then it gets stuck and reports this:
pgbench -n -f select.sql -c 4 -T 10 test
pgbench (18devel)
WARNING: Wait for 105029 process to publish stats timed out, ...
But process 105029 still very much exists, and it's the checkpointer:
$ ps ax | grep 105029
105029 ? Ss 0:00 postgres: checkpointer
OTOH if I modify the script to only look at client backends, and wait
until the processes get "stuck" (i.e. waiting on the condition variable,
consuming 0% CPU), I get this:
$ pgbench -n -f select.sql -c 4 -T 10 test
pgbench (18devel)
WARNING: Wait for 107146 process to publish stats timed out, try again
WARNING: Wait for 107144 process to publish stats timed out, try again
WARNING: Wait for 107147 process to publish stats timed out, try again
transaction type: select.sql
...
but when it gets 'stuck', most of the processes are still very much
running (but waiting for contexts from some other process). In the above
example I see this:
107144 ? Ss 0:02 postgres: user test [local] SELECT
107145 ? Ss 0:01 postgres: user test [local] SELECT
107147 ? Ss 0:02 postgres: user test [local] SELECT
So yes, 107146 seems to be gone. But why would that block getting info
from 107144 and 107147?
Maybe that's acceptable, but couldn't this be an issue with short-lived
connections, making it hard to implement the kind of automated
collection of stats that you envision. If it hits this kind of timeouts
often, it'll make it hard to reliably collect info. No?
>
> > I opted for DSAs over DSMs to enable memory reuse by freeing
> > segments for subsequent statistics copies of the same backend,
> > without needing to recreate DSMs for each request.
>
> I feel like this might be a premature optimization - I don't have a
> clear idea how expensive it is to create DSM per request, but my
> intuition is that it's cheaper than processing the contexts and
> generating the info.
>
> I'd just remove that, unless someone demonstrates it really matters. I
> don't really worry about how expensive it is to process a request
> (within reason, of course) - it will happen only very rarely. It's more
> important to make sure there's no overhead when no one asks the backend
> for memory context info, and simplicity.
>
> Also, how expensive it is to just keep the DSA "just in case"? Imagine
> someone asks for the memory context info once - isn't it a was to still
> keep the DSA? I don't recall how much resources could that be.
>
> I don't have a clear opinion on that, I'm more asking for opinions.
>
>
> Imagining a tool that periodically queries the backends for statistics,
> it would be beneficial to avoid recreating the DSAs for each call.
I think it would be nice if you backed this with some numbers. I mean,
how expensive is it to create/destroy the DSA? How does it compare to
the other stuff this function needs to do?
> Currently, DSAs of size 1MB per process
> (i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
> would be created and pinned for subsequent reporting. This size does
> not seem excessively high, even for approx 100 backends and
> auxiliary processes.
>
That seems like a pretty substantial amount of memory reserved for each
connection. IMHO the benefits would have to be pretty significant to
justify this, especially considering it's kept "forever", even if you
run the function only once per day.
>
> 8) Two minutes seems pretty arbitrary, and also quite high. If a timeout
> is necessary, I think it should not be hard-coded.
>
> Not sure which is the ideal value. Changed it to 15 secs and added a
> #define as of now.
> Something that gives enough time for the process to respond but
> does not hold up the client for too long would be ideal. 15 secs seem to
> be not enough for github CI tests, which fail with timeout error with
> this setting.
>
> PFA an updated patch with the above changes.
Why not make this a parameter of the function? With some sensible
default, but easy to override.
regards
--
Tomas Vondra
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-12-03 19:09:02 |
Message-ID: | CAH2L28spk4c54k5pTSWRqu5LJmetQwTXtr=-F7Bnioxd425hqQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
>
>
> > 4) I wonder if the function needs to return PID. I mean, the caller
> > knows which PID it is for, so it seems rather unnecessary.
> >
> > Perhaps it can be used to ascertain that the information indeed belongs
> to
> > the requested pid.
> >
>
> I find that a bit ... suspicious. By this logic we'd include the input
> parameters in every result, but we don't. So why is this case different?
>
>
This was added to address a review suggestion. I had left it in case anyone
found it useful
for verification.
Previously, I included a check for scenarios where multiple processes could
write to the same
shared memory. Now, each process has a separate shared memory space
identified by
pgprocno, making it highly unlikely for the receiving process to see
another process's memory
dump.
Such a situation could theoretically occur if another process were mapped
to the same
pgprocno, although I’m not sure how likely that is. That said, I’ve added a
check in the receiver
to ensure the PID written in the shared memory matches the PID for which
the dump is
requested.
This guarantees that a user will never see the memory dump of another
process.
Given this, I’m fine with removing the pid column if it helps to make the
output more readable.
> 5) In the "summary" mode, it might be useful to include info about how
> > many child contexts were aggregated. It's useful to know whether
> there
> > was 1 child or 10000 children. In the regular (non-summary) mode it'd
> > always be "1", probably, but maybe it'd interact with the limit in
> (1).
> > Not sure.
> >
> > Sure, I will add this in the next iteration.
> >
>
> OK
>
I have added this information as a column named "num_agg_contexts", which
indicates
the number of contexts whose statistics have been aggregated/added for a
particular output.
In summary mode, all the child contexts of a given level-1 context are
aggregated, and
their statistics are presented as part of the parent context's statistics.
In this case,
num_agg_contexts provides the count of all child contexts under a given
level-1 context.
In regular (non-summary) mode, this column shows a value of 1, meaning the
statistics correspond to a single context, with all context statistics
displayed individually. In this mode, an aggregate result is displayed if the
number of contexts exceeds the DSA size limit, in which case num_agg_contexts
will show the number of the remaining contexts.
>
> > In the patch, since the timeout was set to a high value, pgbench ended
> > up stuck
> > waiting for the timeout to occur. The failure happens less frequently
> > after I added an
> > additional check for the process's existence, but it cannot be entirely
> > avoided. This is because a process can terminate after we check for its
> > existence but
> > before it signals the client. In such cases, the client will not receive
> > any signal.
> >
>
> Hmmm, I see. I guess there's no way to know if a process responds to us,
> but I guess it should be possible to wake up regularly and check if the
> process still exists? Wouldn't that solve the case you mentioned?
>
I have fixed it accordingly in the attached patch by waking up every
5 seconds to check if the process exists, and sleeping again if the wake-up
condition is not satisfied. The number of such tries is limited to 20, so the
total wait time can be 100 seconds. I will make the retries configurable, in
line with your suggestion to be able to override the default waiting time.
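A sketch of that retry loop (the constants, wait-event name, and slot fields are placeholders, not the patch's actual identifiers):
```c
/*
 * Hypothetical client-side wait: sleep on the condition variable in
 * 5-second slices, re-check that the target process still exists, and
 * give up after 20 tries (about 100 seconds in total).
 */
#define MEMSTATS_WAIT_TIMEOUT_MS   5000
#define MEMSTATS_MAX_TRIES         20

ConditionVariablePrepareToSleep(&slot->published_cv);
for (int tries = 0; tries < MEMSTATS_MAX_TRIES; tries++)
{
    if (slot->pid == target_pid)
        break;                      /* statistics have been published */

    if (kill(target_pid, 0) != 0)
    {
        ereport(WARNING,
                (errmsg("process with PID %d no longer exists", target_pid)));
        break;
    }

    /* returns true if the timeout elapsed without a signal */
    (void) ConditionVariableTimedSleep(&slot->published_cv,
                                       MEMSTATS_WAIT_TIMEOUT_MS,
                                       WAIT_EVENT_MEM_CXT_PUBLISH);  /* assumed */
}
ConditionVariableCancelSleep();
```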
> > I wonder if e.g. autovacuum launcher may
> > not be handling these requests, or what if client backends can wait
> in a
> > cycle.
> >
> >
> > Did not see a cyclic wait in client backends due to the pgbench stress
> test.
> >
>
> Not sure, but if I modify the query to only request memory contexts from
> non-client processes, i.e.
>
> SELECT * FROM pg_get_process_memory_contexts(
> (SELECT pid FROM pg_stat_activity
> WHERE pid != pg_backend_pid()
> AND backend_type != 'client backend'
> ORDER BY random() LIMIT 1)
> , false);
>
> then it gets stuck and reports this:
>
> pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 105029 process to publish stats timed out, ...
>
> But process 105029 still very much exists, and it's the checkpointer:
>
In the case of checkpointer, I also see some wait time after running the
tests that you mentioned, but it eventually completes the request in my
runs.
> $ ps ax | grep 105029
> 105029 ? Ss 0:00 postgres: checkpointer
>
> OTOH if I modify the script to only look at client backends, and wait
> until the processes get "stuck" (i.e. waiting on the condition variable,
> consuming 0% CPU), I get this:
>
> $ pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 107146 process to publish stats timed out, try again
> WARNING: Wait for 107144 process to publish stats timed out, try again
> WARNING: Wait for 107147 process to publish stats timed out, try again
> transaction type: select.sql
> ...
>
> but when it gets 'stuck', most of the processes are still very much
> running (but waiting for contexts from some other process). In the above
> example I see this:
>
> 107144 ? Ss 0:02 postgres: user test [local] SELECT
> 107145 ? Ss 0:01 postgres: user test [local] SELECT
> 107147 ? Ss 0:02 postgres: user test [local] SELECT
>
> So yes, 107146 seems to be gone. But why would that block getting info
> from 107144 and 107147?
>
Most likely 107144 and/or 107147 must also be waiting for 107146, which is
gone. Something like 107144 -> 107147 -> 107146 (dead), or 107144 -> 107146
(dead) and 107147 -> 107146 (dead).
> Maybe that's acceptable, but couldn't this be an issue with short-lived
> connections, making it hard to implement the kind of automated
> collection of stats that you envision. If it hits this kind of timeouts
> often, it'll make it hard to reliably collect info. No?
Yes, if there is a chain of waiting clients due to a process no longer
existing, the waiting time to receive information will increase. However, as
long as a failed request caused by a non-existent process is detected
promptly, the wait time should remain manageable, allowing other waiting
clients to obtain the requested information from the existing processes.
In such cases, it might be necessary to experiment with the waiting times
at the receiving
client. Making the waiting time user-configurable, as you suggested, by
passing it as an
argument to the function, could help address this scenario.
Thanks for highlighting this, I will test this some more.
> >
> > > I opted for DSAs over DSMs to enable memory reuse by freeing
> > > segments for subsequent statistics copies of the same backend,
> > > without needing to recreate DSMs for each request.
> >
> > I feel like this might be a premature optimization - I don't have a
> > clear idea how expensive it is to create DSM per request, but my
> > intuition is that it's cheaper than processing the contexts and
> > generating the info.
> >
> > I'd just remove that, unless someone demonstrates it really matters.
> I
> > don't really worry about how expensive it is to process a request
> > (within reason, of course) - it will happen only very rarely. It's
> more
> > important to make sure there's no overhead when no one asks the
> backend
> > for memory context info, and simplicity.
> >
> > Also, how expensive it is to just keep the DSA "just in case"?
> Imagine
> > someone asks for the memory context info once - isn't it a was to
> still
> > keep the DSA? I don't recall how much resources could that be.
> >
> > I don't have a clear opinion on that, I'm more asking for opinions.
> >
> >
> > Imagining a tool that periodically queries the backends for statistics,
> > it would be beneficial to avoid recreating the DSAs for each call.
>
> I think it would be nice if you backed this with some numbers. I mean,
> how expensive is it to create/destroy the DSA? How does it compare to
> the other stuff this function needs to do?
>
After instrumenting the code with timestamps, I observed that DSA creation
accounts for approximately 17% to 26% of the total execution time of the
function pg_get_process_memory_contexts().
> Currently, DSAs of size 1MB per process
> > (i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
> > would be created and pinned for subsequent reporting. This size does
> > not seem excessively high, even for approx 100 backends and
> > auxiliary processes.
> >
>
> That seems like a pretty substantial amount of memory reserved for each
> connection. IMHO the benefits would have to be pretty significant to
> justify this, especially considering it's kept "forever", even if you
> run the function only once per day.
>
I can reduce the initial segment size to DSA_MIN_SEGMENT_SIZE, which is
256KB per process. If needed, this could grow up to 16MB based on the
current settings.
However, for the scenario you mentioned, it would be ideal to have a
mechanism to mark a pinned DSA (using dsa_pin()) for deletion if it is not
used/attached within a specified duration. Alternatively, I could avoid using
dsa_pin() altogether, allowing the DSA to be automatically destroyed once all
processes detach from it, and recreate it for a new request.
At the moment, I am unsure which approach is most feasible. Any suggestions
would be greatly appreciated.
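For the size concern specifically, a sketch using the stock DSA API could be (illustrative only; it assumes dsa_create_ext() is available and a tranche id of the patch's choosing):
```c
/*
 * Hypothetical sketch: start the per-process DSA with a 256 kB segment
 * instead of the 1 MB default, and cap its total size at 16 MB.
 */
dsa_area *area = dsa_create_ext(memcxt_tranche_id,       /* assumed tranche id */
                                DSA_MIN_SEGMENT_SIZE,    /* 256 kB first segment */
                                DSA_MAX_SEGMENT_SIZE);

dsa_set_size_limit(area, 16 * 1024 * 1024);              /* never grow past 16 MB */
```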
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v7-Function-to-report-memory-context-stats-of-any-backe.patch | application/octet-stream | 41.0 KB |
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-12-03 21:09:11 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 12/3/24 20:09, Rahila Syed wrote:
>
> Hi,
>
>
>
>
> > 4) I wonder if the function needs to return PID. I mean, the
> caller
> > knows which PID it is for, so it seems rather unnecessary.
> >
> > Perhaps it can be used to ascertain that the information indeed
> belongs to
> > the requested pid.
> >
>
> I find that a bit ... suspicious. By this logic we'd include the input
> parameters in every result, but we don't. So why is this case different?
>
>
> This was added to address a review suggestion. I had left it in case
> anyone found it useful
> for verification.
> Previously, I included a check for scenarios where multiple processes
> could write to the same
> shared memory. Now, each process has a separate shared memory space
> identified by
> pgprocno, making it highly unlikely for the receiving process to see
> another process's memory
> dump.
> Such a situation could theoretically occur if another process were
> mapped to the same
> pgprocno, although I’m not sure how likely that is. That said, I’ve
> added a check in the receiver
> to ensure the PID written in the shared memory matches the PID for which
> the dump is
> requested.
> This guarantees that a user will never see the memory dump of another
> process.
> Given this, I’m fine with removing the pid column if it helps to make
> the output more readable.
>
I'd just remove that. I agree it might have been useful with the single
chunk of shared memory, but I think with separate chunks it's not very
useful. And if we can end up with multiple processes getting the same
pgprocno I guess we have way bigger problems, this won't fix that.
> > 5) In the "summary" mode, it might be useful to include info
> about how
> > many child contexts were aggregated. It's useful to know
> whether there
> > was 1 child or 10000 children. In the regular (non-summary)
> mode it'd
> > always be "1", probably, but maybe it'd interact with the
> limit in (1).
> > Not sure.
> >
> > Sure, I will add this in the next iteration.
> >
>
> OK
>
>
> I have added this information as a column named "num_agg_contexts",
> which indicates
> the number of contexts whose statistics have been aggregated/added for a
> particular output.
>
> In summary mode, all the child contexts of a given level-1 context are
> aggregated, and
> their statistics are presented as part of the parent context's
> statistics. In this case,
> num_agg_contexts provides the count of all child contexts under a given
> level-1 context.
>
> In regular (non-summary) mode, this column shows a value of 1, meaning
> the statistics
> correspond to a single context, with all context statistics displayed
> individually. In this mode
> an aggregate result is displayed if the number of contexts exceed the
> DSA size limit. In
> this case the num_agg_contexts will display the number of the remaining
> contexts.
>
OK
> >
> > In the patch, since the timeout was set to a high value, pgbench ended
> > up stuck
> > waiting for the timeout to occur. The failure happens less frequently
> > after I added an
> > additional check for the process's existence, but it cannot be
> entirely
> > avoided. This is because a process can terminate after we check
> for its
> > existence but
> > before it signals the client. In such cases, the client will not
> receive
> > any signal.
> >
>
> Hmmm, I see. I guess there's no way to know if a process responds to us,
> but I guess it should be possible to wake up regularly and check if the
> process still exists? Wouldn't that solve the case you mentioned?
>
> I have fixed it accordingly in the attached patch by waking up after
> every 5 seconds
> to check if the process exists and sleeping again if the wake-up condition
> is not satisfied. The number of such tries is limited to 20. So, the
> total wait
> time can be 100 seconds. I will make the re-tries configurable, inline
> with your
> suggestion to be able to override the default waiting time.
>
Makes sense, although 100 seconds seems a bit weird, it seems we usually
pick "natural" values like 60s, or multiples of that. But if it's
configurable, that's not a huge issue.
Could the process wake up earlier than the timeout, say if it gets
interrupted by a signal (EINTR)? That'd break the "total timeout is 100
seconds", and it would be better to check that explicitly. Not sure if this
can happen, though.
One thing I'd maybe consider is starting with a short timeout, and
gradually increasing it until e.g. 5 seconds (or maybe just 1 second
would be perfectly fine, IMHO). With the current coding it means we
either get the response right away, or wait 5+ seconds. That's a huge
jump. If we start e.g. with 10ms, and then gradually multiply it by
1.2, it means we only wait "0-20% extra" on average.
But perhaps this is very unlikely and not worth the complexity.
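For what it's worth, a backoff along those lines might look something like this (placeholder names again, with an overall cap kept in place):
```c
/* Hypothetical sketch of a gradually increasing sleep between retries. */
long    timeout_ms = 10;           /* start small */
long    waited_ms = 0;

ConditionVariablePrepareToSleep(&slot->published_cv);
while (slot->pid != target_pid && waited_ms < 100 * 1000)
{
    if (ConditionVariableTimedSleep(&slot->published_cv, timeout_ms,
                                    WAIT_EVENT_MEM_CXT_PUBLISH))   /* assumed */
    {
        waited_ms += timeout_ms;
        timeout_ms = Min((long) (timeout_ms * 1.2), 1000);  /* grow 20%, cap at 1 s */
    }
}
ConditionVariableCancelSleep();
```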
>
> > I wonder if e.g. autovacuum launcher may
> > not be handling these requests, or what if client backends can
> wait in a
> > cycle.
> >
> >
> > Did not see a cyclic wait in client backends due to the pgbench
> stress test.
> >
>
> Not sure, but if I modify the query to only request memory contexts from
> non-client processes, i.e.
>
> SELECT * FROM pg_get_process_memory_contexts(
> (SELECT pid FROM pg_stat_activity
> WHERE pid != pg_backend_pid()
> AND backend_type != 'client backend'
> ORDER BY random() LIMIT 1)
> , false);
>
> then it gets stuck and reports this:
>
> pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 105029 process to publish stats timed out, ...
>
> But process 105029 still very much exists, and it's the checkpointer:
>
> In the case of checkpointer, I also see some wait time after running the
> tests that you mentioned, but it eventually completes the request in my
> runs.
>
OK, but why should it even wait that long? Surely the checkpointer
should be able to report memory contexts too?
>
> $ ps ax | grep 105029
> 105029 ? Ss 0:00 postgres: checkpointer
>
> OTOH if I modify the script to only look at client backends, and wait
> until the processes get "stuck" (i.e. waiting on the condition variable,
> consuming 0% CPU), I get this:
>
> $ pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING: Wait for 107146 process to publish stats timed out, try again
> WARNING: Wait for 107144 process to publish stats timed out, try again
> WARNING: Wait for 107147 process to publish stats timed out, try again
> transaction type: select.sql
> ...
>
> but when it gets 'stuck', most of the processes are still very much
> running (but waiting for contexts from some other process). In the above
> example I see this:
>
> 107144 ? Ss 0:02 postgres: user test [local] SELECT
> 107145 ? Ss 0:01 postgres: user test [local] SELECT
> 107147 ? Ss 0:02 postgres: user test [local] SELECT
>
> So yes, 107146 seems to be gone. But why would that block getting info
> from 107144 and 107147?
>
> Most likely 107144 and/or 107147 must also be waiting for 107146 which is
> gone. Something like 107144 -> 107147 -> 107146(dead), or
> 107144 -> 107146(dead) and 107147 -> 107146(dead).
>
I think I forgot to mention only 107145 was waiting for 107146 (dead),
and the other processes were waiting for 107145 in some way. But yeah,
detecting the dead process would improve this, although it also shows
the issues can "spread" easily.
OTOH it's unlikely to have multiple pg_get_process_memory_contexts()
queries pointing at each other like this - monitoring will just do that
from one backend, and that's it. So not a huge issue.
>
> Maybe that's acceptable, but couldn't this be an issue with short-lived
> connections, making it hard to implement the kind of automated
> collection of stats that you envision. If it hits this kind of timeouts
> often, it'll make it hard to reliably collect info. No?
>
>
> Yes, if there is a chain of waiting clients due to a process no longer
> existing, the waiting time to receive information will increase.
> However, as long as a failed request caused by a non-existent process
> is detected promptly, the wait time should remain manageable, allowing
> other waiting clients to obtain the requested information from the
> existing processes.
>
> In such cases, it might be necessary to experiment with the waiting
> times at the receiving
> client. Making the waiting time user-configurable, as you suggested, by
> passing it as an
> argument to the function, could help address this scenario.
> Thanks for highlighting this, I will test this some more.
>
I think we should try very hard to make this work well without the user
having to mess with the timeouts. These are exceptional conditions that
happen only very rarely, which makes it hard to find good values.
>
> >
> > > I opted for DSAs over DSMs to enable memory reuse by freeing
> > > segments for subsequent statistics copies of the same backend,
> > > without needing to recreate DSMs for each request.
> >
> > I feel like this might be a premature optimization - I don't
> have a
> > clear idea how expensive it is to create DSM per request, but my
> > intuition is that it's cheaper than processing the contexts and
> > generating the info.
> >
> > I'd just remove that, unless someone demonstrates it really
> matters. I
> > don't really worry about how expensive it is to process a request
> > (within reason, of course) - it will happen only very rarely.
> It's more
> > important to make sure there's no overhead when no one asks
> the backend
> > for memory context info, and simplicity.
> >
> > Also, how expensive is it to just keep the DSA "just in case"?
> > Imagine someone asks for the memory context info once - isn't it a
> > waste to still keep the DSA? I don't recall how much resources could
> > that be.
> >
> > I don't have a clear opinion on that, I'm more asking for
> opinions.
> >
> >
> > Imagining a tool that periodically queries the backends for
> statistics,
> > it would be beneficial to avoid recreating the DSAs for each call.
>
> I think it would be nice if you backed this with some numbers. I mean,
> how expensive is it to create/destroy the DSA? How does it compare to
> the other stuff this function needs to do?
>
> After instrumenting the code with timestamps, I observed that DSA creation
> accounts for approximately 17% to 26% of the total execution time of the
> function
> pg_get_process_memory_contexts().
>
> > Currently, DSAs of size 1MB per process
> > (i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
> > would be created and pinned for subsequent reporting. This size does
> > not seem excessively high, even for approx 100 backends and
> > auxiliary processes.
> >
>
> That seems like a pretty substantial amount of memory reserved for each
> connection. IMHO the benefits would have to be pretty significant to
> justify this, especially considering it's kept "forever", even if you
> run the function only once per day.
>
> I can reduce the initial segment size to DSA_MIN_SEGMENT_SIZE, which is
> 256KB per process. If needed, this could grow up to 16MB based on the
> current settings.
>
> However, for the scenario you mentioned, it would be ideal to have a
> mechanism
> to mark a pinned DSA (using dsa_pin()) for deletion if it is not used/
> attached within a
> specified duration. Alternatively, I could avoid using dsa_pin()
> altogether, allowing the
> DSA to be automatically destroyed once all processes detach from it, and
> recreate it
> for a new request.
>
> At the moment, I am unsure which approach is most feasible. Any
> suggestions would be
> greatly appreciated.
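(For illustration, a minimal sketch of the unpinned, per-request
alternative mentioned above; my_tranche_id and the handle hand-off are
assumptions, not the patch's actual code:)

    /* requesting backend: create a throwaway DSA for this request */
    dsa_area   *area = dsa_create(my_tranche_id);
    dsa_handle  handle = dsa_get_handle(area);
    /* store handle in shared memory and signal the target process ... */

    /* target process: attach with dsa_attach(handle), dsa_allocate()
     * the statistics, publish the resulting dsa_pointer, dsa_detach() */

    /* requesting backend, after reading the results */
    dsa_detach(area);   /* last detach destroys the segments; nothing
                         * stays pinned between requests */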
>
I'm entirely unconcerned about the pg_get_process_memory_contexts()
performance, within some reasonable limits. It's something executed
every now and then - no one is going to complain it takes 10ms extra,
measure tps with this function, etc.
17-26% seems surprisingly high, but even 256kB is too much, IMHO. I'd
just get rid of this optimization until someone complains and explains
why it's worth it.
Yes, let's make it fast, but I don't think we should optimize it at the
expense of "regular workload" ...
regards
--
Tomas Vondra
From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-12-16 04:44:46 |
Message-ID: | CA+HiwqEF2-ChsOZ8zeRY2ocn3vn44UXMZZDu82TbD08iR84_rQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Rahila,
Thanks for working on this. I've wanted something like this a number
of times to replace my current method of attaching gdb like everyone
else I suppose.
I have a question / suggestion about the interface.
+Datum
+pg_get_process_memory_contexts(PG_FUNCTION_ARGS)
+{
+ int pid = PG_GETARG_INT32(0);
+ bool get_summary = PG_GETARG_BOOL(1);
IIUC, this always returns all memory contexts starting from
TopMemoryContext, summarizing some child contexts if memory doesn't
suffice. Would it be helpful to allow users to specify a context other
than TopMemoryContext as the root? This could be particularly useful
in cases where the information a user is looking for would otherwise
be grouped under "Remaining Totals." Alternatively, is there a way to
achieve this with the current function, perhaps by specifying a
condition in the WHERE clause?
From: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-12-16 12:51:53 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Thanks for updating the patch and here are some comments:
'path' column of pg_get_process_memory_contexts() begins with 0, but
that column of pg_backend_memory_contexts view begins with 1:
=# select path FROM pg_get_process_memory_contexts('20271', false);
path
-------
{0}
{0,1}
{0,2}
..
=# select path from pg_backend_memory_contexts;
path
-------
{1}
{1,2}
{1,3}
..
Would it be better to begin with 1 to make them consistent?
pg_log_backend_memory_contexts() does not allow non-superusers to
execute it by default, since it can peek at other sessions' information.
pg_get_process_memory_contexts() does not have this restriction, but
wouldn't it be necessary?
When the target pid is the local backend, the HINT suggests using
pg_get_backend_memory_contexts(), but this function is not described in
the manual.
How about suggesting pg_backend_memory_contexts view instead?
=# select pg_get_process_memory_contexts('27041', false);
WARNING: cannot return statistics for local backend
HINT: Use pg_get_backend_memory_contexts instead
There is no explanation about 'num_agg_contexts', but I thought an
explanation like the one below would be useful.
> I have added this information as a column named "num_agg_contexts",
> which indicates
> the number of contexts whose statistics have been aggregated/added for
> a particular output.
git apply caused some warnings:
$ git apply
v7-Function-to-report-memory-context-stats-of-any-backe.patch
v7-Function-to-report-memory-context-stats-of-any-backe.patch:71: space
before tab in indent.
Requests to return the memory contexts of the backend with the
v7-Function-to-report-memory-context-stats-of-any-backe.patch:72: space
before tab in indent.
specified process ID. This function can send the request to
v7-Function-to-report-memory-context-stats-of-any-backe.patch:73: space
before tab in indent.
both the backends and auxiliary processes. After receiving the
memory
v7-Function-to-report-memory-context-stats-of-any-backe.patch:74: space
before tab in indent.
contexts from the process, it returns the result as one row per
v7-Function-to-report-memory-context-stats-of-any-backe.patch:75: space
before tab in indent.
context. When get_summary is true, memory contexts at level 0
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2024-12-24 20:37:46 |
Message-ID: | CAH2L28sfqiY6VzzUesZbWQL5Ypkuhs=m5U-taV4W8Gwjg0NPLQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Tomas,
>
> I'd just remove that. I agree it might have been useful with the single
> chunk of shared memory, but I think with separate chunks it's not very
> useful. And if we can end up with multiple processed getting the same
> pgprocno I guess we have way bigger problems, this won't fix that.
>
OK, fixed accordingly in the attached patch.
> >
> > > In the patch, since the timeout was set to a high value, pgbench
> ended
> > > up stuck
> > > waiting for the timeout to occur. The failure happens less
> frequently
> > > after I added an
> > > additional check for the process's existence, but it cannot be
> > entirely
> > > avoided. This is because a process can terminate after we check
> > for its
> > > existence but
> > > before it signals the client. In such cases, the client will not
> > receive
> > > any signal.
> > >
> >
> > Hmmm, I see. I guess there's no way to know if a process responds to
> us,
> > but I guess it should be possible to wake up regularly and check if
> the
> > process still exists? Wouldn't that solve the case you mentioned?
> >
> > I have fixed it accordingly in the attached patch by waking up after
> > every 5 seconds
> > to check if the process exists and sleeping again if the wake-up
> condition
> > is not satisfied. The number of such tries is limited to 20. So, the
> > total wait
> > time can be 100 seconds. I will make the re-tries configurable, inline
> > with your
> > suggestion to be able to override the default waiting time.
> >
>
> Makes sense, although 100 seconds seems a bit weird, it seems we usually
> pick "natural" values like 60s, or multiples of that. But if it's
> configurable, that's not a huge issue.
>
> Could the process wake up earlier than the timeout, say if it gets EINT
> signal? That'd break the "total timeout is 100 seconds", and it would be
> better to check that explicitly. Not sure if this can happen, though.
>
> Not sure, I will check again. According to the comment on WaitLatch, a
process waiting on it should only wake up when a timeout happens or
SetLatch is called.
> One thing I'd maybe consider is starting with a short timeout, and
> gradually increasing it until e.g. 5 seconds (or maybe just 1 second
> would be perfectly fine, IMHO). With the current coding it means we
> either get the response right away, or wait 5+ seconds. That's a big
> huge jump. If we start e.g. with 10ms, and then gradually multiply it by
> 1.2, it means we only wait "0-20% extra" on average.
>
> But perhaps this is very unlikely and not worth the complexity.
>
OK. Currently I have changed it to always wait for a signal from the
backend or a timeout before checking the exit condition. This is to
ensure that a backend gets a chance to publish the new statistics, since
I am retaining the old statistics due to reasons explained below. I will
experiment with setting a shorter timeout and gradually increasing it.
> In the case of checkpointer, I also see some wait time after running the
> > tests that you mentioned, but it eventually completes the request in my
> > runs.
> >
>
> OK, but why should it even wait that long? Surely the checkpointer
> should be able to report memory contexts too?
The checkpointer responds to requests promptly when the requests are
sequential. However, a timeout may occur if concurrent requests for
memory statistics are sent to the checkpointer.
In this case, one client sends a GetMemoryContext signal to the
checkpointer. The checkpointer sets the PublishMemoryContextPending flag
to true in the handler for this signal. This flag remains true until
CHECK_FOR_INTERRUPTS is called, which processes the interrupt and clears
the flag. If another process concurrently sends a GetMemoryContext
signal to the checkpointer before CHECK_FOR_INTERRUPTS is called for the
previous signal, the PublishMemoryContextPending flag will already be
set to true. When CHECK_FOR_INTERRUPTS is eventually called by the
checkpointer, it processes both requests and dumps its memory context
statistics.
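(Roughly, the flag-based flow looks like the sketch below; it mirrors the
existing pg_log_backend_memory_contexts() machinery, and the handler name
is an assumption:)

    /* signal handler in the target process: only sets a flag, so two
     * signals arriving before the next CHECK_FOR_INTERRUPTS collapse
     * into a single request */
    void
    HandleGetMemoryContextInterrupt(void)
    {
        InterruptPending = true;
        PublishMemoryContextPending = true;
        /* the latch is set by the procsignal handler, waking the process */
    }

    /* later, during CHECK_FOR_INTERRUPTS() processing */
    if (PublishMemoryContextPending)
    {
        PublishMemoryContextPending = false;
        ProcessGetMemoryContextInterrupt();  /* statistics published once */
    }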
However, only one of the two waiting clients gets to read the
statistics. This is because the first client that gains access to the
shared statistics reads the data and frees the DSA memory after it is
done. As a result, the second client keeps waiting until it times out,
since the checkpointer has already processed its request and sent the
statistics, which the second client never gets to read.
I believe that retaining the DSAs with the latest statistics after each
request would
help resolve the issue of request timeouts in scenarios with concurrent
requests.
I have included this in the attached patch.
> and the other processes were waiting for 107145 in some way. But yeah,
> detecting the dead process would improve this, although it also shows
> the issues can "spread" easily.
>
> OTOH it's unlikely to have multiple pg_get_process_memory_contexts()
> queries pointing at each other like this - monitoring will just do that
> from one backend, and that's it. So not a huge issue.
>
> Makes sense.
> >
> > In such cases, it might be necessary to experiment with the waiting
> > times at the receiving
> > client. Making the waiting time user-configurable, as you suggested, by
> > passing it as an
> > argument to the function, could help address this scenario.
> > Thanks for highlighting this, I will test this some more.
> >
>
> I think we should try very hard to make this work well without the user
> having to mess with the timeouts. These are exceptional conditions that
> happen only very rarely, which makes it hard to find good values.
>
> OK.
> I'm entirely unconcerned about the pg_get_process_memory_contexts()
> performance, within some reasonable limits. It's something executed
> every now and then - no one is going to complain it takes 10ms extra,
> measure tps with this function, etc.
>
> 17-26% seems surprisingly high, but Even 256kB is too much, IMHO. I'd
> just get rid of this optimization until someone complains and explains
> why it's worth it.
>
> Yes, let's make it fast, but I don't think we should optimize it at the
> expense of "regular workload" ...
>
>
After debugging the concurrent requests timeout issue, it appears there
is yet another argument in favor of avoiding the recreation of DSAs for
every request: we get to retain the last reported statistics for a given
postgres process, which can help prevent certain requests from failing
in case of concurrent requests to the same process.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v8-0001-Function-to-report-memory-context-stats-of-any-backe.patch | application/octet-stream | 40.8 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-06 13:16:30 |
Message-ID: | CAH2L28vEiz+Qxw=Y5Dze8_TxDW1kAyh_3R-BKUGvOH6r02W6cg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Torikoshia,
Thank you for the review.
>
>
> =# select path FROM pg_get_process_memory_contexts('20271', false);
> path
> -------
> {0}
> {0,1}
> {0,2}
> ..
>
> =# select path from pg_backend_memory_contexts;
> path
> -------
> {1}
> {1,2}
> {1,3}
> ..
>
> Would it be better to begin with 1 to make them consistent?
>
> Makes sense, fixed in the attached patch.
pg_log_backend_memory_contexts() does not allow non-superusers to
> execute by default since it can peek at other session information.
> pg_get_process_memory_contexts() does not have this restriction, but
> wouldn't it be necessary?
>
> Yes. I added the restriction to only allow superusers and
users with pg_read_all_stats privileges to query the memory context
statistics of another process.
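(One common way to express that restriction in C, as a hedged sketch; the
actual patch may instead rely on REVOKE/GRANT at the SQL level, and the
error wording below is illustrative:)

    if (!has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied to get memory context statistics of another process")));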
> When the target pid is the local backend, the HINT suggests using
> pg_get_backend_memory_contexts(), but this function is not described in
> the manual.
> How about suggesting pg_backend_memory_contexts view instead?
>
> =# select pg_get_process_memory_contexts('27041', false);
> WARNING: cannot return statistics for local backend
> HINT: Use pg_get_backend_memory_contexts instead
>
>
> There are no explanations about 'num_agg_contexts', but I thought the
> explanation like below would be useful.
>
> Ok. I added an explanation of this column in the documentation.
> > I have added this information as a column named "num_agg_contexts",
> > which indicates
> > the number of contexts whose statistics have been aggregated/added for
> > a particular output.
>
> git apply caused some warnings:
>
> Thank you for reporting. They should be gone now.
PFA the patch with above updates.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v9-0001-Function-to-report-memory-context-stats-of-any-backe.patch | application/octet-stream | 41.7 KB |
From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-06 17:04:20 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2025/01/06 22:16, Rahila Syed wrote:
> PFA the patch with above updates.
Thanks for updating the patch! I like this feature.
I tested this feature and encountered two issues:
Issue 1: Error with pg_get_process_memory_contexts()
When I used pg_get_process_memory_contexts() on the PID of a backend process
that had just caused an error but hadn’t rolled back yet,
the following error occurred:
Session 1 (PID=70011):
=# begin;
=# select 1/0;
ERROR: division by zero
Session 2:
=# select * from pg_get_process_memory_contexts(70011, false);
Session 1 terminated with:
ERROR: ResourceOwnerEnlarge called after release started
FATAL: terminating connection because protocol synchronization was lost
Issue 2: Segmentation Fault
When I ran pg_get_process_memory_contexts() every 0.1 seconds using
\watch command while running "make -j 4 installcheck-world",
I encountered a segmentation fault:
LOG: client backend (PID 97975) was terminated by signal 11: Segmentation fault: 11
DETAIL: Failed process was running: select infinite_recurse();
LOG: terminating any other active server processes
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-06 21:02:28 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Rahila,
Thanks for the updated and rebased patch. I've tried the pgbench test
again, to see if it gets stuck somewhere, and I'm observing this on a
new / idle cluster:
$ pgbench -n -f test.sql -P 1 test -T 60
pgbench (18devel)
progress: 1.0 s, 1647.9 tps, lat 0.604 ms stddev 0.438, 0 failed
progress: 2.0 s, 1374.3 tps, lat 0.727 ms stddev 0.386, 0 failed
progress: 3.0 s, 1514.4 tps, lat 0.661 ms stddev 0.330, 0 failed
progress: 4.0 s, 1563.4 tps, lat 0.639 ms stddev 0.212, 0 failed
progress: 5.0 s, 1665.0 tps, lat 0.600 ms stddev 0.177, 0 failed
progress: 6.0 s, 1538.0 tps, lat 0.650 ms stddev 0.192, 0 failed
progress: 7.0 s, 1491.4 tps, lat 0.670 ms stddev 0.261, 0 failed
progress: 8.0 s, 1539.5 tps, lat 0.649 ms stddev 0.443, 0 failed
progress: 9.0 s, 1517.0 tps, lat 0.659 ms stddev 0.167, 0 failed
progress: 10.0 s, 1594.0 tps, lat 0.627 ms stddev 0.227, 0 failed
progress: 11.0 s, 28.0 tps, lat 0.705 ms stddev 0.277, 0 failed
progress: 12.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 13.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 14.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 15.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 16.0 s, 1480.6 tps, lat 4.043 ms stddev 130.113, 0 failed
progress: 17.0 s, 1524.9 tps, lat 0.655 ms stddev 0.286, 0 failed
progress: 18.0 s, 1246.0 tps, lat 0.802 ms stddev 0.330, 0 failed
progress: 19.0 s, 1383.1 tps, lat 0.722 ms stddev 0.934, 0 failed
progress: 20.0 s, 1432.7 tps, lat 0.698 ms stddev 0.199, 0 failed
...
There's always a period of 10-15 seconds when everything seems to be
working fine, and then a couple seconds when it gets stuck, with the usual
LOG: Wait for 69454 process to publish stats timed out, trying again
The PIDs I've seen were for checkpointer, autovacuum launcher, ... all
of those are processes that should be handling the signal, so how come it
gets stuck every now and then? The system is entirely idle, there's no
contention for the shmem stuff, etc. Could it be forgetting about the
signal in some cases, or something like that?
The test.sql is super simple:
SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY random() LIMIT 1)
, false);
Aside from this, I went through the patch to do a regular review, so
here's the main comments in somewhat random order:
1) The SGML docs talk about "contexts at level" but I don't think that's
defined/explained anywhere, there are different ways to assign levels in
a tree-like structure, so it's unclear if levels are assigned from the
top or bottom.
2) volatile sig_atomic_t PublishMemoryContextPending = false;
I'd move this right after LogMemoryContextPending (to match the other
places that add new stuff).
3) typedef enum PrintDetails
I suppose this should have some comments, explaining what the typedef is
for. Also, "details" sounds pretty generic, perhaps "destination" or
maybe "target" would be better?
4) The memcpy here seems unnecessary - the string is going to be static
in the binary, no need to copy it. In which case the whole switch is
going to be the same as in PutMemoryContextsStatsTupleStore, so maybe
move that into a separate function?
+ switch (context->type)
+ {
+ case T_AllocSetContext:
+ type = "AllocSet";
+ strncpy(memctx_info[curr_id].type, type, strlen(type));
+ break;
+ case T_GenerationContext:
+ type = "Generation";
+ strncpy(memctx_info[curr_id].type, type, strlen(type));
+ break;
+ case T_SlabContext:
+ type = "Slab";
+ strncpy(memctx_info[curr_id].type, type, strlen(type));
+ break;
+ case T_BumpContext:
+ type = "Bump";
+ strncpy(memctx_info[curr_id].type, type, strlen(type));
+ break;
+ default:
+ type = "???";
+ strncpy(memctx_info[curr_id].type, type, strlen(type));
+ break;
+ }
5) The comment about hash table in ProcessGetMemoryContextInterrupt
seems pretty far from hash_create(), so maybe move it.
6) ProcessGetMemoryContextInterrupt seems pretty long / complex, with
multiple nested loops, it'd be good to split it into smaller parts that
are easier to understand.
7) I'm not sure if/why we need to move MemoryContextId to memutils.h.
8) The new stuff in memutils.h is added to the wrong place, into a
section labeled "Memory-context-type-specific functions" (which it
certainly is not)
9) autovacuum.c adds the ProcessGetMemoryContextInterrupt() call after
ProcessCatchupInterrupt() - that's not wrong, but I'd move it right
after ProcessLogMemoryContextInterrupt(), just like everywhere else.
10) The pg_get_process_memory_contexts comment says:
Signal a backend or an auxiliary process to send its ...
But this is not just about the signal, it also waits for the results and
produces the result set.
11) pg_get_process_memory_contexts - Wouldn't it be better to move the
InitMaterializedSRF() call until after the privilege check, etc.?
12) The pg_get_process_memory_contexts comment should explain why it's
superuser-only function. Presumably it has similar DoS risks as the
other functions, because if not why would we have the restriction?
13) I reworded and expanded the pg_get_process_memory_contexts comment a
bit, and re-wrapped it too. But I think it also needs to explain how it
communicates with the other process (sending signal, sending data
through a DSA, ...). And also how the timeouts work.
14) I'm a bit confused about the DSA allocations (but I also haven't
worked with DSA very much, so maybe it's fine). Presumably the 16MB is
upper limit, we won't use that all the time. We allocate 1MB, but allow
it to grow up to 16MB, correct? 16MB seems like a lot, certainly enough
for this purpose - if it's not, I don't think we can come up with a
better limit.
15) In any case, I don't think the 16 should be hardcoded as a magic
constant in multiple places. That's bound to be error-prone.
16) I've reformatted / reindented / wrapped the code in various places,
to make it easier to read and more consistent with the nearby code. I
also added a bunch of comments explaining what the block of code is
meant to do (I mean, what it aims to do).
16) A comment in pg_get_process_memory_contexts says:
Pin the mapping so that it doesn't throw a warning
That doesn't seem very useful. It's not clear what kind of warning this
hides, but more importantly - we're not doing stuff to hide some sort of
warning, we do it to prevent what the warning is about.
17) pg_get_process_memory_contexts has a bunch of error cases, where we
need to detach the DSA and return NULL. Would be better to do a label
with a goto, I think.
18) I think pg_get_process_memory_contexts will have issues if this
happens in the first loop:
if ((memCtxState[procNumber].proc_id == pid) &&
DsaPointerIsValid(memCtxState[procNumber].memstats_dsa_pointer))
break;
Because then we end up with memctx_info pointing to garbage after the
loop. I don't know how hard it is to hit this, I guess it can happen with
many processes calling pg_get_process_memory_contexts?
19) Minor comment and formatting of MemCtxShmemSize / MemCtxShmemInit.
20) MemoryContextInfo etc. need to be added to typedefs.list, so that
pgindent can do the right thing.
21) I think ProcessGetMemoryContextInterrupt has a bug because it uses
get_summary before reading it from the shmem.
Attached are two patches - 0001 is the original patch, 0002 has most of
my review comments (mentioned above), and a couple additional changes to
comments/formatting, etc. Those are suggestions rather than issues.
regards
--
Tomas Vondra
Attachment | Content-Type | Size |
---|---|---|
vtomas-0001-Function-to-report-memory-context-stats-of-an.patch | text/x-patch | 41.7 KB |
vtomas-0002-review.patch | text/x-patch | 17.3 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-08 12:03:01 |
Message-ID: | CAH2L28tuh9uqotOv5J6tcCcK+OEZaU8Vwr7sV1MbWX5zSJS3ag@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Fujii-san,
Thank you for testing the feature.
> Issue 1: Error with pg_get_process_memory_contexts()
> When I used pg_get_process_memory_contexts() on the PID of a backend
> process
> that had just caused an error but hadn’t rolled back yet,
> the following error occurred:
>
> Session 1 (PID=70011):
> =# begin;
> =# select 1/0;
> ERROR: division by zero
>
> Session 2:
> =# select * from pg_get_process_memory_contexts(70011, false);
>
> Session 1 terminated with:
> ERROR: ResourceOwnerEnlarge called after release started
> FATAL: terminating connection because protocol synchronization was lost
>
> In this scenario, a DSM segment descriptor is created and associated with
the
CurrentResourceOwner, which is set to the aborting transaction's resource
owner.
This occurs when ProcessGetMemoryContextInterrupt is called by the backend
while a transaction is still open and about to be rolled back.
I believe this issue needs to be addressed in the DSA and DSM code by
adding
a check to ensure that the CurrentResourceOwner is not about to be released
before
creating a DSM under the CurrentResourceOwner.
The attached fix resolves this issue. However, for a more comprehensive
solution,
I believe the same change should be extended to other parts of the DSA and
DSM
code where CurrentResourceOwner is referenced.
Issue 2: Segmentation Fault
> When I ran pg_get_process_memory_contexts() every 0.1 seconds using
> \watch command while running "make -j 4 installcheck-world",
> I encountered a segmentation fault:
>
> LOG: client backend (PID 97975) was terminated by signal 11:
> Segmentation fault: 11
> DETAIL: Failed process was running: select infinite_recurse();
> LOG: terminating any other active server processes
>
> I have not been able to reproduce this issue. Could you please clarify
which process you ran
pg_get_process_memory_contexts() on, with the interval of 0.1? Was it a
backend process
created by make installcheck-world, or some other process?
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
fix_for_resource_owner_error.patch | application/octet-stream | 2.1 KB |
From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-08 15:45:48 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2025/01/08 21:03, Rahila Syed wrote:
> I have not been able to reproduce this issue. Could you please clarify which process you ran
> |pg_get_process_memory_context()| on, with the interval of 0.1?
I used the following query for testing:
=# SELECT count(*) FROM pg_stat_activity, pg_get_process_memory_contexts(pid, false) WHERE pid <> pg_backend_pid();
=# \watch 0.1
> Was it a backend process
> created by |make installcheck-world|, or some other process?
Yes, the target backends were from make installcheck-world.
No other workloads were running.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-13 02:36:10 |
Message-ID: | CAH2L28vrnzctbSZ+P3wkfHV5-8G6jgcMd7MKo3C9mWZYh9+7Ng@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Tomas,
Thank you for the review.
On Tue, Jan 7, 2025 at 2:32 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Hi Rahila,
>
> Thanks for the updated and rebased patch. I've tried the pgbench test
> again, to see if it gets stuck somewhere, and I'm observing this on a
> new / idle cluster:
>
> $ pgbench -n -f test.sql -P 1 test -T 60
> pgbench (18devel)
> progress: 1.0 s, 1647.9 tps, lat 0.604 ms stddev 0.438, 0 failed
> progress: 2.0 s, 1374.3 tps, lat 0.727 ms stddev 0.386, 0 failed
> progress: 3.0 s, 1514.4 tps, lat 0.661 ms stddev 0.330, 0 failed
> progress: 4.0 s, 1563.4 tps, lat 0.639 ms stddev 0.212, 0 failed
> progress: 5.0 s, 1665.0 tps, lat 0.600 ms stddev 0.177, 0 failed
> progress: 6.0 s, 1538.0 tps, lat 0.650 ms stddev 0.192, 0 failed
> progress: 7.0 s, 1491.4 tps, lat 0.670 ms stddev 0.261, 0 failed
> progress: 8.0 s, 1539.5 tps, lat 0.649 ms stddev 0.443, 0 failed
> progress: 9.0 s, 1517.0 tps, lat 0.659 ms stddev 0.167, 0 failed
> progress: 10.0 s, 1594.0 tps, lat 0.627 ms stddev 0.227, 0 failed
> progress: 11.0 s, 28.0 tps, lat 0.705 ms stddev 0.277, 0 failed
> progress: 12.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 13.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 14.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 15.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 16.0 s, 1480.6 tps, lat 4.043 ms stddev 130.113, 0 failed
> progress: 17.0 s, 1524.9 tps, lat 0.655 ms stddev 0.286, 0 failed
> progress: 18.0 s, 1246.0 tps, lat 0.802 ms stddev 0.330, 0 failed
> progress: 19.0 s, 1383.1 tps, lat 0.722 ms stddev 0.934, 0 failed
> progress: 20.0 s, 1432.7 tps, lat 0.698 ms stddev 0.199, 0 failed
> ...
>
> There's always a period of 10-15 seconds when everything seems to be
> working fine, and then a couple seconds when it gets stuck, with the usual
>
> LOG: Wait for 69454 process to publish stats timed out, trying again
>
> The PIDs I've seen were for checkpointer, autovacuum launcher, ... all
> of that are processes that should be handling the signal, so how come it
> gets stuck every now and then? The system is entirely idle, there's no
> contention for the shmem stuff, etc. Could it be forgetting about the
> signal in some cases, or something like that?
>
> I am not sure as of now; I will debug further. Meanwhile, I have
addressed the review comments. Please find the details and an updated
patch below.
>
> 1) The SGML docs talk about "contexts at level" but I don't think that's
> defined/explained anywhere, there are different ways to assign levels in
> a tree-like structure, so it's unclear if levels are assigned from the
> top or bottom.
>
Fixed.
>
> 2) volatile sig_atomic_t PublishMemoryContextPending = false;
>
I'd move this right after LogMemoryContextPending (to match the other
> places that add new stuff).
>
Done.
>
> 3) typedef enum PrintDetails
>
> I suppose this should have some comments, explaining what the typedef is
> for. Also, "details" sounds pretty generic, perhaps "destination" or
> maybe "target" would be better?
>
> I added the comments above the typedef and changed the name to
PrintDestination.
> 4) The memcpy here seems unnecessary - the string is going to be static
> in the binary, no need to copy it. In which case the whole switch is
> going to be the same as in PutMemoryContextsStatsTupleStore, so maybe
> move that into a separate function?
>
> + switch (context->type)
> + {
> + case T_AllocSetContext:
> + type = "AllocSet";
> + strncpy(memctx_info[curr_id].type, type, strlen(type));
> + break;
> + case T_GenerationContext:
> + type = "Generation";
> + strncpy(memctx_info[curr_id].type, type, strlen(type));
> + break;
> + case T_SlabContext:
> + type = "Slab";
> + strncpy(memctx_info[curr_id].type, type, strlen(type));
> + break;
> + case T_BumpContext:
> + type = "Bump";
> + strncpy(memctx_info[curr_id].type, type, strlen(type));
> + break;
> + default:
> + type = "???";
> + strncpy(memctx_info[curr_id].type, type, strlen(type));
> + break;
> + }
>
Got rid of the copy and moved the switch to a separate function.
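(Something along these lines, as a sketch; the function name is an
assumption and may differ from the patch:)

    /* Returns a static string, so no copying into per-context buffers
     * is needed. */
    static const char *
    ContextTypeToString(NodeTag type)
    {
        switch (type)
        {
            case T_AllocSetContext:
                return "AllocSet";
            case T_GenerationContext:
                return "Generation";
            case T_SlabContext:
                return "Slab";
            case T_BumpContext:
                return "Bump";
            default:
                return "???";
        }
    }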
>
> 5) The comment about hash table in ProcessGetMemoryContextInterrupt
> seems pretty far from hash_create(), so maybe move it.
>
> Was fixed in your suggestions patch.
6) ProcessGetMemoryContextInterrupt seems pretty long / complex, with
> multiple nested loops, it'd be good to split it into smaller parts that
> are easier to understand.
>
> Done. I refactored the code to move certain parts into separate functions.
> 7) I'm not sure if/why we need to move MemoryContextId to memutils.h.
>
This is because I am referencing it from both mcxt.c and mcxtfuncs.c. I can
consider moving some of the code out of mcxt.c and consolidating
everything related to this patch in mcxtfuncs.c if mcxt.c is intended to
contain only the core memory context logic.
> 8) The new stuff in memutils.h is added to the wrong place, into a
> section labeled "Memory-context-type-specific functions" (which it
> certainly is not)
>
> Fixed.
> 9) autovacuum.c adds the ProcessGetMemoryContextInterrupt() call after
> ProcessCatchupInterrupt() - that's not wrong, but I'd move it right
> after ProcessLogMemoryContextInterrupt(), just like everywhere else.
>
> Fixed too.
> 10) The pg_get_process_memory_contexts comment says:
>
> Signal a backend or an auxiliary process to send its ...
>
> But this is not just about the signal, it also waits for the results and
> produces the result set.
Makes sense, edited accordingly.
>
>
11) pg_get_process_memory_contexts - Wouldn't it be better to move the
> InitMaterializedSRF() call until after the privilege check, etc.?
>
> I have moved it after the superuser check but kept it before some other
checks that lead to WARNING, after looking at how other functions have done
it.
> 12) The pg_get_process_memory_contexts comment should explain why it's
> superuser-only function. Presumably it has similar DoS risks as the
> other functions, because if not why would we have the restriction?
>
> Edited accordingly.
> 13) I reworded and expanded the pg_get_process_memory_contexts comment a
> bit, and re-wrapped it too. But I think it also needs to explain how it
> communicates with the other process (sending signal, sending data
> through a DSA, ...). And also how the timeouts work.
>
> Thank you for improving the comments. Added remaining changes as requested.
> 14) I'm a bit confused about the DSA allocations (but I also haven't
> worked with DSA very much, so maybe it's fine). Presumably the 16MB is
> upper limit, we won't use that all the time. We allocate 1MB, but allow
> it to grow up to 16MB, correct?
Yes.
16MB seems like a lot, certainly enough
> for this purpose - if it's not, I don't think we can come up with a
> better limit.
>
> I can try reducing it to 8MB, although it's expected to be only allocated
when needed.
> 15) In any case, I don't think the 16 should be hardcoded as a magic
> constant in multiple places. That's bound to be error-prone.
>
> Done.
> 16) I've reformatted / reindented / wrapped the code in various places,
> to make it easier to read and more consistent with the nearby code. I
> also added a bunch of comments explaining what the block of code is
> meant to do (I mean, what it aims to do).
>
> Thank you
> 16) A comment in pg_get_process_memory_contexts says:
>
> Pin the mapping so that it doesn't throw a warning
>
> That doesn't seem very useful. It's not clear what kind of warning this
> hides, but more importantly - we're not doing stuff to hide some sort of
> warning, we do it to prevent what the warning is about.
>
> Makes sense, fixed.
> 17) pg_get_process_memory_contexts has a bunch of error cases, where we
> need to detach the DSA and return NULL. Would be better to do a label
> with a goto, I think.
>
Done.
> 18) I think pg_get_process_memory_contexts will have issues if this
> happens in the first loop:
>
> if ((memCtxState[procNumber].proc_id == pid) &&
> DsaPointerIsValid(memCtxState[procNumber].memstats_dsa_pointer))
> break;
>
> Because then we end up with memctx_info pointing to garbage after the
> loop. I don't know how hard is to hit this, I guess it can happen in
> many processes calling pg_get_process_memory_contexts?
>
I think this is not possible, since if the breaking condition is met, it
means memstats_dsa_pointer is valid and memctx_info, which resides at
memstats_dsa_pointer, will contain valid data. Am I missing something?
Regarding the proc_id == pid check, I have added a comment in the code as
requested.
> 19) Minor comment and formatting of MemCtxShmemSize / MemCtxShmemInit.
>
> Ok.
20) MemoryContextInfo etc. need to be added to typedefs.list, so that
> pgindent can do the right thing.
>
> Done.
21) I think ProcessGetMemoryContextInterrupt has a bug because it uses
> get_summary before reading it from the shmem.
>
Fixed. It was not showing up in tests, as the result of the bug was some
extra memory allocation in the DSA and some extra computation to
populate all the paths in the hash table in spite of get_summary being
true.
>
> Attached are two patches - 0001 is the original patch, 0002 has most of
> my review comments (mentioned above), and a couple additional changes to
> comments/formatting, etc. Those are suggestions rather than issues.
>
> Thank you, applied the 0002 patch and made the changes mentioned in XXX.
Answering some of your questions in the 0002 patch below:
Q. * XXX Also, what if we fill exactly this number of contexts? Won't we
 * lose the last entry because it will be overwritten by the summary?
A. We are filling slots 0 to max_stats - 2 with memory context stats in
the loop foreach_ptr(MemoryContextData, cur, contexts) in
ProcessGetMemoryContextInterrupt.
Slot max_stats - 1 is reserved for the summary statistics.
Q. /* XXX I don't understand why we need to check get_summary here? */
A. The get_summary check is there to ensure that the context_id is
inserted in the hash table when get_summary is true. In that case the
loop breaks after the first iteration, so the entire main list of
contexts is not traversed and the context_ids would otherwise not be
inserted. Hence it is handled separately inside a check for get_summary.
Q. /* XXX What if the memstats_dsa_pointer is not valid? Is it even
possible?
* If it is, we have garbage in memctx_info. Maybe it should be an
Assert()? */
A . Agreed. Changed it to an assert.
Q. /*
* XXX isn't 2 x 1kB for every context a bit too much? Maybe better
to
* make it variable-length?
*/
A. I don't know how to do this for variable-length data in shared
memory; wouldn't that mean allocating from the heap, and thus the
pointer would become invalid in another process?
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v10-0001-Function-to-report-memory-context-stats-of-any-backe.patch | application/octet-stream | 44.9 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-21 11:27:14 |
Message-ID: | CAH2L28vR622wV44XenbhWc7ETpNtjS_oeTha7OxMx35LjWPPqQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Tomas,
I've tried the pgbench test
> again, to see if it gets stuck somewhere, and I'm observing this on a
> new / idle cluster:
>
> $ pgbench -n -f test.sql -P 1 test -T 60
> pgbench (18devel)
> progress: 1.0 s, 1647.9 tps, lat 0.604 ms stddev 0.438, 0 failed
> progress: 2.0 s, 1374.3 tps, lat 0.727 ms stddev 0.386, 0 failed
> progress: 3.0 s, 1514.4 tps, lat 0.661 ms stddev 0.330, 0 failed
> progress: 4.0 s, 1563.4 tps, lat 0.639 ms stddev 0.212, 0 failed
> progress: 5.0 s, 1665.0 tps, lat 0.600 ms stddev 0.177, 0 failed
> progress: 6.0 s, 1538.0 tps, lat 0.650 ms stddev 0.192, 0 failed
> progress: 7.0 s, 1491.4 tps, lat 0.670 ms stddev 0.261, 0 failed
> progress: 8.0 s, 1539.5 tps, lat 0.649 ms stddev 0.443, 0 failed
> progress: 9.0 s, 1517.0 tps, lat 0.659 ms stddev 0.167, 0 failed
> progress: 10.0 s, 1594.0 tps, lat 0.627 ms stddev 0.227, 0 failed
> progress: 11.0 s, 28.0 tps, lat 0.705 ms stddev 0.277, 0 failed
> progress: 12.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 13.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 14.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 15.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 16.0 s, 1480.6 tps, lat 4.043 ms stddev 130.113, 0 failed
> progress: 17.0 s, 1524.9 tps, lat 0.655 ms stddev 0.286, 0 failed
> progress: 18.0 s, 1246.0 tps, lat 0.802 ms stddev 0.330, 0 failed
> progress: 19.0 s, 1383.1 tps, lat 0.722 ms stddev 0.934, 0 failed
> progress: 20.0 s, 1432.7 tps, lat 0.698 ms stddev 0.199, 0 failed
> ...
>
> There's always a period of 10-15 seconds when everything seems to be
> working fine, and then a couple seconds when it gets stuck, with the usual
>
> LOG: Wait for 69454 process to publish stats timed out, trying again
>
> The PIDs I've seen were for checkpointer, autovacuum launcher, ... all
> of that are processes that should be handling the signal, so how come it
> gets stuck every now and then? The system is entirely idle, there's no
> contention for the shmem stuff, etc. Could it be forgetting about the
> signal in some cases, or something like that?
>
> Yes. This occurs when, due to concurrent signals received by a backend,
both signals are processed together, and stats are published only once.
Once the stats are read by the first client that gains access, they are
erased, causing the second client to wait until it times out.
If we make clients wait for the latest stats, timeouts may occur during
concurrent
operations. To avoid such timeouts, we can retain the previously published
memory
statistics for every backend and avoid waiting for the latest statistics
when the
previous statistics are newer than STALE_STATS_LIMIT. This limit can be
determined
based on the server load and how fast the memory statistics requests are
being
handled by the server.
For example, on a server running make -j 4 installcheck-world while
concurrently
probing client backends for memory statistics using pgbench, accepting
statistics
that were approximately 1 second old helped eliminate timeouts. Conversely,
on an
idle system, waiting for new statistics when the previous ones were older
than 0.1
seconds was sufficient to avoid any timeouts caused by concurrent requests.
PFA an updated and rebased patch that includes the capability to associate
timestamps with statistics. Additionally, I have made some minor fixes and
improved
the indentation.
Currently, I have set STALE_STATS_LIMIT to 0.5 seconds in the code,
which means we do not wait for newer statistics if the previous
statistics were published within the last 0.5 seconds of the current
request.
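(As a rough sketch, the freshness test could look like the following; the
constant and field names are assumptions:)

    #define MEMSTATS_STALE_LIMIT_MS 500     /* 0.5 seconds */

    /* Skip waiting if the previously published statistics are recent
     * enough for this request. */
    if (!TimestampDifferenceExceeds(memCtxState[procNumber].stats_timestamp,
                                    GetCurrentTimestamp(),
                                    MEMSTATS_STALE_LIMIT_MS))
    {
        /* use the previously published statistics without waiting */
    }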
In short, there are the following options for designing the wait for
statistics, depending on whether we expect concurrent requests to a
backend for memory statistics to be common.
1. Always get the latest statistics and time out if not able to.
This works fine for sequential probing, which is going to be the most
common use case.
This can lead to backend timeouts of up to MAX_TRIES * MEMSTATS_WAIT_TIMEOUT.
2. Determine an appropriate STALE_STATS_LIMIT and do not wait for the
latest stats if the previous statistics are within that limit.
This will help avoid timeouts in case of concurrent requests.
3. Do what the v10 patch on this thread does -
Wait for the latest statistics for up to MEMSTATS_WAIT_TIMEOUT;
otherwise, display the previous statistics, regardless of when they were
published.
Since timeouts are likely to occur only during concurrent requests, the
displayed
statistics are unlikely to be very outdated.
However, in this scenario, we observe the behavior you mentioned, i.e.,
concurrent
backends can get stuck for the duration of MEMSTATS_WAIT_TIMEOUT
(currently 5 seconds as per the current settings).
I am inclined toward the third approach, as concurrent requests are not
expected
to be a common use case for this feature. Moreover, with the second
approach,
determining an appropriate value for STALE_STATS_LIMIT is challenging, as
it
depends on the server's load.
Kindly let me know your preference. I have attached a patch which
implements the 2nd approach for testing; the 3rd approach is implemented
in the v10 patch.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v11-0001-Function-to-report-memory-context-stats-of-any-backe.patch | application/octet-stream | 46.6 KB |
From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-21 16:31:53 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2025/01/21 20:27, Rahila Syed wrote:
> Hi Tomas,
>
> I've tried the pgbench test
> again, to see if it gets stuck somewhere, and I'm observing this on a
> new / idle cluster:
>
> $ pgbench -n -f test.sql -P 1 test -T 60
> pgbench (18devel)
> progress: 1.0 s, 1647.9 tps, lat 0.604 ms stddev 0.438, 0 failed
> progress: 2.0 s, 1374.3 tps, lat 0.727 ms stddev 0.386, 0 failed
> progress: 3.0 s, 1514.4 tps, lat 0.661 ms stddev 0.330, 0 failed
> progress: 4.0 s, 1563.4 tps, lat 0.639 ms stddev 0.212, 0 failed
> progress: 5.0 s, 1665.0 tps, lat 0.600 ms stddev 0.177, 0 failed
> progress: 6.0 s, 1538.0 tps, lat 0.650 ms stddev 0.192, 0 failed
> progress: 7.0 s, 1491.4 tps, lat 0.670 ms stddev 0.261, 0 failed
> progress: 8.0 s, 1539.5 tps, lat 0.649 ms stddev 0.443, 0 failed
> progress: 9.0 s, 1517.0 tps, lat 0.659 ms stddev 0.167, 0 failed
> progress: 10.0 s, 1594.0 tps, lat 0.627 ms stddev 0.227, 0 failed
> progress: 11.0 s, 28.0 tps, lat 0.705 ms stddev 0.277, 0 failed
> progress: 12.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 13.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 14.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 15.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
> progress: 16.0 s, 1480.6 tps, lat 4.043 ms stddev 130.113, 0 failed
> progress: 17.0 s, 1524.9 tps, lat 0.655 ms stddev 0.286, 0 failed
> progress: 18.0 s, 1246.0 tps, lat 0.802 ms stddev 0.330, 0 failed
> progress: 19.0 s, 1383.1 tps, lat 0.722 ms stddev 0.934, 0 failed
> progress: 20.0 s, 1432.7 tps, lat 0.698 ms stddev 0.199, 0 failed
> ...
>
> There's always a period of 10-15 seconds when everything seems to be
> working fine, and then a couple seconds when it gets stuck, with the usual
>
> LOG: Wait for 69454 process to publish stats timed out, trying again
>
> The PIDs I've seen were for checkpointer, autovacuum launcher, ... all
> of that are processes that should be handling the signal, so how come it
> gets stuck every now and then? The system is entirely idle, there's no
> contention for the shmem stuff, etc. Could it be forgetting about the
> signal in some cases, or something like that?
>
> Yes, This occurs when, due to concurrent signals received by a backend,
> both signals are processed together, and stats are published only once.
> Once the stats are read by the first client that gains access, they are erased,
> causing the second client to wait until timeout.
>
> If we make clients wait for the latest stats, timeouts may occur during concurrent
> operations. To avoid such timeouts, we can retain the previously published memory
> statistics for every backend and avoid waiting for the latest statistics when the
> previous statistics are newer than STALE_STATS_LIMIT. This limit can be determined
> based on the server load and how fast the memory statistics requests are being
> handled by the server.
>
> For example, on a server running make -j 4 installcheck-world while concurrently
> probing client backends for memory statistics using pgbench, accepting statistics
> that were approximately 1 second old helped eliminate timeouts. Conversely, on an
> idle system, waiting for new statistics when the previous ones were older than 0.1
> seconds was sufficient to avoid any timeouts caused by concurrent requests.
>
> PFA an updated and rebased patch that includes the capability to associate
> timestamps with statistics. Additionally, I have made some minor fixes and improved
> the indentation.
>
> Currently, I have set STALE_STATS_LIMIT to 0.5 seconds in the code, which means we do
> not wait for newer statistics if the previous statistics were published within the last
> 0.5 seconds of the current request.
>
> Inshort, there are following options to design the wait for statistics depending on whether
> we expect concurrent requests to a backend for memory statistics to be common.
>
> 1. Always get the latest statistics and timeout if not able to.
>
> This works fine for sequential probing which is going to be the most common use case.
> This can lead to a backend timeouts upto MAX_TRIES * MEMSTATS_WAIT_TIMEOUT.
>
> 2. Determine the appropriate STALE_STATS_LIMIT and not wait for the latest stats if
> previous statistics are within that limit .
> This will help avoid the timeouts in case of the concurrent requests.
>
> 3. Do what v10 patch on this thread does -
>
> Wait for the latest statistics for up to MEMSTATS_WAIT_TIMEOUT;
> otherwise, display the previous statistics, regardless of when they were published.
>
> Since timeouts are likely to occur only during concurrent requests, the displayed
> statistics are unlikely to be very outdated.
> However, in this scenario, we observe the behavior you mentioned, i.e., concurrent
> backends can get stuck for the duration of MEMSTATS_WAIT_TIMEOUT
> (currently 5 seconds as per the current settings).
>
> I am inclined toward the third approach, as concurrent requests are not expected
> to be a common use case for this feature. Moreover, with the second approach,
> determining an appropriate value for STALE_STATS_LIMIT is challenging, as it
> depends on the server's load.
Just an idea; as another option, how about blocking new requests to
the target process (e.g., causing them to fail with an error or
returning NULL with a warning) if a previous request is still pending?
Users can simply retry the request if it fails. IMO failing quickly
seems preferable to getting stuck for a while in cases with concurrent
requests.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-24 13:47:35 |
Message-ID: | CAH2L28u7=fcgnY8bpM87moiJxt++wqWZXh2HxFabYjiHSg76Cg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
>
> Just idea; as an another option, how about blocking new requests to
> the target process (e.g., causing them to fail with an error or
> returning NULL with a warning) if a previous request is still pending?
> Users can simply retry the request if it fails. IMO failing quickly
> seems preferable to getting stuck for a while in cases with concurrent
> requests.
>
Thank you for the suggestion. I agree that it is better to fail early and avoid
waiting for a timeout in such cases. I will add a "pending request" tracker for
this in shared memory. This approach will help prevent sending a concurrent
request if a request for the same backend is still being processed.
IMO, one downside of throwing an error in such cases is that users might wonder
if they need to take corrective action, even though the issue will resolve
itself and they just need to retry. Therefore, issuing a warning or displaying
previously updated statistics might be a better alternative to throwing an
error.
Thank you,
Rahila Syed
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Cc: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-24 22:20:50 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 1/24/25 14:47, Rahila Syed wrote:
>
> Hi,
>
>
> Just idea; as an another option, how about blocking new requests to
> the target process (e.g., causing them to fail with an error or
> returning NULL with a warning) if a previous request is still pending?
> Users can simply retry the request if it fails. IMO failing quickly
> seems preferable to getting stuck for a while in cases with concurrent
> requests.
>
> Thank you for the suggestion. I agree that it is better to fail
> early and avoid waiting for a timeout in such cases. I will add a
> "pending request" tracker for this in shared memory. This approach
> will help prevent sending a concurrent request if a request for the
> same backend is still being processed.
>
AFAIK these failures should be extremely rare - we're only talking about
that because the workload I used for testing is highly concurrent, i.e.
it requests memory context info extremely often. I doubt anyone sane is
going to do that in practice ...
> IMO, one downside of throwing an error in such cases is that the
> users might wonder if they need to take a corrective action, even
> though the issue is actually going to solve itself and they just
> need to retry. Therefore, issuing a warning or displaying previously
> updated statistics might be a better alternative to throwing an
> error.
>
Wouldn't this be mostly mitigated by adding proper detail/hint to the
error message? Sure, the user can always ignore that (especially when
calling this from a script), but well ... we can only do so much.
All this makes me think about how we shared pgstat data before the shmem
approach was introduced in PG15. Until then the process signaled pgstat
collector, and the collector wrote the statistics into a file, with a
timestamp. And the process used the timestamp to decide if it's fresh
enough ... Wouldn't the same approach work here?
I imagined it would work something like this:
requesting backend:
-------------------
* set request_ts to current timestamp
* signal the target process, to generate memory context info
* wait until the DSA gets filled with stats_ts > request_ts
* return the data, don't erase anything
target backend
--------------
* clear the signal
* generate the statistics
* set stats_ts to current timestamp
* wake all the backends waiting for the stats (through CV)
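For concreteness, the target-side steps could be sketched roughly as below (the
function and field names here, e.g. ProcessGetMemoryContextInterrupt,
PublishMemoryContextStatistics, memCtxState, stats_ts and memcxt_cv, are
illustrative assumptions rather than actual patch code):

/* called from CHECK_FOR_INTERRUPTS() in the target process */
void
ProcessGetMemoryContextInterrupt(void)
{
    /* walk the memory context tree and write the statistics into the DSA */
    PublishMemoryContextStatistics(TopMemoryContext);

    /* stamp the statistics so waiters can tell they are newer than their request */
    memCtxState[MyProcNumber].stats_ts = GetCurrentTimestamp();

    /* wake every backend waiting for this process's statistics */
    ConditionVariableBroadcast(&memCtxState[MyProcNumber].memcxt_cv);
}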
I see v11 does almost this, except that it accepts somewhat stale data.
But why would that be necessary? I don't think it's needed, and I don't
think we should accept data from before the process sends the signal.
regards
--
Tomas Vondra
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-01-29 12:45:38 |
Message-ID: | CAH2L28vAxcePsqV1AbjYeU4QAojyXS5M39d5MVGCSgAcoNybkQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On Sat, Jan 25, 2025 at 3:50 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>
> On 1/24/25 14:47, Rahila Syed wrote:
> >
> > Hi,
> >
> >
> > Just idea; as an another option, how about blocking new requests to
> > the target process (e.g., causing them to fail with an error or
> > returning NULL with a warning) if a previous request is still
> pending?
> > Users can simply retry the request if it fails. IMO failing quickly
> > seems preferable to getting stuck for a while in cases with
> concurrent
> > requests.
> >
> > Thank you for the suggestion. I agree that it is better to fail
> > early and avoid waiting for a timeout in such cases. I will add a
> > "pending request" tracker for this in shared memory. This approach
> > will help prevent sending a concurrent request if a request for the
> > same backend is still being processed.
> >
>
> AFAIK these failures should be extremely rare - we're only talking about
> that because the workload I used for testing is highly concurrent, i.e.
> it requests memory context info extremely often. I doubt anyone sane is
> going to do that in practice ...
Yes, that makes sense.
>
>
> IMO, one downside of throwing an error in such cases is that the
> > users might wonder if they need to take a corrective action, even
> > though the issue is actually going to solve itself and they just
> > need to retry. Therefore, issuing a warning or displaying previously
> > updated statistics might be a better alternative to throwing an
> > error.
> >
>
> Wouldn't this be mostly mitigated by adding proper detail/hint to the
> error message? Sure, the user can always ignore that (especially when
> calling this from a script), but well ... we can only do so much.
>
OK.
All this makes me think about how we shared pgstat data before the shmem
> approach was introduced in PG15. Until then the process signaled pgstat
> collector, and the collector wrote the statistics into a file, with a
> timestamp. And the process used the timestamp to decide if it's fresh
> enough ... Wouldn't the same approach work here?
>
> I imagined it would work something like this:
>
> requesting backend:
> -------------------
> * set request_ts to current timestamp
> * signal the target process, to generate memory context info
> * wait until the DSA gets filled with stats_ts > request_ts
> * return the data, don't erase anything
>
> target backend
> --------------
> * clear the signal
> * generate the statistics
> * set stats_ts to current timestamp
> * wait all the backends waiting for the stats (through CV)
>
> I see v11 does almost this, except that it accepts somewhat stale data.
>
That's correct.
> But why would that be necessary? I don't think it's needed, and I don't
> think we should accept data from before the process sends the signal.
>
This is done in an attempt to prevent concurrent requests from timing out.
In such cases, the data in response to another request is likely to already be
in the dynamic shared memory. Hence, instead of waiting for the latest data and
risking a timeout, the approach displays available statistics that are newer
than a defined threshold. Additionally, since we can't distinguish between
sequential and concurrent requests, we accept somewhat stale data for all
requests.
I realize this approach has some issues, mainly regarding how to determine an
appropriate threshold value or a limit for old data.
Therefore, I agree that it makes sense to display the data that is published
after the request is made. If such data can't be published due to concurrent
requests or other delays, the function should detect this and return as soon as
possible.
Thank you,
Rahila Syed
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-02-03 12:47:14 |
Message-ID: | CAH2L28vYvC5DP+YccBE7VnC-khn3k_vd=MGxn8BFEpLc40ncHw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
> >
> > Just idea; as an another option, how about blocking new requests to
> > the target process (e.g., causing them to fail with an error or
> > returning NULL with a warning) if a previous request is still
> pending?
> > Users can simply retry the request if it fails. IMO failing quickly
> > seems preferable to getting stuck for a while in cases with
> concurrent
> > requests.
> >
> > Thank you for the suggestion. I agree that it is better to fail
> > early and avoid waiting for a timeout in such cases. I will add a
> > "pending request" tracker for this in shared memory. This approach
> > will help prevent sending a concurrent request if a request for the
> > same backend is still being processed.
> >
Please find attached a patch that adds a request_pending field in
shared memory. This allows us to detect concurrent requests early
and return a WARNING message immediately, avoiding unnecessary
waiting and potential timeouts. This is added in the v12-0002* patch.
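As a rough illustration of the early exit (the request_pending field and the
surrounding names below are assumptions based on this description, not
necessarily the exact v12 code):

LWLockAcquire(&memCtxState[procNumber].lw_lock, LW_EXCLUSIVE);
if (memCtxState[procNumber].request_pending)
{
    /* another client is already waiting on this backend; bail out early */
    LWLockRelease(&memCtxState[procNumber].lw_lock);
    ereport(WARNING,
            errmsg("cannot process the request at the moment"),
            errhint("Another request is pending, try again."));
    PG_RETURN_NULL();
}
memCtxState[procNumber].request_pending = true;
LWLockRelease(&memCtxState[procNumber].lw_lock);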
> I imagined it would work something like this:
>
> requesting backend:
> -------------------
> * set request_ts to current timestamp
> * signal the target process, to generate memory context info
> * wait until the DSA gets filled with stats_ts > request_ts
> * return the data, don't erase anything
>
> target backend
> --------------
> * clear the signal
> * generate the statistics
> * set stats_ts to current timestamp
> * wait all the backends waiting for the stats (through CV)
>
The attached v12-0002* patch implements this. We determine the latest
statistics based on the stats timestamp: if it is greater than the timestamp
when the request was sent, the statistics are considered up to date and are
returned immediately. Otherwise, the client waits for the latest statistics to
be published until the timeout is reached.
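In pseudo-C, the freshness check on the requesting side might look roughly like
this (a sketch only; memcxt_cv, stats_timestamp and the wait-event name are
assumed identifiers, and locking is omitted for brevity):

curr_timestamp = GetCurrentTimestamp();
SendProcSignal(pid, PROCSIG_GET_MEMORY_CONTEXT, procNumber);

/* wait until the target publishes statistics newer than our request */
while (memCtxState[procNumber].stats_timestamp <= curr_timestamp)
{
    if (ConditionVariableTimedSleep(&memCtxState[procNumber].memcxt_cv,
                                    MEMSTATS_WAIT_TIMEOUT,
                                    WAIT_EVENT_MEM_CXT_PUBLISH))
        break;      /* timed out; retry or fall back to older statistics */
}
ConditionVariableCancelSleep();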
With the latest changes, I don't see a dip in tps even when concurrent
requests are run in a pgbench script.
pgbench -n -f monitoring.sql -P 1 postgres -T 60
pgbench (18devel)
progress: 1.0 s, 816.9 tps, lat 1.218 ms stddev 0.317, 0 failed
progress: 2.0 s, 821.9 tps, lat 1.216 ms stddev 0.177, 0 failed
progress: 3.0 s, 817.1 tps, lat 1.224 ms stddev 0.209, 0 failed
progress: 4.0 s, 791.0 tps, lat 1.262 ms stddev 0.292, 0 failed
progress: 5.0 s, 780.8 tps, lat 1.280 ms stddev 0.326, 0 failed
progress: 6.0 s, 675.2 tps, lat 1.482 ms stddev 0.503, 0 failed
progress: 7.0 s, 674.0 tps, lat 1.482 ms stddev 0.387, 0 failed
progress: 8.0 s, 821.0 tps, lat 1.217 ms stddev 0.272, 0 failed
progress: 9.0 s, 903.0 tps, lat 1.108 ms stddev 0.196, 0 failed
progress: 10.0 s, 886.9 tps, lat 1.128 ms stddev 0.160, 0 failed
progress: 11.0 s, 887.1 tps, lat 1.126 ms stddev 0.243, 0 failed
progress: 12.0 s, 871.0 tps, lat 1.147 ms stddev 0.227, 0 failed
progress: 13.0 s, 735.0 tps, lat 1.361 ms stddev 0.329, 0 failed
progress: 14.0 s, 655.9 tps, lat 1.522 ms stddev 0.331, 0 failed
progress: 15.0 s, 674.0 tps, lat 1.484 ms stddev 0.254, 0 failed
progress: 16.0 s, 659.0 tps, lat 1.517 ms stddev 0.289, 0 failed
progress: 17.0 s, 641.0 tps, lat 1.558 ms stddev 0.281, 0 failed
progress: 18.0 s, 707.8 tps, lat 1.412 ms stddev 0.324, 0 failed
progress: 19.0 s, 746.3 tps, lat 1.341 ms stddev 0.219, 0 failed
progress: 20.0 s, 659.9 tps, lat 1.513 ms stddev 0.372, 0 failed
progress: 21.0 s, 651.8 tps, lat 1.533 ms stddev 0.372, 0 failed
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again
progress: 22.0 s, 635.2 tps, lat 1.574 ms stddev 0.519, 0 failed
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again
progress: 23.0 s, 730.0 tps, lat 1.369 ms stddev 0.408, 0 failed
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again
WARNING: cannot process the request at the moment
HINT: Another request is pending, try again
where monitoring.sql is as follows:
SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY random() LIMIT 1)
, false);
I have split the patch into 2 patches, with v12-0001* consisting of fixes
needed to allow using the MemoryContextStatsInternals for this feature, and
v12-0002* containing all the remaining changes for the feature.
A few outstanding issues are as follows:
1. Currently one DSA is created per backend when the first request for
statistics is made and remains for the lifetime of the server.
I think I should add logic to periodically destroy DSAs, when memory
context statistics are not being *actively* queried from the backend,
as determined by the statistics timestamp.
2. The two issues reported by Fujii-san here: [1].
i. I have proposed a fix for the first issue here [2].
ii. I am able to reproduce the second issue. This happens when we try
to query statistics of a backend running infinite_recurse.sql. While I am
still working on finding the root cause, I think it happens due to some memory
being overwritten due to a stack-depth violation, as the issue is not seen
when I reduce max_stack_depth to 100kB.
[1].
https://www.postgresql.org/message-id/a1a7e2b7-8f33-4313-baff-42e92ec14fd3%40oss.nttdata.com
[2].
https://www.postgresql.org/message-id/CAH2L28shr0j3JE5V3CXDFmDH-agTSnh2V8pR23X0UhRMbDQD9Q%40mail.gmail.com
Attachment | Content-Type | Size |
---|---|---|
v12-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 44.8 KB |
v12-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 4.3 KB |
From: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-02-10 12:02:56 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2025-02-03 21:47, Rahila Syed wrote:
> Hi,
>
>>>
>>> Just idea; as an another option, how about blocking new
>> requests to
>>> the target process (e.g., causing them to fail with an error
>> or
>>> returning NULL with a warning) if a previous request is still
>> pending?
>>> Users can simply retry the request if it fails. IMO failing
>> quickly
>>> seems preferable to getting stuck for a while in cases with
>> concurrent
>>> requests.
>>>
>>> Thank you for the suggestion. I agree that it is better to fail
>>> early and avoid waiting for a timeout in such cases. I will add a
>>> "pending request" tracker for this in shared memory. This approach
>>
>>> will help prevent sending a concurrent request if a request for
>> the
>>> same backend is still being processed.
>>>
>
> Please find attached a patch that adds a request_pending field in
> shared memory. This allows us to detect concurrent requests early
> and return a WARNING message immediately, avoiding unnecessary
> waiting and potential timeouts. This is added in v12-0002* patch.
Thanks for updating the patch!
The below comments would be a bit too detailed at this stage, but I’d
like to share the points I noticed.
> 76 + arguments: PID and a boolean, get_summary. The function
> can send
Since get_summary is a parameter, should we enclose it in <parameter>
tags, like <parameter>get_summary</parameter>?
> 387 + * The shared memory buffer has a limited size - it the process
> has too many
> 388 + * memory contexts,
Should 'it' be 'if'?
> 320 * By default, only superusers are allowed to signal to return the
> memory
> 321 * contexts because allowing any users to issue this request at an
> unbounded
> 322 * rate would cause lots of requests to be sent and which can lead
> to denial of
> 323 * service. Additional roles can be permitted with GRANT.
This comment seems to contradict the following code:
> 360 * Only superusers or users with pg_read_all_stats privileges
> can view the
> 361 * memory context statistics of another process
> 362 */
> 363 if (!has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
> 364 ereport(ERROR,
> 365 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> 366 errmsg("memory context statistics privilege
> error")));
> 485 + if (memCtxState[procNumber].memstats_dsa_handle ==
> DSA_HANDLE_INVALID)
> 486 + {
> 487 +
> 488 + LWLockRelease(&memCtxState[procNumber].lw_lock);
> 505 + else
> 506 + {
> 507 + LWLockRelease(&memCtxState[procNumber].lw_lock);
The LWLockRelease() function appears in both the if and else branches.
Can we move it outside the conditional block to avoid duplication?
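To illustrate the suggested refactoring (based only on the lines quoted above;
the branch bodies are placeholders):

if (memCtxState[procNumber].memstats_dsa_handle == DSA_HANDLE_INVALID)
{
    /* ... handle the case where no statistics have been published ... */
}
else
{
    /* ... read the published statistics ... */
}
LWLockRelease(&memCtxState[procNumber].lw_lock);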
> 486 + {
> 487 +
> 488 + LWLockRelease(&memCtxState[procNumber].lw_lock);
The blank line at 487 seems unnecessary. Should we remove it?
> 534 {
> 535 ereport(LOG,
> 536 (errmsg("Wait for %d process to publish stats
> timed out, trying again",
> 537 pid)));
> 538 if (num_retries > MAX_RETRIES)
> 539 goto end;
> 540 num_retries = num_retries + 1;
> 541 }
If the target process remains unresponsive, the logs will repeatedly
show:
LOG: Wait for xxxx process to publish stats timed out, trying again
LOG: Wait for xxxx process to publish stats timed out, trying again
...
LOG: Wait for xxxx process to publish stats timed out, trying again
However, the final log message is misleading because it does not
actually try again. Should we adjust the last log message to reflect the
correct behavior?
> 541 }
> 542
> 543 }
The blank line at 542 seems unnecessary. Should we remove it?
> 874 + context_id_lookup =
> hash_create("pg_get_remote_backend_memory_contexts",
Should 'pg_get_remote_backend_memory_contexts' be renamed to
'pg_get_process_memory_contexts' now?
> 899 + * Allocate memory in this process's dsa for storing statistics
> of the the
'the the' is a duplicate.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Andres Freund <andres(at)anarazel(dot)de>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-02-18 13:05:38 |
Message-ID: | CAH2L28sNHmj+aCJ=XkNb0an-XAs3eOUcU4orx2G9yviRL560fg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
>
> Thanks for updating the patch!
>
> The below comments would be a bit too detailed at this stage, but I’d
> like to share the points I noticed.
>
Thanks for sharing the detailed comments. I have incorporated some of them
into the new version of the patch. I will include the rest when I refine and
comment the code further.
Meanwhile, I have fixed the following outstanding issues:
> 1. Currently one DSA is created per backend when the first request for
> statistics is made and remains for the lifetime of the server.
> I think I should add logic to periodically destroy DSAs, when memory
> context statistics are not being *actively* queried from the backend,
> as determined by the statistics timestamp.
>
After an offline discussion with Andres and Tomas, I have fixed this to use
only one DSA for all the publishing backends/processes. Each backend allocates
smaller chunks of memory within the DSA while publishing statistics.
These chunks are tracked independently by each backend, ensuring that two
publishing backends/processes do not block each other despite using the same
DSA. This approach eliminates the overhead of creating multiple DSAs, one for
each backend.
I am not destroying the DSA area because it stores the previously published
statistics for each process. This allows the system to display older statistics
when the latest data cannot be retrieved within a reasonable time.
Only the most recently updated statistics are kept, while all earlier ones are
freed using dsa_free by each backend when they are no longer needed.
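A minimal sketch of what this per-backend bookkeeping within the single shared
DSA could look like (names such as MemoryStatsDsaArea, MemoryStatsEntry and
memstats_dsa_pointer are assumptions for illustration, not actual patch code):

/* executed by the publishing backend, under its own per-backend lock */
dsa_pointer newstats;

/* free the statistics this backend published previously, if any */
if (DsaPointerIsValid(memCtxState[MyProcNumber].memstats_dsa_pointer))
    dsa_free(MemoryStatsDsaArea,
             memCtxState[MyProcNumber].memstats_dsa_pointer);

/* allocate a fresh, per-backend chunk inside the DSA shared by everyone */
newstats = dsa_allocate0(MemoryStatsDsaArea,
                         nstats * sizeof(MemoryStatsEntry));
memCtxState[MyProcNumber].memstats_dsa_pointer = newstats;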
> 2. The two issues reported by Fujii-san here: [1].
> i. I have proposed a fix for the first issue here [2].
> ii. I am able to reproduce the second issue. This happens when we try
> to query statistics of a backend running infinite_recurse.sql. While I am
> working on finding a root-cause, I think it happens due to some memory
> being overwritten due to to stack-depth violation, as the issue is not
> seen
> when I reduce the max_stack_depth to 100kb.
> }
> }
>
The second issue is also resolved by using smaller allocations within a DSA.
Previously, it occurred because a few statically allocated strings were placed
within a single large chunk of DSA allocation. I have changed this to use
dynamically allocated chunks with dsa_allocate0 within the same DSA.
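For example, each context name can be copied into its own small DSA chunk
rather than into one large block (a sketch; memctx_info, clipped_ident and
idlen are taken from patch excerpts quoted elsewhere in the thread, the rest
is assumed):

/* one small allocation per string instead of offsets into one big chunk */
memctx_info[curr_id].name = dsa_allocate0(area, idlen + 1);
strlcpy((char *) dsa_get_address(area, memctx_info[curr_id].name),
        clipped_ident, idlen + 1);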
Please find attached updated and rebased patches.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v13-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 4.3 KB |
v13-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 50.3 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-02-20 13:26:49 |
Message-ID: | CAH2L28sMnNy9DvzAAoiE8qQs0MRX9ALhaYAf2f-aLivL47Ryhw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Please find attached the updated patches after some cleanup and test
fixes.
Thank you,
Rahila Syed
On Tue, Feb 18, 2025 at 6:35 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hi,
>
>>
>> Thanks for updating the patch!
>>
>> The below comments would be a bit too detailed at this stage, but I’d
>> like to share the points I noticed.
>>
>> Thanks for sharing the detailed comments. I have incorporated some of them
> into the new version of the patch. I will include the rest when I refine
> and
> comment the code further.
>
> Meanwhile, I have fixed the following outstanding issues:
>
> 1. Currently one DSA is created per backend when the first request for
>> statistics is made and remains for the lifetime of the server.
>> I think I should add logic to periodically destroy DSAs, when memory
>> context statistics are not being *actively* queried from the backend,
>> as determined by the statistics timestamp.
>>
>
> After an offline discussion with Andres and Tomas, I have fixed this to
> use
> only one DSA for all the publishing backends/processes. Each backend
> allocates smaller chunks of memory within the DSA while publishing
> statistics.
> These chunks are tracked independently by each backend, ensuring that two
> publishing backends/processes do not block each other despite using the
> same
> DSA. This approach eliminates the overhead of creating multiple DSAs,
> one for each backend.
>
> I am not destroying the DSA area because it stores the previously
> published
> statistics for each process. This allows the system to display older
> statistics
> when the latest data cannot be retrieved within a reasonable time.
> Only the most recently updated statistics are kept, while all earlier ones
> are freed using dsa_free by each backend when they are no longer needed.
> .
>
>> 2. The two issues reported by Fujii-san here: [1].
>> i. I have proposed a fix for the first issue here [2].
>> ii. I am able to reproduce the second issue. This happens when we try
>> to query statistics of a backend running infinite_recurse.sql. While I am
>> working on finding a root-cause, I think it happens due to some memory
>> being overwritten due to to stack-depth violation, as the issue is not
>> seen
>> when I reduce the max_stack_depth to 100kb.
>> }
>> }
>>
>
> The second issue is also resolved by using smaller allocations within a
> DSA.
> Previously, it occurred because a few statically allocated strings were
> placed
> within a single large chunk of DSA allocation. I have changed this to use
> dynamically allocated chunks with dsa_allocate0 within the same DSA.
>
> Please find attached updated and rebased patches.
>
> Thank you,
> Rahila Syed
>
Attachment | Content-Type | Size |
---|---|---|
v14-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 49.6 KB |
v14-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 4.3 KB |
From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-02-21 15:01:00 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2/20/25 14:26, Rahila Syed wrote:
> Hi,
>
> Please find attached the updated patches after some cleanup and test
> fixes.
>
> Thank you,
> Rahila Syed
>
> On Tue, Feb 18, 2025 at 6:35 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
> Hi,
>
>
> Thanks for updating the patch!
>
> The below comments would be a bit too detailed at this stage,
> but I’d
> like to share the points I noticed.
>
> Thanks for sharing the detailed comments. I have incorporated some
> of them
> into the new version of the patch. I will include the rest when I
> refine and
> comment the code further.
>
> Meanwhile, I have fixed the following outstanding issues:
>
> 1. Currently one DSA is created per backend when the first
> request for
> statistics is made and remains for the lifetime of the server.
> I think I should add logic to periodically destroy DSAs, when memory
> context statistics are not being *actively* queried from the
> backend,
> as determined by the statistics timestamp.
>
>
> After an offline discussion with Andres and Tomas, I have fixed this
> to use
> only one DSA for all the publishing backends/processes. Each backend
> allocates smaller chunks of memory within the DSA while publishing
> statistics.
> These chunks are tracked independently by each backend, ensuring
> that two
> publishing backends/processes do not block each other despite using
> the same
> DSA. This approach eliminates the overhead of creating multiple DSAs,
> one for each backend.
>
> I am not destroying the DSA area because it stores the previously
> published
> statistics for each process. This allows the system to display older
> statistics
> when the latest data cannot be retrieved within a reasonable time.
> Only the most recently updated statistics are kept, while all
> earlier ones
> are freed using dsa_free by each backend when they are no longer needed.
> .
I think something is not quite right, because if I try running a simple
pgbench script that does pg_get_process_memory_contexts() on PIDs of
random postgres processes (just like in the past), I immediately get this:
pgbench: error: client 28 script 0 aborted in command 0 query 0: ERROR:
can't attach the same segment more than once
pgbench: error: client 10 script 0 aborted in command 0 query 0: ERROR:
can't attach the same segment more than once
pgbench: error: client 5 script 0 aborted in command 0 query 0: ERROR:
can't attach the same segment more than once
pgbench: error: client 8 script 0 aborted in command 0 query 0: ERROR:
can't attach the same segment more than once
...
Perhaps the backends need to synchronize creation of the DSA?
>
> 2. The two issues reported by Fujii-san here: [1].
> i. I have proposed a fix for the first issue here [2].
> ii. I am able to reproduce the second issue. This happens when
> we try
> to query statistics of a backend running infinite_recurse.sql.
> While I am
> working on finding a root-cause, I think it happens due to some
> memory
> being overwritten due to to stack-depth violation, as the issue
> is not seen
> when I reduce the max_stack_depth to 100kb.
> }
> }
>
>
> The second issue is also resolved by using smaller allocations
> within a DSA.
> Previously, it occurred because a few statically allocated strings
> were placed
> within a single large chunk of DSA allocation. I have changed this
> to use
> dynamically allocated chunks with dsa_allocate0 within the same DSA.
>
Sounds good. Do you have any measurements of how much this reduced the size
of the entries written to the DSA? How many entries will fit into 1MB of
shared memory?
regards
--
Tomas Vondra
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Daniel Gustafsson <daniel(at)yesql(dot)se> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-02-24 12:46:45 |
Message-ID: | CAH2L28uayhv+AxgPLThexJ21NA8j7XFiYqu6rgsZSSNosvPjvg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> I think something is not quite right, because if I try running a simple
> pgbench script that does pg_get_process_memory_contexts() on PIDs of
> random postgres process (just like in the past), I immediately get this:
>
Thank you for testing. This issue occurs when a process that previously
attached to a DSA area for publishing its own context statistics tries to
attach to it again while querying statistics from another backend. Previously,
I was not detaching from the area at the end of publishing the statistics. I
have now changed it to detach from the area after the statistics are
published. The fix is included in the updated patch.
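Roughly, the publishing path now does something like the following (a sketch
only; the handle variable name is an assumption):

/* publishing backend: attach only for the duration of publishing */
dsa_area   *area = dsa_attach(memstats_dsa_handle);

/* ... copy this backend's memory context statistics into the area ... */

/*
 * Detach once done, so that a later dsa_attach() in this same process
 * (e.g. when it acts as a client querying another backend) does not fail
 * with "can't attach the same segment more than once".
 */
dsa_detach(area);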
> Perhaps the backends need to synchronize creation of the DSA?
>
This has been implemented in the patch.
> Sounds good. Do you have any measurements how much this reduced the size
> of the entries written to the DSA? How many entries will fit into 1MB of
> shared memory?
The size of the entries has approximately halved after dynamically allocating
the strings and a datum array.
Also, previously, I was allocating the entire memory for all contexts in one
large chunk from the DSA. I have now separated them into smaller allocations
per context. The integer counters are still allocated at once for all contexts,
but the size of such an allocated chunk will not exceed approximately
128 bytes * total_num_of_contexts.
The average total number of contexts is in the hundreds.
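As a rough back-of-the-envelope check (assuming about 500 contexts, which is on
the high end of "in the hundreds"): 500 * 128 bytes = 64,000 bytes, i.e. roughly
64 kB for the counters, which fits comfortably within a 1MB area even before
counting the separately allocated strings.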
PFA the updated and rebased patches.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v15-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 4.3 KB |
v15-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 47.8 KB |
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-02-28 15:42:37 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 24 Feb 2025, at 13:46, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> PFA the updated and rebased patches.
Thanks for the rebase, a few mostly superficial comments from a first
read-through. I'll do some more testing and playing around with it for
functional comments.
+ ...
+ child contexts' statistics, with num_agg_contexts indicating the number
+ of these aggregated child contexts.
The documentation refers to attributes in the return row but the format of that
row isn't displayed which makes following along hard. I think we should
include a table or a programlisting showing the return data before this
paragraph.
+const char *
+AssignContextType(NodeTag type)
This function doesn't actually assign anything so the name is a bit misleading,
it would be better with ContextTypeToString or something similar.
+ * By default, only superusers or users with PG_READ_ALL_STATS are allowed to
This sentence is really long and should probably be broken up.
+ * The shared memory buffer has a limited size - it the process has too many
s/it/if/
+ * If the publishing backend does not respond before the condition variable
+ * times out, which is set to MEMSTATS_WAIT_TIMEOUT, retry for max_tries
+ * number of times, which is defined by user, before giving up and
+ * returning previously published statistics, if any.
This comment should mention what happens if the process gives up and there is
no previously published stats.
+ int i;
...
+ for (i = 0; i < memCtxState[procNumber].total_stats; i++)
This can be rewritten as "for (int i = 0; .." since we allow C99.
+ * process running and consuming lots of memory, that it might end on its
+ * own first and its memory contexts are not logged is not a problem.
This comment is copy/pasted from pg_log_backend_memory_contexts and while it
mostly still applies, it should at least be rewritten to not refer to logging,
as this function doesn't do that.
+ ereport(WARNING,
+ (errmsg("PID %d is not a PostgreSQL server process",
No need to add the extra parenthesis around errmsg anymore, so I think new code
should omit those.
+ errhint("Use pg_backend_memory_contexts view instead")));
Super nitpick, but errhints should be complete sentences ending with a period.
+ * statitics have previously been published by the backend. In which case,
s/statitics/statistics/
+ * statitics have previously been published by the backend. In which case,
+ * check if statistics are not older than curr_timestamp, if they are wait
I think the sentence around the time check is needlessly confusing, could it be
rewritten into something like:
"A valid DSA pointer isn't proof that statistics are available, it can be
valid due to previously published stats. Check if the stats are updated by
comparing the timestamp, if the stats are newer than our previously
recorded timestamp from before sending the procsignal they must by
definition be updated."
+ /* Assert for dsa_handle to be valid */
Was this intended to be turned into an Assert call? Else it seems better to remove.
+ if (print_location != PRINT_STATS_NONE)
This warrants a comment stating why it makes sense.
+ * Do not print the statistics if print_to_stderr is PRINT_STATS_NONE,
s/print_to_stderr/print_location/. Also, do we really need print_to_stderr in
this function at all? It seems a bit awkward to combine a boolean and a
parameter for a tri-state value when the parameter holds the tri-state already.
For readability I think just checking print_location will be better since the
value will be clear, where print_to_stderr=false is less clear in a tri-state
scenario.
+ * its ancestors to a list, inorder to compute a path.
s/inorder/in order/
+ elog(LOG, "hash table corrupted, can't construct path value");
+ break;
This will return either a NIL list or a partial path, but PublishMemoryContext
doesn't really take into consideration that it might be so. Is this really
benign to the point that we can blindly go on? Also, elog(LOG..) is mostly for
tracing or debugging as elog() isn't intended for user facing errors.
+static void
+compute_num_of_contexts(List *contexts, HTAB *context_id_lookup,
+ int *stats_count, bool get_summary)
This function does a lot more than compute the number of contexts so the name seems
a bit misleading. Perhaps a rename to compute_contexts() or something similar?
+ memctx_info[curr_id].name = dsa_allocate0(area,
+ strlen(clipped_ident) + 1);
These calls can use idlen instead of more strlen() calls, no? While there is no
performance benefit, it would increase readability IMHO if the code first
calculates a value, and then uses it.
+ strncpy(name,
+ clipped_ident, strlen(clipped_ident));
Since clipped_ident has been nul terminated earlier there is no need to use
strncpy, we can instead use strlcpy and take the target buffer size into
consideration rather than the input string length.
PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
+ PROCSIG_GET_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
This comment should be different from the LOG_MEMORY_xx one.
+#define MEM_CONTEXT_SHMEM_STATS_SIZE 30
+#define MAX_TYPE_STRING_LENGTH 64
These are unused, from an earlier version of the patch perhaps?
+ * Singe DSA area is created and used by all the processes,
s/Singe/Since/
+typedef struct MemoryContextBackendState
This is only used in mcxtfuncs.c and can be moved there rather than being
exported in the header.
+} MemoryContextId;
This lacks an entry in the typedefs.list file.
--
Daniel Gustafsson
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-04 07:00:02 |
Message-ID: | CAH2L28s8Etbz2XM0xiH=RyRHAnEAxMD2AVpvcHyhHEHTbf-Uqg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Daniel,
> Thanks for the rebase, a few mostly superficial comments from a first
> read-through.
>
Thank you for your comments.
> The documentation refers to attributes in the return row but the format of
> that
> row isn't displayed which makes following along hard. I think we should
> include a table or a programlisting showing the return data before this
> paragraph.
>
I included the SQL function call and its output in programlisting format after
the function description.
Since the description was part of a table, I added this additional information
at the end of that table.
> +const char *
> +AssignContextType(NodeTag type)
> This function doesn't actually assign anything so the name is a bit
> misleading,
> it would be better with ContextTypeToString or something similar.
>
> Done.
>
> + * By default, only superusers or users with PG_READ_ALL_STATS are
> allowed to
> This sentence is really long and should probably be broken up.
>
> Fixed.
>
> + * The shared memory buffer has a limited size - it the process has too
> many
> s/it/if/
>
> Fixed.
> + * If the publishing backend does not respond before the condition
> variable
> + * times out, which is set to MEMSTATS_WAIT_TIMEOUT, retry for max_tries
> + * number of times, which is defined by user, before giving up and
> + * returning previously published statistics, if any.
> This comment should mention what happens if the process gives up and there
> is
> no previously published stats.
>
> Done.
>
> + int i;
> ...
> + for (i = 0; i < memCtxState[procNumber].total_stats; i++)
> This can be rewritten as "for (int i = 0; .." since we allow C99.
>
> Done.
>
> + * process running and consuming lots of memory, that it might end on
> its
> + * own first and its memory contexts are not logged is not a problem.
> This comment is copy/pasted from pg_log_backend_memory_contexts and while
> it
> mostly still apply it should at least be rewritten to not refer to logging
> as
> this function doesn't do that.
>
> Fixed.
>
> + ereport(WARNING,
> + (errmsg("PID %d is not a PostgreSQL server process",
> No need to add the extra parenthesis around errmsg anymore, so I think new
> code
> should omit those.
>
> Done.
>
> + errhint("Use pg_backend_memory_contexts view instead")));
> Super nitpick, but errhints should be complete sentences ending with a
> period.
>
> Done.
>
> + * statitics have previously been published by the backend. In which
> case,
> s/statitics/statistics/
>
> Fixed.
>
> + * statitics have previously been published by the backend. In which
> case,
> + * check if statistics are not older than curr_timestamp, if they are
> wait
> I think the sentence around the time check is needlessly confusing, could
> it be
> rewritten into something like:
> "A valid DSA pointer isn't proof that statistics are available, it can
> be
> valid due to previously published stats. Check if the stats are
> updated by
> comparing the timestamp, if the stats are newer than our previously
> recorded timestamp from before sending the procsignal they must by
> definition be updated."
>
> Replaced accordingly.
>
> + /* Assert for dsa_handle to be valid */
> Was this intended to be turned into an Assert call? Else it seems better
> to remove.
>
Added an assert and removed the comment.
> + if (print_location != PRINT_STATS_NONE)
> This warrants a comment stating why it makes sense.
>
> + * Do not print the statistics if print_to_stderr is PRINT_STATS_NONE,
> s/print_to_stderr/print_location/. Also, do we really need
> print_to_stderr in
> this function at all? It seems a bit awkward to combine a boolean and a
> paramter for a tri-state value when the parameter holds the tri_state
> already.
> For readability I think just checking print_location will be better since
> the
> value will be clear, where print_to_stderr=false is less clear in a
> tri-state
> scenario.
>
I removed the boolean print_to_stderr; the checks now use the tri-state enum
print_location.
> + * its ancestors to a list, inorder to compute a path.
> s/inorder/in order/
>
> Fixed.
>
> + elog(LOG, "hash table corrupted, can't construct path value");
> + break;
> This will return either a NIL list or a partial path, but
> PublishMemoryContext
> doesn't really take into consideration that it might be so. Is this really
> benign to the point that we can blindly go on? Also, elog(LOG..) is
> mostly for
> tracing or debugging as elog() isn't intended for user facing errors.
>
I agree that this should be addressed. I added a check for the path value
before storing it in shared memory. If the path is NIL, the path pointer in the
DSA will point to InvalidDsaPointer.
When a client encounters an InvalidDsaPointer, it will print NULL in the path
column. A partial path scenario is unlikely IMO, and I am not sure it warrants
additional checks.
> +static void
> +compute_num_of_contexts(List *contexts, HTAB *context_id_lookup,
> + int *stats_count, bool get_summary)
> This function does a lot than compute the number of contexts so the name
> seems
> a bit misleading. Perhaps a rename to compute_contexts() or something
> similar?
>
> Renamed to compute_contexts_count_and_ids.
>
> + memctx_info[curr_id].name = dsa_allocate0(area,
> + strlen(clipped_ident) + 1);
> These calls can use idlen instead of more strlen() calls no? While there
> is no
> performance benefit, it would increase readability IMHO if the code first
> calculates a value, and then use it.
>
> Done.
>
> + strncpy(name,
> + clipped_ident, strlen(clipped_ident));
> Since clipped_ident has been nul terminated earlier there is no need to use
> strncpy, we can instead use strlcpy and take the target buffer size into
> consideration rather than the input string length.
>
> Replaced with the strlcpy calls.
>
> PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts
> */
> + PROCSIG_GET_MEMORY_CONTEXT, /* ask backend to log the memory contexts
> */
> This comment should be different from the LOG_MEMORY_xx one.
>
> Fixed.
+#define MEM_CONTEXT_SHMEM_STATS_SIZE 30
> +#define MAX_TYPE_STRING_LENGTH 64
> These are unused, from an earlier version of the patch perhaps?
>
> Removed
+ * Singe DSA area is created and used by all the processes,
> s/Singe/Since/
>
Fixed.
+typedef struct MemoryContextBackendState
> This is only used in mcxtfuncs.c and can be moved there rather than being
> exported in the header.
>
This is being used in mcxt.c too in the form of the variable memCtxState.
>
+} MemoryContextId;
> This lacks an entry in the typedefs.list file.
>
> Added.
Please find attached the updated patches with the above fixes.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v16-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 52.8 KB |
v16-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 4.3 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-13 13:56:51 |
Message-ID: | CAH2L28vULzqoit+YCKR5UhdT+AR+b1Qcs6Hgpz6nQz6NBT2jug@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Please find attached updated and rebased patches, with the following changes:
1. To prevent memory leaks, ensure that the latest statistics published by a
process are freed before it exits. This can be achieved by calling dsa_free in
the before_shmem_exit callback (see the sketch after this list).
2. Add a level column to maintain consistency with the output of
pg_backend_memory_contexts.
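A minimal sketch of the cleanup in change 1, assuming field names similar to
those quoted earlier in the thread (memCtxState, memstats_dsa_pointer, lw_lock);
the callback name itself is made up for illustration:

static void
free_memorycontext_stats(int code, Datum arg)
{
    dsa_area   *area = (dsa_area *) DatumGetPointer(arg);

    LWLockAcquire(&memCtxState[MyProcNumber].lw_lock, LW_EXCLUSIVE);
    if (DsaPointerIsValid(memCtxState[MyProcNumber].memstats_dsa_pointer))
    {
        dsa_free(area, memCtxState[MyProcNumber].memstats_dsa_pointer);
        memCtxState[MyProcNumber].memstats_dsa_pointer = InvalidDsaPointer;
    }
    LWLockRelease(&memCtxState[MyProcNumber].lw_lock);
}

/* registered once, after the backend first attaches to the area */
before_shmem_exit(free_memorycontext_stats, PointerGetDatum(area));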
Thank you,
Rahila Syed
On Tue, Mar 4, 2025 at 12:30 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hi Daniel,
>
> Thanks for the rebase, a few mostly superficial comments from a first
>> read-through.
>>
> Thank you for your comments.
>
>
>> The documentation refers to attributes in the return row but the format
>> of that
>> row isn't displayed which makes following along hard. I think we should
>> include a table or a programlisting showing the return data before this
>> paragraph.
>>
>> I included the sql function call and its output in programlisting format
> after the
> function description.
> Since the description was part of a table, I added this additional
> information at the
> end of that table.
>
>
>> +const char *
>> +AssignContextType(NodeTag type)
>> This function doesn't actually assign anything so the name is a bit
>> misleading,
>> it would be better with ContextTypeToString or something similar.
>>
>> Done.
>
>
>>
>> + * By default, only superusers or users with PG_READ_ALL_STATS are
>> allowed to
>> This sentence is really long and should probably be broken up.
>>
>> Fixed.
>
>>
>> + * The shared memory buffer has a limited size - it the process has too
>> many
>> s/it/if/
>>
>> Fixed.
>
>
>> + * If the publishing backend does not respond before the condition
>> variable
>> + * times out, which is set to MEMSTATS_WAIT_TIMEOUT, retry for max_tries
>> + * number of times, which is defined by user, before giving up and
>> + * returning previously published statistics, if any.
>> This comment should mention what happens if the process gives up and
>> there is
>> no previously published stats.
>>
>> Done.
>
>
>>
>> + int i;
>> ...
>> + for (i = 0; i < memCtxState[procNumber].total_stats; i++)
>> This can be rewritten as "for (int i = 0; .." since we allow C99.
>>
>> Done.
>
>
>>
>> + * process running and consuming lots of memory, that it might end on
>> its
>> + * own first and its memory contexts are not logged is not a problem.
>> This comment is copy/pasted from pg_log_backend_memory_contexts and while
>> it
>> mostly still apply it should at least be rewritten to not refer to
>> logging as
>> this function doesn't do that.
>>
>> Fixed.
>
>
>>
>> + ereport(WARNING,
>> + (errmsg("PID %d is not a PostgreSQL server process",
>> No need to add the extra parenthesis around errmsg anymore, so I think
>> new code
>> should omit those.
>>
>> Done.
>
>
>>
>> + errhint("Use pg_backend_memory_contexts view instead")));
>> Super nitpick, but errhints should be complete sentences ending with a
>> period.
>>
>> Done.
>
>
>>
>> + * statitics have previously been published by the backend. In which
>> case,
>> s/statitics/statistics/
>>
>> Fixed.
>
>
>>
>> + * statitics have previously been published by the backend. In which
>> case,
>> + * check if statistics are not older than curr_timestamp, if they are
>> wait
>> I think the sentence around the time check is needlessly confusing, could
>> it be
>> rewritten into something like:
>> "A valid DSA pointer isn't proof that statistics are available, it
>> can be
>> valid due to previously published stats. Check if the stats are
>> updated by
>> comparing the timestamp, if the stats are newer than our previously
>> recorded timestamp from before sending the procsignal they must by
>> definition be updated."
>>
>> Replaced accordingly.
>
>
>>
>> + /* Assert for dsa_handle to be valid */
>> Was this intended to be turned into an Assert call? Else it seems better
>> to remove.
>>
>
> Added an assert and removed the comment.
>
>
>> + if (print_location != PRINT_STATS_NONE)
>> This warrants a comment stating why it makes sense.
>>
>
>> + * Do not print the statistics if print_to_stderr is PRINT_STATS_NONE,
>> s/print_to_stderr/print_location/. Also, do we really need
>> print_to_stderr in
>> this function at all? It seems a bit awkward to combine a boolean and a
>> paramter for a tri-state value when the parameter holds the tri_state
>> already.
>> For readability I think just checking print_location will be better since
>> the
>> value will be clear, where print_to_stderr=false is less clear in a
>> tri-state
>> scenario.
>>
>> I removed the boolean print_to_stderr, the checks are now using
> the tri-state enum-print_location.
>
>
>> + * its ancestors to a list, inorder to compute a path.
>> s/inorder/in order/
>>
>> Fixed.
>
>
>>
>> + elog(LOG, "hash table corrupted, can't construct path value");
>> + break;
>> This will return either a NIL list or a partial path, but
>> PublishMemoryContext
>> doesn't really take into consideration that it might be so. Is this
>> really
>> benign to the point that we can blindly go on? Also, elog(LOG..) is
>> mostly for
>> tracing or debugging as elog() isn't intended for user facing errors.
>>
>> I agree that this should be addressed. I added a check for path value
> before
> storing it in shared memory. If the path is NIL, the path pointer in DSA
> will point
> to InvalidDsaPointer.
> When a client encounters an InvalidDsaPointer it will print NULL in the
> path column.
> Partial path scenario is unlikely IMO, and I am not sure if it warrants
> additional
> checks.
>
>
>> +static void
>> +compute_num_of_contexts(List *contexts, HTAB *context_id_lookup,
>> + int *stats_count, bool get_summary)
>> This function does a lot than compute the number of contexts so the name
>> seems
>> a bit misleading. Perhaps a rename to compute_contexts() or something
>> similar?
>>
>> Renamed to compute_contexts_count_and_ids.
>
>
>>
>> + memctx_info[curr_id].name = dsa_allocate0(area,
>> + strlen(clipped_ident) + 1);
>> These calls can use idlen instead of more strlen() calls no? While there
>> is no
>> performance benefit, it would increase readability IMHO if the code first
>> calculates a value, and then use it.
>>
>> Done.
>
>
>>
>> + strncpy(name,
>> + clipped_ident, strlen(clipped_ident));
>> Since clipped_ident has been nul terminated earlier there is no need to
>> use
>> strncpy, we can instead use strlcpy and take the target buffer size into
>> consideration rather than the input string length.
>>
>> Replaced with the strlcpy calls.
>
>
>>
>> PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts
>> */
>> + PROCSIG_GET_MEMORY_CONTEXT, /* ask backend to log the memory contexts
>> */
>> This comment should be different from the LOG_MEMORY_xx one.
>>
>> Fixed.
>
> +#define MEM_CONTEXT_SHMEM_STATS_SIZE 30
>> +#define MAX_TYPE_STRING_LENGTH 64
>> These are unused, from an earlier version of the patch perhaps?
>>
>> Removed
>
> + * Singe DSA area is created and used by all the processes,
>> s/Singe/Since/
>>
>
> Fixed.
>
> +typedef struct MemoryContextBackendState
>> This is only used in mcxtfuncs.c and can be moved there rather than being
>> exported in the header.
>>
>
> This is being used in mcxt.c too in the form of the variable memCtxState.
>
>
>>
>
> +} MemoryContextId;
>> This lacks an entry in the typedefs.list file.
>>
>> Added.
>
> Please find attached the updated patches with the above fixes.
>
> Thank you,
> Rahila Syed
>
Attachment | Content-Type | Size |
---|---|---|
v17-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 4.3 KB |
v17-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 54.5 KB |
From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-15 08:40:39 |
Message-ID: | CAPpHfdu1pxssAYAUkpQJBvgENCfVZetFL3v1txrWDUiKV80hJw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi, Rahila!
On Thu, Mar 13, 2025 at 3:57 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
> Please find attached updated and rebased patches. It has the following changes
>
> 1. To prevent memory leaks, ensure that the latest statistics published by a process
> are freed before it exits. This can be achieved by calling dsa_free in the
> before_shmem_exit callback.
> 2. Add a level column to maintain consistency with the output of
> pg_backend_memory_contexts.
Thank you for your work on this subject.
v17-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
It looks like we're increasing *num_contexts twice per child memory
context. First, it gets increased with a recursive
MemoryContextStatsInternal() call, then by adding an ichild. I might
be wrong, but I think these calculations at least deserve more
comments.
v17-0002-Function-to-report-memory-context-statistics.patch
+ if (procNumber == MyProcNumber)
+ {
+ ereport(WARNING,
+ errmsg("cannot return statistics for local backend"),
+ errhint("Use pg_backend_memory_contexts view instead."));
+ PG_RETURN_NULL();
+ }
Is it worth it to keep this restriction? Can we fetch info about
local memory context for the sake of generality?
I know there have been discussions in the thread before, but the
mechanism of publishing memory context stats via DSA looks quite
complicated. Also, the user probably intends to inspect memory
contexts when there is not a huge amount of free memory. So, failure
is probable on DSA allocation. Could we do simpler? For instance,
allocate some amount of static shared memory and use it as a message
queue between processes. As a heavy load is not supposed to be here,
I think one queue would be enough.
------
Regards,
Alexander Korotkov
Supabase
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-17 07:52:41 |
Message-ID: | CAH2L28t1bZ6CxfHHVJfTfH62XL25abXoNhd8B4R2C78QfbQt+A@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Alexander,
Thank you for the review.
> It looks like we're increasing *num_contexts twice per child memory
> context. First, it gets increased with a recursive
> MemoryContextStatsInternal() call, then by adding an ichild. I might
> be wrong, but I think these calculations at least deserve more
> comments.
>
I believe that's not the case. The recursive calls only work for children
encountered up to max_level and less than max_children per context.
The rest of the children are handled using MemoryContextTraverseNext,
without recursive calls. Thus, num_contexts is incremented for those
children separately from the recursive call counter.
I will add more comments around this.
> v17-0002-Function-to-report-memory-context-statistics.patch
>
> + if (procNumber == MyProcNumber)
> + {
> + ereport(WARNING,
> + errmsg("cannot return statistics for local backend"),
> + errhint("Use pg_backend_memory_contexts view instead."));
> + PG_RETURN_NULL();
> + }
>
> Is it worth it to keep this restriction? Can we fetch info about
> local memory context for the sake of generality?
>
>
I think that could be done, but using pg_backend_memory_contexts would
be more efficient in this case.
> I know there have been discussions in the thread before, but the
> mechanism of publishing memory context stats via DSA looks quite
> complicated. Also, the user probably intends to inspect memory
> contexts when there is not a huge amount of free memory. So, failure
> is probable on DSA allocation. Could we do simpler? For instance,
> allocate some amount of static shared memory and use it as a message
> queue between processes. As a heavy load is not supposed to be here,
> I think one queue would be enough.
>
>
There could be other uses for such a function, such as a monitoring
dashboard that periodically queries all running backends for memory
statistics. If we use a single queue shared between all the backends,
they will need to wait for the queue to become available before sharing
their statistics, leading to processing delays at the publishing backend.

Even with separate queues for each backend, or without expecting
concurrent use, publishing statistics could be delayed if a message
queue is full. This is because a backend needs to wait for a client
process to consume the existing messages or statistics before publishing
more. If a client process exits without consuming its messages, the
publishing backend will experience timeouts when trying to publish
stats. This will impact backend performance, as statistics are published
during CHECK_FOR_INTERRUPTS. In the current implementation, the backend
publishes all the statistics in one go, without waiting for clients to
read any of them.

In addition, allocating complete message queues in static shared memory
can be expensive, especially since these static structures would need to
be created even if memory context statistics are never queried. By
contrast, a DSA is created for this feature only when statistics are
first queried. We are not preallocating shared memory for this feature,
except for small structures to store the dsa_handle and dsa_pointers
for each backend.
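As a rough illustration of those "small structures", the preallocated per-backend state can be pictured as below; the struct and field names are assumptions for the sketch, and the actual patch may differ.
```
/*
 * Hedged sketch: only a handle to the lazily created DSA area plus, per
 * process, a pointer to that process's latest published statistics live
 * in static shared memory.
 */
typedef struct MemStatsSharedArea
{
	LWLock		lock;				/* protects the handle below */
	dsa_handle	stats_dsa_handle;	/* DSA area created on first use */
} MemStatsSharedArea;

typedef struct MemStatsBackendSlot
{
	LWLock		lock;				/* protects this slot */
	pid_t		proc_id;			/* publisher of the stats */
	dsa_pointer	stats_dsa_pointer;	/* latest published statistics */
	TimestampTz	stats_timestamp;	/* when they were published */
	int			total_stats;		/* number of records published */
} MemStatsBackendSlot;
```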
Thank you,
Rahila Syed
From: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-17 08:36:50 |
Message-ID: | CAExHW5skNvp265od6XPs0O-RL3cwtgHW3N87Ob0+nByZ=_HzAA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, Mar 17, 2025 at 1:23 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
>>
>> v17-0002-Function-to-report-memory-context-statistics.patch
>>
>> + if (procNumber == MyProcNumber)
>> + {
>> + ereport(WARNING,
>> + errmsg("cannot return statistics for local backend"),
>> + errhint("Use pg_backend_memory_contexts view instead."));
>> + PG_RETURN_NULL();
>> + }
>>
>> Is it worth it to keep this restriction? Can we fetch info about
>> local memory context for the sake of generality?
>>
>
> I think that could be done, but using pg_backend_memory_context would
> be more efficient in this case.
>
I have raised a similar concern before. Having two separate functions
one for local backend and other for remote is going to be confusing.
We should have one function doing both and renamed appropriately.
--
Best Wishes,
Ashutosh Bapat
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-20 07:39:45 |
Message-ID: | CAH2L28sMyRh_ZomRxkx_RdaQoLcyGAwKCr1TSmrVudbbR_Q1eQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
>>
> >> + if (procNumber == MyProcNumber)
> >> + {
> >> + ereport(WARNING,
> >> + errmsg("cannot return statistics for local backend"),
> >> + errhint("Use pg_backend_memory_contexts view
> instead."));
> >> + PG_RETURN_NULL();
> >> + }
> >>
> >> Is it worth it to keep this restriction? Can we fetch info about
> >> local memory context for the sake of generality?
> >>
> >
> > I think that could be done, but using pg_backend_memory_context would
> > be more efficient in this case.
> >
>
> I have raised a similar concern before. Having two separate functions
> one for local backend and other for remote is going to be confusing.
> We should have one function doing both and renamed appropriately.
>
>
This is a separate concern from the one raised by Alexander. He
suggested removing the restriction so that local backend statistics can
also be fetched with the proposed function. I've removed this
restriction in the latest version of the patch; the proposed function
can now be used to fetch local backend statistics too.

Regarding your suggestion to merge these functions: although they both
report memory context statistics, they differ in how they fetch them
(locally versus from dynamic shared memory). Additionally, the function
signatures are different: the proposed function takes three arguments
(pid, get_summary, and num_tries), whereas
pg_get_backend_memory_contexts does not take any arguments. Therefore,
I believe these functions can remain separate as long as we document
their usage correctly.
Please find attached rebased and updated patches. I have added some more
comments and fixed an issue caused by registering the before_shmem_exit
callback from the interrupt processing routine. To address this, I now
register the callback in the InitProcess() function.

The issue arose because interrupt processing can be triggered from a
PG_ENSURE_ERROR_CLEANUP block. That block operates under the assumption
that the before_shmem_exit callback registered at its beginning will
still be the last one in the registered callback list at the end of the
block, which would not be the case if a before_shmem_exit callback were
registered in the interrupt handling routine.
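A minimal sketch of the fix, assuming the callback name used in the posted patches (AtProcExit_memstats_dsa_free); the exact placement inside InitProcess() is illustrative.
```
/*
 * In InitProcess(), register the cleanup callback once at process
 * startup rather than from within the interrupt handler, so that
 * PG_ENSURE_ERROR_CLEANUP blocks keep their assumption about the
 * ordering of before_shmem_exit callbacks.
 */
before_shmem_exit(AtProcExit_memstats_dsa_free, 0);
```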
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v18-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 54.3 KB |
v18-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 4.3 KB |
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-25 14:14:08 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 20 Mar 2025, at 08:39, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
Thanks for the new version, I believe this will be a welcome tool in the
debugging toolbox.
I took a cleanup pass over the docs with, among others, the following changes:
* You had broken the text in paragraphs, but without <para/> tags they are
rendered as a single blob of text so added that.
* Removed the "(PID)" explanation as process id is used elsewhere on the same
page already without explanation.
* Added <productname/> markup on PostgreSQL
* Added <literal/> markup on parameter values
* Switched the example query output to use \x
* Added a note on when pg_backend_memory_contexts is a better choice
The paragraphs need some re-indenting to avoid too long lines, but I opted out
of doing so here to make reviewing the changes easier.
A few comments on the code (all of them are addressed in the attached 0003,
which also has smaller cleanups wrt indentation, code style etc):
+#include <math.h>
I don't think we need this, maybe it was from an earlier version of the patch?
+MEM_CTX_PUBLISH "Waiting for backend to publish memory information."
I wonder if this should really be "process" and not backend?
+ default:
+ context_type = "???";
+ break;
In ContextTypeToString() I'm having second thoughts about this; there shouldn't
be any legitimate use-case of passing this function a nodetag which would fail
MemoryContextIsValid(). I wonder if we aren't helping callers more by erroring
out rather than silently returning an unknown? I haven't changed this, but
maybe we should, to set the API right from the start?
+ * if the process has more memory contexts than that can fit in the allocated
s/than that can/than what can/?
+ errmsg("memory context statistics privilege error"));
Similar checks elsewhere in the tree mostly use "permission denied to .." so I
think we should adopt that here as well.
+ LWLockAcquire(&memCtxState[procNumber].lw_lock, LW_EXCLUSIVE);
+ msecs =
+ TimestampDifferenceMilliseconds(curr_timestamp,
+ memCtxState[procNumber].stats_timestamp);
Since we only want to consider the stats if they are from the current process,
we can delay checking the time difference until after we've checked the pid and
thus reduce the amount of time we hold the lock in the error case.
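A hedged sketch of that reordering (variable names follow the quoted patch where possible; the rest, such as proc_id and pid, are assumptions):
```
/* Check ownership of the published stats before looking at their age. */
LWLockAcquire(&memCtxState[procNumber].lw_lock, LW_EXCLUSIVE);
if (memCtxState[procNumber].proc_id == pid)
{
	long		msecs;

	/* Only now compute how old the published statistics are. */
	msecs = TimestampDifferenceMilliseconds(curr_timestamp,
											memCtxState[procNumber].stats_timestamp);
	/* ... decide whether the statistics are recent enough ... */
}
LWLockRelease(&memCtxState[procNumber].lw_lock);
```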
+ /*
+ * Recheck the state of the backend before sleeping on the condition
+ * variable
+ */
+ proc = BackendPidGetProc(pid);
Here we are really rechecking that the process is still alive, but I wonder if
we should take the opportunity to ensure that the type is what we expect it to
be? If the pid has moved from being a backend to an aux proc or vice versa we
really don't want to go on.
+ ereport(WARNING,
+ errmsg("PID %d is not a PostgreSQL server process",
+ pid));
I wonder if we should differentiate between the warnings? When we hit this in
the loop the errmsg is describing a slightly different case. I did leave it
for now, but it's food for thought if we should perhaps reword this one.
+ ereport(LOG,
+ errmsg("Wait for %d process to publish stats timed out, trying again",
+ pid));
This should probably be DEBUG1; in a congested cluster it might cause a fair
bit of logging which isn't really helping the user. Also, nitpick: errmsg
should start with a lowercase letter.
+static Size
+MemCtxShmemSize(void)
We don't really need this function anymore, and by keeping it separate we risk
it going out of sync with the matching calculation in MemCtxBackendShmemInit,
so I think we should condense them into one.
else
{
+ Assert(print_location == PRINT_STATS_NONE);
Rather than an if-then-else and an assert we can use a switch statement without
a default; this way we'll automatically get a warning if a value is missed.
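For example, in the patch's terms (a sketch with the bodies elided):
```
/*
 * Illustrative sketch of the suggestion: a switch with no default arm
 * over the PrintDestination values, so a newly added value triggers a
 * compiler warning about an unhandled case.
 */
switch (print_location)
{
	case PRINT_STATS_TO_STDERR:
	case PRINT_STATS_TO_LOGS:
		/* print the statistics, as before */
		break;

	case PRINT_STATS_NONE:
		/* only accumulate totals; nothing is printed */
		break;
}
```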
+ ereport(LOG,
+ errmsg("hash table corrupted, can't construct path value"));
I know you switched from elog(LOG.. to ereport(LOG.. but I still think a LOG
entry stating corruption isn't helpful, it's not actionable for the user.
Given that it's a case that shouldn't happen I wonder if we should downgrade it
to an Assert(false) and potentially a DEBUG1?
--
Daniel Gustafsson
Attachment | Content-Type | Size |
---|---|---|
v19-0001-Preparatory-changes-for-reporting-memory-context.patch | application/octet-stream | 4.3 KB |
v19-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 54.3 KB |
v19-0003-Review-comments-and-fixups.patch | application/octet-stream | 18.2 KB |
unknown_filename | text/plain | 2 bytes |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-03-26 10:34:17 |
Message-ID: | CAH2L28v9EU4dxKUpvMt_CFAzG72CYusPWxADsgT=cwFJP-fP0A@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Daniel,
Thank you for your review.
I have incorporated all your changes into the v20 patches and ensured that the
review comments corresponding to the 0001 patch are included in that patch and
not in 0002.
>
> +MEM_CTX_PUBLISH "Waiting for backend to publish memory information."
> I wonder if this should really be "process" and not backend?
>
Fixed.
>
> + default:
> + context_type = "???";
> + break;
> In ContextTypeToString() I'm having second thoughts about this, there
> shouldn't
> be any legitimate use-case of passing a nodetag this function which would
> fail
> MemoryContextIsValid(). I wonder if we aren't helping callers more by
> erroring
> out rather than silently returning an unknown? I haven't changed this but
> maybe we should to set the API right from the start?
>
I cannot think of any legitimate scenario where the context type would be
unknown. However, if we were to throw an error, it would prevent us from
reporting any memory usage information when the context type is
unidentified. Perhaps it would be more informative and less restrictive to
label it as "Unrecognized" or "Unknown". I wonder if this was the reasoning
behind doing it this way when it was added with the
pg_backend_memory_contexts() function.
>
> + /*
> + * Recheck the state of the backend before sleeping on the
> condition
> + * variable
> + */
> + proc = BackendPidGetProc(pid);
> Here we are really rechecking that the process is still alive, but I
> wonder if
> we should take the opportunity to ensure that the type is what we expect
> it to
> be? If the pid has moved from being a backend to an aux proc or vice
> versa we
> really don't want to go on.
>
>
The reasoning makes sense to me. For periodic monitoring of all processes,
any PID that reincarnates into a different type could be queried in
subsequent function calls. Regarding targeted monitoring of a specific
process, such a reincarnated process would exhibit a completely different
memory consumption, likely not aligning with the user's original intent
behind requesting the statistics.
>
> + ereport(WARNING,
> + errmsg("PID %d is not a PostgreSQL server process",
> + pid));
> I wonder if we should differentiate between the warnings? When we hit
> this in
> the loop the errmsg is describing a slightly different case. I did leave
> it
> for now, but it's food for thought if we should perhaps reword this one.
>
>
Changed it to "PID %d is no longer the same PostgreSQL server process".
>
> + ereport(LOG,
> + errmsg("hash table corrupted, can't construct path
> value"));
> I know you switched from elog(LOG.. to ereport(LOG.. but I still think a
> LOG
> entry stating corruption isn't helpful, it's not actionable for the user.
> Given that it's a case that shouldn't happen I wonder if we should
> downgrade it
> to an Assert(false) and potentially a DEBUG1?
>
How about changing it to ERROR, in accordance with the current occurrences of
the same message? I did that in the attached version; however, I am open to
changing it to an Assert(false) and DEBUG1.
Apart from the above, I made the following improvements.
1. Eliminated the unnecessary creation of an extra memory context before
calling hash_create. The hash_create function already generates a memory
context containing the hash table, enabling easy memory deallocation by
simply deleting the context via hash_destroy. Therefore, the patch relies on
hash_destroy for memory management instead of manual freeing.
2. Optimized memory usage by storing the path as an array of integers rather
than as an array of Datums. This approach conserves the DSA memory allocated
for storing this information (a brief sketch follows below).
3. Miscellaneous comment cleanups, and introduced macros to simplify
calculations.
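Sketch referenced in item 2 above, loosely following the patch; names such as area and path come from the surrounding code, and the fragment is illustrative rather than the exact committed form.
```
/*
 * Store the ancestor path as a plain int array in DSA instead of Datums;
 * "path" is a List of integer context ids.
 */
{
	int			levels = list_length(path);
	dsa_pointer	path_ptr = dsa_allocate(area, levels * sizeof(int));
	int		   *path_list = (int *) dsa_get_address(area, path_ptr);

	foreach_int(i, path)
		path_list[foreach_current_index(i)] = i;
}
```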
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v20-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 52.0 KB |
v20-0001-Preparatory-changes-for-reporting-memory-context-sta.patch | application/octet-stream | 5.1 KB |
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-02 21:44:25 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 26 Mar 2025, at 11:34, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> + ereport(LOG,
> + errmsg("hash table corrupted, can't construct path value"));
> I know you switched from elog(LOG.. to ereport(LOG.. but I still think a LOG
> entry stating corruption isn't helpful, it's not actionable for the user.
> Given that it's a case that shouldn't happen I wonder if we should downgrade it
> to an Assert(false) and potentially a DEBUG1?
>
> How about changing it to ERROR, in accordance with current occurrences of the
> same message? I did it in the attached version, however I am open to changing
> it to an Assert(false) and DEBUG1.
In the attached I moved it to an elog() as it's an internal error, and spending
translation effort on it seems fruitless.
> 1. Eliminated the unnecessary creation of an extra memory context before calling hash_create.
> The hash_create function already generates a memory context containing the hash table,
> enabling easy memory deallocation by simply deleting the context via hash_destroy.
> Therefore, the patch relies on hash_destroy for memory management instead of manual freeing.
Nice
> 2. Optimized memory usage by storing the path as an array of integers rather than as an array of
> Datums.
> This approach conserves DSA memory allocated for storing this information.
Ah yes, much better.
The attached v21 has a few improvements:
* The function documentation didn't specify the return type, only the fact that
it's setof record. I've added all output columns.
* Some general cleanups of the docs with better markup, improved xref linking
and various rewording.
* Comment cleanups and language alignment
* Added a missing_ok parameter to ContextTypeToString(). While all callers are
fine with unknown context types, if we introduce an API for this it seems
prudent to not place that burden on callers but to take it on in the function.
* Renamed get_summary to just summary, and num_of_tries to retries which feels
more in line with the naming convention in other functions
* Deferred calling InitMaterializedSRF() until after the PID has been checked
for validity.
* Pulled back the timeout to 500msec from 1 second. In running congested
pgbench simulations I saw better performance and improved results in getting stats.
* Replaced strncpy with strlcpy and consistently used idlen to keep all length
calculations equal.
* Fixed misspelled param name in pg_proc.dat
* Pulled back maximum memory usage from 8MB to 1MB. 8MB for the duration of a
process (once allocated) is a lot for a niche feature, and while I'm still not
sure 1MB is the right value, I think from experimentation that it's closer.
I think this version is close to a committable state, will spend a little more
time testing, polishing and rewriting the commit message. I will also play
around with placement within the memory context code files to keep it from
making backpatch issues.
--
Daniel Gustafsson
Attachment | Content-Type | Size |
---|---|---|
v21-0002-Function-to-report-memory-context-statistics.patch | application/octet-stream | 54.1 KB |
v21-0001-Preparatory-changes-for-reporting-memory-context.patch | application/octet-stream | 5.1 KB |
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-05 19:29:21 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 2 Apr 2025, at 23:44, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
> I think this version is close to a committable state, will spend a little more
> time testing, polishing and rewriting the commit message. I will also play
> around with placement within the memory context code files to keep it from
> making backpatch issues.
After a bit more polish I landed with the attached, which I most likely will go
ahead with after another round in CI.
--
Daniel Gustafsson
Attachment | Content-Type | Size |
---|---|---|
v23-0001-Add-function-to-get-memory-context-stats-for-pro.patch | application/octet-stream | 58.1 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-07 06:51:43 |
Message-ID: | CAH2L28s+s4JJdPz2RkALzOvXpxXUcmm=fvg-Y6M82g9Cp=bB-w@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Daniel,
>
> After a bit more polish I landed with the attached, which I most likely
> will go
> ahead with after another round in CI.
>
Thank you for refining the code. The changes look good to me.
Regression tests ran smoothly in parallel with the memory monitoring
function, and pgbench results with the following custom script also show
good performance.
```
SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
ORDER BY random() LIMIT 1)
, false, 5);
```
Thank you,
Rahila Syed
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-07 13:41:37 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Following up on some off-list comments, attached is a v26 with a few small last
changes:
* Improved documentation (docs and comments)
* Fixed up Shmem sizing and init
* Delayed registering the shmem cleanup callback so that it runs earlier during cleanup
* Renamed a few datastructures to improve readability
* Various bits of polish
I think this function can be a valuable debugging aid going forward.
--
Daniel Gustafsson
Attachment | Content-Type | Size |
---|---|---|
v26-0001-Add-function-to-get-memory-context-stats-for-pro.patch | application/octet-stream | 63.0 KB |
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-07 15:43:51 |
Message-ID: | 5bxhxniyvjyfldi7yjxcnxkl3i2ghci2grjyeclrbkfqnyhowk@dfkqzbvtbpml |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2025-04-07 15:41:37 +0200, Daniel Gustafsson wrote:
> I think this function can be a valuable debugging aid going forward.
What I am most excited about for this is to be able to measure server-wide and
fleet-wide memory usage over time. Today I have actually very little idea
about what memory is being used for across all connections, not to speak of a
larger number of servers.
> diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
> index 4f6795f7265..d3b4df27935 100644
> --- a/src/backend/postmaster/auxprocess.c
> +++ b/src/backend/postmaster/auxprocess.c
> @@ -84,6 +84,13 @@ AuxiliaryProcessMainCommon(void)
> /* register a before-shutdown callback for LWLock cleanup */
> before_shmem_exit(ShutdownAuxiliaryProcess, 0);
>
> + /*
> + * The before shmem exit callback frees the DSA memory occupied by the
> + * latest memory context statistics that could be published by this aux
> + * proc if requested.
> + */
> + before_shmem_exit(AtProcExit_memstats_dsa_free, 0);
> +
> SetProcessingMode(NormalProcessing);
> }
How about putting it into BaseInit()? Or maybe just register it when its
first used?
> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> index fda91ffd1ce..d3cb3f1891c 100644
> --- a/src/backend/postmaster/checkpointer.c
> +++ b/src/backend/postmaster/checkpointer.c
> @@ -663,6 +663,10 @@ ProcessCheckpointerInterrupts(void)
> /* Perform logging of memory contexts of this process */
> if (LogMemoryContextPending)
> ProcessLogMemoryContextInterrupt();
> +
> + /* Publish memory contexts of this process */
> + if (PublishMemoryContextPending)
> + ProcessGetMemoryContextInterrupt();
> }
>
> /*
Not this patch's responsibility, but we really desperately need to unify our
interrupt handling. Manually keeping a ~dozen functions similar, but not
exactly the same, is an insane approach.
> --- a/src/backend/utils/activity/wait_event_names.txt
> +++ b/src/backend/utils/activity/wait_event_names.txt
> @@ -161,6 +161,7 @@ WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
> WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
> WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
> XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at transaction end."
> +MEM_CTX_PUBLISH "Waiting for a process to publish memory information."
The memory context stuff abbreviates as cxt, not ctx. There are a few more
cases of that in the patch.
> +const char *
> +ContextTypeToString(NodeTag type)
> +{
> + const char *context_type;
> +
> + switch (type)
> + {
> + case T_AllocSetContext:
> + context_type = "AllocSet";
> + break;
> + case T_GenerationContext:
> + context_type = "Generation";
> + break;
> + case T_SlabContext:
> + context_type = "Slab";
> + break;
> + case T_BumpContext:
> + context_type = "Bump";
> + break;
> + default:
> + context_type = "???";
> + break;
> + }
> + return (context_type);
Why these parens?
> + * If the publishing backend does not respond before the condition variable
> + * times out, which is set to MEMSTATS_WAIT_TIMEOUT, retry given that there is
> + * time left within the timeout specified by the user, before giving up and
> + * returning previously published statistics, if any. If no previous statistics
> + * exist, return NULL.
Why do we need to repeatedly wake up rather than just sleeping with the
"remaining" amount of time based on the time the function was called and the
time that has passed since?
> + /*
> + * A valid DSA pointer isn't proof that statistics are available, it can
> + * be valid due to previously published stats.
Somehow "valid DSA pointer" is a bit too much about the precise mechanics and
not enough about what's actually happening. I'd rather say something like
"Even if the proc has published statistics, they may not be due to the current
request, but previously published stats."
> + if (ConditionVariableTimedSleep(&memCtxState[procNumber].memctx_cv,
> + MEMSTATS_WAIT_TIMEOUT,
> + WAIT_EVENT_MEM_CTX_PUBLISH))
> + {
> + timer += MEMSTATS_WAIT_TIMEOUT;
> +
> + /*
> + * Wait for the timeout as defined by the user. If no updated
> + * statistics are available within the allowed time then display
> + * previously published statistics if there are any. If no
> + * previous statistics are available then return NULL. The timer
> + * is defined in milliseconds since thats what the condition
> + * variable sleep uses.
> + */
> + if ((timer * 1000) >= timeout)
> + {
I'd suggest just comparing how much time has elapsed since the timestamp
you've requested earlier.
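Something along these lines, perhaps; this is a sketch with assumed variable names, the user-supplied timeout is treated as milliseconds here, and the ConditionVariablePrepareToSleep/CancelSleep bookkeeping is omitted.
```
TimestampTz	start_ts = GetCurrentTimestamp();

for (;;)
{
	long	elapsed = TimestampDifferenceMilliseconds(start_ts,
													  GetCurrentTimestamp());
	long	remaining = timeout - elapsed;

	if (remaining <= 0)
		break;					/* give up, fall back to previous stats */

	if (!ConditionVariableTimedSleep(&memCtxState[procNumber].memctx_cv,
									 remaining,
									 WAIT_EVENT_MEM_CTX_PUBLISH))
		break;					/* woken up: re-check the published stats */
}
```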
> + LWLockAcquire(&memCtxState[procNumber].lw_lock, LW_EXCLUSIVE);
> + /* Displaying previously published statistics if available */
> + if (DsaPointerIsValid(memCtxState[procNumber].memstats_dsa_pointer))
> + break;
> + else
> + {
> + LWLockRelease(&memCtxState[procNumber].lw_lock);
> + PG_RETURN_NULL();
> + }
> + }
> + }
> + }
> +/*
> + * Initialize shared memory for displaying memory context statistics
> + */
> +void
> +MemoryContextReportingShmemInit(void)
> +{
> + bool found;
> +
> + memCtxArea = (MemoryContextState *)
> + ShmemInitStruct("MemoryContextState", sizeof(MemoryContextState), &found);
> +
> + if (!IsUnderPostmaster)
> + {
> + Assert(!found);
I don't really understand why this uses IsUnderPostmaster? Seems like this
should just use found like most (or all) the other *ShmemInit() functions do?
> + LWLockInitialize(&memCtxArea->lw_lock, LWLockNewTrancheId());
I think for builtin code we just hardcode the tranches in BuiltinTrancheIds.
> + memCtxState = (MemoryContextBackendState *)
> + ShmemInitStruct("MemoryContextBackendState",
> + ((MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(MemoryContextBackendState)),
> + &found);
FWIW, I think it'd be mildly better if these two ShmemInitStruct()'s were
combined.
> static void
> MemoryContextStatsInternal(MemoryContext context, int level,
> int max_level, int max_children,
> MemoryContextCounters *totals,
> - bool print_to_stderr)
> + PrintDestination print_location, int *num_contexts)
> {
> MemoryContext child;
> int ichild;
> @@ -884,10 +923,39 @@ MemoryContextStatsInternal(MemoryContext context, int level,
> Assert(MemoryContextIsValid(context));
>
> /* Examine the context itself */
> - context->methods->stats(context,
> - MemoryContextStatsPrint,
> - &level,
> - totals, print_to_stderr);
> + switch (print_location)
> + {
> + case PRINT_STATS_TO_STDERR:
> + context->methods->stats(context,
> + MemoryContextStatsPrint,
> + &level,
> + totals, true);
> + break;
> +
> + case PRINT_STATS_TO_LOGS:
> + context->methods->stats(context,
> + MemoryContextStatsPrint,
> + &level,
> + totals, false);
> + break;
> +
> + case PRINT_STATS_NONE:
> +
> + /*
> + * Do not print the statistics if print_location is
> + * PRINT_STATS_NONE, only compute totals. This is used in
> + * reporting of memory context statistics via a sql function. Last
> + * parameter is not relevant.
> + */
> + context->methods->stats(context,
> + NULL,
> + NULL,
> + totals, false);
> + break;
> + }
> +
> + /* Increment the context count for each of the recursive call */
> + *num_contexts = *num_contexts + 1;
It feels a bit silly to duplicate the call to context->methods->stats three
times. We've changed these parameters a bunch in the past, having more callers
to fix makes that more work. Can't the switch just set up the args that are
then passed to one call to context->methods->stats?
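Roughly like this, perhaps; the sketch mirrors the quoted code, and passing NULL as the passthru argument in the no-print case is an assumption taken from it.
```
/* Let the switch only pick the arguments; call the callback once. */
MemoryStatsPrintFunc printfunc = NULL;
bool		print_to_stderr = false;

switch (print_location)
{
	case PRINT_STATS_TO_STDERR:
		printfunc = MemoryContextStatsPrint;
		print_to_stderr = true;
		break;
	case PRINT_STATS_TO_LOGS:
		printfunc = MemoryContextStatsPrint;
		break;
	case PRINT_STATS_NONE:
		/* only compute totals */
		break;
}

context->methods->stats(context, printfunc, printfunc ? &level : NULL,
						totals, print_to_stderr);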
> +
> + /* Compute the number of stats that can fit in the defined limit */
> + max_stats = (MAX_SEGMENTS_PER_BACKEND * DSA_DEFAULT_INIT_SEGMENT_SIZE)
> + / (MAX_MEMORY_CONTEXT_STATS_SIZE);
MAX_SEGMENTS_PER_BACKEND sounds way too generic to me for something defined in
memutils.h. I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
something that makes sense to use here?
The header says:
> +/* Maximum size (in Mb) of DSA area per process */
> +#define MAX_SEGMENTS_PER_BACKEND 1
But the name doesn't at all indicate it's in megabytes. Nor does the way it's
used clearly indicate that. That seems to be completely incidental based on
the current default value DSA_DEFAULT_INIT_SEGMENT_SIZE.
> + /*
> + * Hold the process lock to protect writes to process specific memory. Two
> + * processes publishing statistics do not block each other.
> + */
s/specific/process specific/
> + LWLockAcquire(&memCtxState[idx].lw_lock, LW_EXCLUSIVE);
> + memCtxState[idx].proc_id = MyProcPid;
> +
> + if (DsaPointerIsValid(memCtxState[idx].memstats_dsa_pointer))
> + {
> + /*
> + * Free any previous allocations, free the name, ident and path
> + * pointers before freeing the pointer that contains them.
> + */
> + free_memorycontextstate_dsa(area, memCtxState[idx].total_stats,
> + memCtxState[idx].memstats_dsa_pointer);
> +
> + dsa_free(area, memCtxState[idx].memstats_dsa_pointer);
> + memCtxState[idx].memstats_dsa_pointer = InvalidDsaPointer;
Both callers to free_memorycontextstate_dsa() do these lines immediately after
calling free_memorycontextstate_dsa(), why not do that inside?
> + for (MemoryContext c = TopMemoryContext->firstchild; c != NULL;
> + c = c->nextchild)
> + {
> + MemoryContextCounters grand_totals;
> + int num_contexts = 0;
> + int level = 0;
> +
> + path = NIL;
> + memset(&grand_totals, 0, sizeof(grand_totals));
> +
> + MemoryContextStatsInternal(c, level, 100, 100, &grand_totals,
> + PRINT_STATS_NONE, &num_contexts);
> +
> + path = compute_context_path(c, context_id_lookup);
> +
> + PublishMemoryContext(meminfo, ctx_id, c, path,
> + grand_totals, num_contexts, area, 100);
> + ctx_id = ctx_id + 1;
> + }
> + memCtxState[idx].total_stats = ctx_id;
> + /* Notify waiting backends and return */
> + hash_destroy(context_id_lookup);
> + dsa_detach(area);
> + signal_memorycontext_reporting();
> + }
> +
> + foreach_ptr(MemoryContextData, cur, contexts)
> + {
> + List *path = NIL;
> +
> + /*
> + * Figure out the transient context_id of this context and each of its
> + * ancestors, to compute a path for this context.
> + */
> + path = compute_context_path(cur, context_id_lookup);
> +
> + /* Account for saving one statistics slot for cumulative reporting */
> + if (context_id < (max_stats - 1) || stats_count <= max_stats)
> + {
> + /* Examine the context stats */
> + memset(&stat, 0, sizeof(stat));
> + (*cur->methods->stats) (cur, NULL, NULL, &stat, true);
Hm. So here we call the callback ourselves, even though we extended
MemoryContextStatsInternal() to satisfy the summary output. I guess it's
tolerable, but it's not great.
> + /* Copy statistics to DSA memory */
> + PublishMemoryContext(meminfo, context_id, cur, path, stat, 1, area, 100);
> + }
> + else
> + {
> + /* Examine the context stats */
> + memset(&stat, 0, sizeof(stat));
> + (*cur->methods->stats) (cur, NULL, NULL, &stat, true);
But do we really do it twice in a row? The lines are exactly the same, so it
seems that should just be done before the if?
> +
> + /* Notify waiting backends and return */
> + hash_destroy(context_id_lookup);
> + dsa_detach(area);
> + signal_memorycontext_reporting();
> +}
> +
> +/*
> + * Signal all the waiting client backends after copying all the statistics.
> + */
> +static void
> +signal_memorycontext_reporting(void)
> +{
> + memCtxState[MyProcNumber].stats_timestamp = GetCurrentTimestamp();
> + LWLockRelease(&memCtxState[MyProcNumber].lw_lock);
> + ConditionVariableBroadcast(&memCtxState[MyProcNumber].memctx_cv);
> +}
IMO somewhat confusing to release the lock in a function named
signal_memorycontext_reporting(). Why do we do that after
hash_destroy()/dsa_detach()?
> +static void
> +compute_contexts_count_and_ids(List *contexts, HTAB *context_id_lookup,
> + int *stats_count, bool summary)
> +{
> + foreach_ptr(MemoryContextData, cur, contexts)
> + {
> + MemoryContextId *entry;
> + bool found;
> +
> + entry = (MemoryContextId *) hash_search(context_id_lookup, &cur,
> + HASH_ENTER, &found);
> + Assert(!found);
> +
> + /* context id starts with 1 */
> + entry->context_id = ++(*stats_count);
Given that we don't actually do anything here relating to starting with 1, I
find that comment confusing.
> +static void
> +PublishMemoryContext(MemoryContextStatsEntry *memctx_info, int curr_id,
> + MemoryContext context, List *path,
> + MemoryContextCounters stat, int num_contexts,
> + dsa_area *area, int max_levels)
> +{
> + const char *ident = context->ident;
> + const char *name = context->name;
> + int *path_list;
> +
> + /*
> + * To be consistent with logging output, we label dynahash contexts with
> + * just the hash table name as with MemoryContextStatsPrint().
> + */
> + if (context->ident && strncmp(context->name, "dynahash", 8) == 0)
> + {
> + name = context->ident;
> + ident = NULL;
> + }
> +
> + if (name != NULL)
> + {
> + int namelen = strlen(name);
> + char *nameptr;
> +
> + if (strlen(name) >= MEMORY_CONTEXT_IDENT_SHMEM_SIZE)
> + namelen = pg_mbcliplen(name, namelen,
> + MEMORY_CONTEXT_IDENT_SHMEM_SIZE - 1);
> +
> + memctx_info[curr_id].name = dsa_allocate0(area, namelen + 1);
Given the number of references to memctx_info[curr_id] I'd put it in a local variable.
Why is this a dsa_allocate0 given that we're immediately overwriting it?
> + nameptr = (char *) dsa_get_address(area, memctx_info[curr_id].name);
> + strlcpy(nameptr, name, namelen + 1);
> + }
> + else
> + memctx_info[curr_id].name = InvalidDsaPointer;
> +
> + /* Trim and copy the identifier if it is not set to NULL */
> + if (ident != NULL)
> + {
> + int idlen = strlen(context->ident);
> + char *identptr;
> +
> + /*
> + * Some identifiers such as SQL query string can be very long,
> + * truncate oversize identifiers.
> + */
> + if (idlen >= MEMORY_CONTEXT_IDENT_SHMEM_SIZE)
> + idlen = pg_mbcliplen(ident, idlen,
> + MEMORY_CONTEXT_IDENT_SHMEM_SIZE - 1);
> +
> + memctx_info[curr_id].ident = dsa_allocate0(area, idlen + 1);
> + identptr = (char *) dsa_get_address(area, memctx_info[curr_id].ident);
> + strlcpy(identptr, ident, idlen + 1);
Hm. First I thought we'd leak memory if this second (and subsequent)
dsa_allocate failed. Then I thought we'd be ok, because the memory would
still be reachable from memCtxState[idx].memstats_dsa_pointer.
But I think it wouldn't *quite* work, because memCtxState[idx].total_stats is
only set *after* we would have failed.
> + /* Allocate DSA memory for storing path information */
> + if (path == NIL)
> + memctx_info[curr_id].path = InvalidDsaPointer;
> + else
> + {
> + int levels = Min(list_length(path), max_levels);
> +
> + memctx_info[curr_id].path_length = levels;
> + memctx_info[curr_id].path = dsa_allocate0(area, levels * sizeof(int));
> + memctx_info[curr_id].levels = list_length(path);
> + path_list = (int *) dsa_get_address(area, memctx_info[curr_id].path);
> +
> + foreach_int(i, path)
> + {
> + path_list[foreach_current_index(i)] = i;
> + if (--levels == 0)
> + break;
> + }
> + }
> + memctx_info[curr_id].type = ContextTypeToString(context->type);
I don't think this works across platforms. On windows / EXEC_BACKEND builds
the location of string constants can differ across backends. And: Why do we
need the string here? You can just call ContextTypeToString when reading?
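A hedged sketch of what that would look like; the struct field and the tuple-building line are illustrative, not lifted from the patch.
```
/*
 * Store the NodeTag in the shared entry and resolve it to a string only
 * in the reading backend, avoiding pointers to string constants that may
 * not be valid across processes (e.g. EXEC_BACKEND builds).
 */

/* publishing side */
memctx_info[curr_id].type = context->type;	/* NodeTag, not a char * */

/* reading side, when building the result tuple */
values[col++] =
	CStringGetTextDatum(ContextTypeToString(memctx_info[curr_id].type));
```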
> +/*
> + * Free the memory context statistics stored by this process
> + * in DSA area.
> + */
> +void
> +AtProcExit_memstats_dsa_free(int code, Datum arg)
> +{
FWIW, to me the fact that it does a dsa_free() is an implementation
detail. It's also not the only thing this does.
And, I don't think AtProcExit* really is accurate, given that it runs *before*
shmem is cleaned up?
I wonder if the best approach here wouldn't be to forgo the use of a
before_shmem_exit() callback, but instead use on_dsm_detach(). That would
require we'd not constantly detach from the dsm segment, but I don't
understand why we do that in the first place?
> + int idx = MyProcNumber;
> + dsm_segment *dsm_seg = NULL;
> + dsa_area *area = NULL;
> +
> + if (memCtxArea->memstats_dsa_handle == DSA_HANDLE_INVALID)
> + return;
> +
> + dsm_seg = dsm_find_mapping(memCtxArea->memstats_dsa_handle);
> +
> + LWLockAcquire(&memCtxState[idx].lw_lock, LW_EXCLUSIVE);
> +
> + if (!DsaPointerIsValid(memCtxState[idx].memstats_dsa_pointer))
> + {
> + LWLockRelease(&memCtxState[idx].lw_lock);
> + return;
> + }
> +
> + /* If the dsm mapping could not be found, attach to the area */
> + if (dsm_seg != NULL)
> + return;
I don't understand what we do here with the dsm? Why do we not need cleanup
if we are already attached to the dsm segment?
> +/*
> + * Static shared memory state representing the DSA area created for memory
> + * context statistics reporting. A single DSA area is created and used by all
> + * the processes, each having its specific DSA allocations for sharing memory
> + * statistics, tracked by per backend static shared memory state.
> + */
> +typedef struct MemoryContextState
> +{
> + dsa_handle memstats_dsa_handle;
> + LWLock lw_lock;
> +} MemoryContextState;
IMO that's too generic a name for something in a header.
> +/*
> + * Used for storage of transient identifiers for pg_get_backend_memory_contexts
> + */
> +typedef struct MemoryContextId
> +{
> + MemoryContext context;
> + int context_id;
> +} MemoryContextId;
This too. Particularly because MemoryContextData->ident exists but is
something different.
> +DO $$
> +DECLARE
> + launcher_pid int;
> + r RECORD;
> +BEGIN
> + SELECT pid from pg_stat_activity where backend_type='autovacuum launcher'
> + INTO launcher_pid;
> +
> + select type, name, ident
> + from pg_get_process_memory_contexts(launcher_pid, false, 20)
> + where path = '{1}' into r;
> + RAISE NOTICE '%', r;
> + select type, name, ident
> + from pg_get_process_memory_contexts(pg_backend_pid(), false, 20)
> + where path = '{1}' into r;
> + RAISE NOTICE '%', r;
> +END $$;
I'd also test an aux process. I think the AV launcher isn't one, because it
actually does "table" access of shared relations.
Greetings,
Andres Freund
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-07 16:27:57 |
Message-ID: | CAH2L28v66-=8R0P9itK16edxn2n41nVE7yVMr98V=bLOekgteA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Please see some responses below.
On Mon, Apr 7, 2025 at 9:13 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi,
>
> On 2025-04-07 15:41:37 +0200, Daniel Gustafsson wrote:
> > I think this function can be a valuable debugging aid going forward.
>
> What I am most excited about for this is to be able to measure server-wide
> and
> fleet-wide memory usage over time. Today I have actually very little idea
> about what memory is being used for across all connections, not to speak
> of a
> larger number of servers.
>
>
> > diff --git a/src/backend/postmaster/auxprocess.c
> b/src/backend/postmaster/auxprocess.c
> > index 4f6795f7265..d3b4df27935 100644
> > --- a/src/backend/postmaster/auxprocess.c
> > +++ b/src/backend/postmaster/auxprocess.c
> > @@ -84,6 +84,13 @@ AuxiliaryProcessMainCommon(void)
> > /* register a before-shutdown callback for LWLock cleanup */
> > before_shmem_exit(ShutdownAuxiliaryProcess, 0);
> >
> > + /*
> > + * The before shmem exit callback frees the DSA memory occupied by
> the
> > + * latest memory context statistics that could be published by
> this aux
> > + * proc if requested.
> > + */
> > + before_shmem_exit(AtProcExit_memstats_dsa_free, 0);
> > +
> > SetProcessingMode(NormalProcessing);
> > }
>
> How about putting it into BaseInit()? Or maybe just register it when its
> first used?
>
>
The problem with registering it when the DSA is first used is that the DSA is
used in an interrupt handler. The handler could be called from a
PG_ENSURE_ERROR_CLEANUP block. That block operates under the assumption that
the before_shmem_exit callback registered at its beginning will be the last
one in the registered callback list at the end of the block. However, this
won't be the case if a callback is registered from an interrupt handler
called within the PG_ENSURE_ERROR_CLEANUP block.
>
> I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
> something that makes sense to use here?
>
To determine the memory limit per backend in multiples of
DSA_DEFAULT_INIT_SEGMENT_SIZE; currently it is set to
1 * DSA_DEFAULT_INIT_SEGMENT_SIZE. Since a call to dsa_create would create a
DSA segment of this size, I thought it made sense to define a limit related
to the segment size.
> > +/*
>
> + /* If the dsm mapping could not be found, attach to the area */
> > + if (dsm_seg != NULL)
> > + return;
>
> I don't understand what we do here with the dsm? Why do we not need
> cleanup
> if we are already attached to the dsm segment?
>
I am not expecting to hit this case, since we always detach from the DSA.
This could be an assert, but since it is cleanup code, I thought returning
would be a harmless step.
Thank you,
Rahila Syed
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-07 16:38:28 |
Message-ID: | le7vtpckuo6yc2usxwfc5r4ub7ghvph2ovxw3xcwue6wb63tyh@jvh4zt6ffobg |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2025-04-07 21:57:57 +0530, Rahila Syed wrote:
> > > diff --git a/src/backend/postmaster/auxprocess.c
> > b/src/backend/postmaster/auxprocess.c
> > > index 4f6795f7265..d3b4df27935 100644
> > > --- a/src/backend/postmaster/auxprocess.c
> > > +++ b/src/backend/postmaster/auxprocess.c
> > > @@ -84,6 +84,13 @@ AuxiliaryProcessMainCommon(void)
> > > /* register a before-shutdown callback for LWLock cleanup */
> > > before_shmem_exit(ShutdownAuxiliaryProcess, 0);
> > >
> > > + /*
> > > + * The before shmem exit callback frees the DSA memory occupied by
> > the
> > > + * latest memory context statistics that could be published by
> > this aux
> > > + * proc if requested.
> > > + */
> > > + before_shmem_exit(AtProcExit_memstats_dsa_free, 0);
> > > +
> > > SetProcessingMode(NormalProcessing);
> > > }
> >
> > How about putting it into BaseInit()? Or maybe just register it when its
> > first used?
> >
> >
> Problem with registering it when dsa is first used is that dsa is used in an
> interrupt handler. The handler could be called from the
> PG_ENSURE_ERROR_CLEANUP block. This block operates under the assumption that
> the before_shmem_exit callback registered at the beginning, will be the last
> one in the registered callback list at the end of the block. However, this
> won't be the case if a callback is registered from an interrupt handler
> called in the PG_ENSURE_ERROR_CLEANUP block.
Ugh, I really dislike PG_ENSURE_ERROR_CLEANUP().
That's not an argument against moving it to BaseInit() though, as that's
called before procsignal is even initialized and before signals are unmasked.
> I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
>
> something that makes sense to use here?
> >
> >
> To determine the memory limit per backend in multiples of
> DSA_DEFAULT_INIT_SEGMENT_SIZE.
> Currently it is set to 1 * DSA_DEFAULT_INIT_SEGMENT_SIZE.
> Since a call to dsa_create would create a DSA segment of this size, I
> thought it makes sense
> to define a limit related to the segment size.
I strongly disagree. The limit should be in an understandable unit, not based
on another subsystem's defaults that might change at some point.
> > + /* If the dsm mapping could not be found, attach to the area */
> > > + if (dsm_seg != NULL)
> > > + return;
> >
> > I don't understand what we do here with the dsm? Why do we not need
> > cleanup
> > if we are already attached to the dsm segment?
> >
>
> I am not expecting to hit this case, since we are always detaching from the
> dsa.
Pretty sure it's reachable, consider a failure of dsa_allocate(). That'll
throw an error, while attached to the segment.
> This could be an assert but since it is a cleanup code, I thought returning
> would be a harmless step.
The problem is that the code seems wrong - if we are already attached we'll
leak the memory!
As I also mentioned, I don't understand why we're constantly
attaching/detaching from the dsa/dsm either. It just seems to make things
more complicated and more expensive.
Greetings,
Andres Freund
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-07 18:30:54 |
Message-ID: | CAH2L28tzfSdFawTQS45SWVR5mqk2iQBDz0hFahCZEQOf93HxrQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
>
>
>
> That's not an argument against moving it to BaseInit() though, as that's
> called before procsignal is even initialized and before signals are
> unmasked.
>
Yes, OK.
> I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
> >
> > something that makes sense to use here?
> > >
> > >
> > To determine the memory limit per backend in multiples of
> > DSA_DEFAULT_INIT_SEGMENT_SIZE.
> > Currently it is set to 1 * DSA_DEFAULT_INIT_SEGMENT_SIZE.
> > Since a call to dsa_create would create a DSA segment of this size, I
> > thought it makes sense
> > to define a limit related to the segment size.
>
> I strongly disagree. The limit should be in an understandable unit, not on
> another subystems's defaults that might change at some point.
>
OK, makes sense.
>
>
> > > + /* If the dsm mapping could not be found, attach to the area */
> > > > + if (dsm_seg != NULL)
> > > > + return;
> > >
> > > I don't understand what we do here with the dsm? Why do we not need
> > > cleanup
> > > if we are already attached to the dsm segment?
> > >
> >
> > I am not expecting to hit this case, since we are always detaching from
> the
> > dsa.
>
> Pretty sure it's reachable, consider a failure of dsa_allocate(). That'll
> throw an error, while attached to the segment.
>
>
You are right, I did not think of this scenario.
>
> > This could be an assert but since it is a cleanup code, I thought
> returning
> > would be a harmless step.
>
> The problem is that the code seems wrong - if we are already attached we'll
> leak the memory!
>
>
I understand your concern. One issue I recall is that we do not have a
dsa_find_mapping function similar to dsm_find_mapping(). If I understand
correctly, the only way to access an already attached DSA is to store the
DSA area mapping in a global variable. I'm considering using a global
variable and accessing it from the cleanup function in case the area is
already mapped.
Does that sound fine?
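Sketched out, the idea might look like this; variable names other than those quoted from the patch are assumptions.
```
/*
 * Remember the dsa_area mapping in a process-local variable when it is
 * first attached, and reuse it both in the interrupt handler and in the
 * cleanup callback.
 */
static dsa_area *MemoryStatsDsaArea = NULL;

/* on first use */
if (MemoryStatsDsaArea == NULL)
	MemoryStatsDsaArea = dsa_attach(memCtxArea->memstats_dsa_handle);

/* in the cleanup callback: no need to re-attach or search for a mapping */
if (MemoryStatsDsaArea != NULL &&
	DsaPointerIsValid(memCtxState[MyProcNumber].memstats_dsa_pointer))
	dsa_free(MemoryStatsDsaArea,
			 memCtxState[MyProcNumber].memstats_dsa_pointer);
```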
> As I also mentioned, I don't understand why we're constantly
> attaching/detaching from the dsa/dsm either. It just seems to make things
> more
> complicated an dmore expensive.
>
OK, I see that this could be expensive if a process is periodically being
queried for statistics. However, in scenarios where a process is queried
only once for memory statistics, keeping the area mapped would consume
memory resources, correct?
Thank you,
Rahila Syed
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-07 23:17:17 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 7 Apr 2025, at 17:43, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2025-04-07 15:41:37 +0200, Daniel Gustafsson wrote:
>> I think this function can be a valuable debugging aid going forward.
>
> What I am most excited about for this is to be able to measure server-wide and
> fleet-wide memory usage over time. Today I have actually very little idea
> about what memory is being used for across all connections, not to speak of a
> larger number of servers.
Thanks for looking. Rahila and I took a collective stab at the review comments.
>> + before_shmem_exit(AtProcExit_memstats_dsa_free, 0);
>> +
>> SetProcessingMode(NormalProcessing);
>> }
>
> How about putting it into BaseInit()? Or maybe just register it when its
> first used?
Moved to BaseInit().
>> +MEM_CTX_PUBLISH "Waiting for a process to publish memory information."
>
> The memory context stuff abbreviates as cxt not ctx. There's a few more cases
> of that in the patch.
I never get that right. Fixed.
>> + return (context_type);
>
> Why these parens?
Must be a leftover from something, fixed. Sorry about that.
>> + * If the publishing backend does not respond before the condition variable
>> + * times out, which is set to MEMSTATS_WAIT_TIMEOUT, retry given that there is
>> + * time left within the timeout specified by the user, before giving up and
>> + * returning previously published statistics, if any. If no previous statistics
>> + * exist, return NULL.
>
> Why do we need to repeatedly wake up rather than just sleeping with the
> "remaining" amount of time based on the time the function was called and the
> time that has passed since?
Fair point, the current coding was a conversion from the previous retry-based
approach but your suggestion is clearly correct. There is still potential for
refactoring but at this point I don't want to change too much all at once.
>> + * A valid DSA pointer isn't proof that statistics are available, it can
>> + * be valid due to previously published stats.
>
> Somehow "valid DSA pointer" is a bit too much about the precise mechanics and
> not enough about what's actually happening. I'd rather say something like
>
> "Even if the proc has published statistics, they may not be due to the current
> request, but previously published stats."
Agreed, that's better. Changed.
>> + if (!IsUnderPostmaster)
>> + {
>> + Assert(!found);
>
> I don't really understand why this uses IsUnderPostmaster? Seems like this
> should just use found like most (or all) the other *ShmemInit() functions do?
Agreed, Fixed.
>> + LWLockInitialize(&memCtxArea->lw_lock, LWLockNewTrancheId());
>
> I think for builtin code we just hardcode the tranches in BuiltinTrancheIds.
Fixed.
> It feels a bit silly to duplicate the call to context->methods->stats three
> times. We've changed these parameters a bunch in the past, having more callers
> to fix makes that more work. Can't the switch just set up the args that are
> then passed to one call to context->methods->stats?
I don't disagree, but I prefer to do that as a separate refactoring to not
change too many things all at once.
>> +
>> + /* Compute the number of stats that can fit in the defined limit */
>> + max_stats = (MAX_SEGMENTS_PER_BACKEND * DSA_DEFAULT_INIT_SEGMENT_SIZE)
>> + / (MAX_MEMORY_CONTEXT_STATS_SIZE);
>
> MAX_SEGMENTS_PER_BACKEND sounds way too generic to me for something defined in
> memutils.h. I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
> something that makes sense to use here?
Renamed, and dependency on DSA_DEFAULT_INIT_SEGMENT_SIZE removed.
>> + /*
>> + * Hold the process lock to protect writes to process specific memory. Two
>> + * processes publishing statistics do not block each other.
>> + */
>
> s/specific/process specific/
That's what it says though.. isn't it? I might be missing something obvious.
>> + dsa_free(area, memCtxState[idx].memstats_dsa_pointer);
>> + memCtxState[idx].memstats_dsa_pointer = InvalidDsaPointer;
>
> Both callers to free_memorycontextstate_dsa() do these lines immediately after
> calling free_memorycontextstate_dsa(), why not do that inside?
Fixed.
>> + /* Copy statistics to DSA memory */
>> + PublishMemoryContext(meminfo, context_id, cur, path, stat, 1, area, 100);
>> + }
>> + else
>> + {
>> + /* Examine the context stats */
>> + memset(&stat, 0, sizeof(stat));
>> + (*cur->methods->stats) (cur, NULL, NULL, &stat, true);
>
> But do we really do it twice in a row? The lines are exactly the same, so it
> seems that should just be done before the if?
Fixed.
>> +signal_memorycontext_reporting(void)
>
> IMO somewhat confusing to release the lock in a function named
> signal_memorycontext_reporting(). Why do we do that after
> hash_destroy()/dsa_detach()?
The function has been renamed for clarity.
>> + /* context id starts with 1 */
>> + entry->context_id = ++(*stats_count);
>
> Given that we don't actually do anything here relating to starting with 1, I
> find that comment confusing.
Reworded, not sure if it's much better tbh.
>> + memctx_info[curr_id].name = dsa_allocate0(area, namelen + 1);
>
> Given the number of references to memctx_info[curr_id] I'd put it in a local variable.
I might be partial, but I sort of prefer this way since it makes the underlying
data structure clear to the reader.
> Why is this a dsa_allocate0 given that we're immediately overwriting it?
It doesn't need to be zeroed as it's immediately overwritten. Fixed.
>> + memctx_info[curr_id].ident = dsa_allocate0(area, idlen + 1);
>> + identptr = (char *) dsa_get_address(area, memctx_info[curr_id].ident);
>> + strlcpy(identptr, ident, idlen + 1);
>
> Hm. First I thought we'd leak memory if this second (and subsequent)
> dsa_allocate failed. Then I thought we'd be ok, because the memory would
> still be reachable from memCtxState[idx].memstats_dsa_pointer.
>
> But I think it wouldn't *quite* work, because memCtxState[idx].total_stats is
> only set *after* we would have failed.
Keeping a running total in .total_stats should make the leak window smaller.
>> + memctx_info[curr_id].type = ContextTypeToString(context->type);
>
> I don't think this works across platforms. On windows / EXEC_BACKEND builds
> the location of string constants can differ across backends. And: Why do we
> need the string here? You can just call ContextTypeToString when reading?
Correct, we can just store the type and call ContextTypeToString when
generating the tuple. Fixed.
>> +/*
>> + * Free the memory context statistics stored by this process
>> + * in DSA area.
>> + */
>> +void
>> +AtProcExit_memstats_dsa_free(int code, Datum arg)
>> +{
>
> FWIW, to me the fact that it does a dsa_free() is an implementation
> detail. It's also not the only thing this does.
Renamed.
> And, I don't think AtProcExit* really is accurate, given that it runs *before*
> shmem is cleaned up?
>
> I wonder if the best approach here wouldn't be to forgo the use of a
> before_shmem_exit() callback, but instead use on_dsm_detach(). That would
> require we'd not constantly detach from the dsm segment, but I don't
> understand why we do that in the first place?
The attach/detach has been removed.
>> + /* If the dsm mapping could not be found, attach to the area */
>> + if (dsm_seg != NULL)
>> + return;
>
> I don't understand what we do here with the dsm? Why do we not need cleanup
> if we are already attached to the dsm segment?
Fixed.
>> +} MemoryContextState;
>
> IMO that's too generic a name for something in a header.
>
>> +} MemoryContextId;
>
> This too. Particularly because MemoryContextData->ident exist but is
> something different.
Renamed both to use MemoryContextReporting* namespace, which leaves
MemoryContextReportingBackendState at an unwieldy long name. I'm running out
of ideas on how to improve it, and it does make the purpose quite explicit at least.
>> + from pg_get_process_memory_contexts(launcher_pid, false, 20)
>> + where path = '{1}' into r;
>> + RAISE NOTICE '%', r;
>> + select type, name, ident
>> + from pg_get_process_memory_contexts(pg_backend_pid(), false, 20)
>> + where path = '{1}' into r;
>> + RAISE NOTICE '%', r;
>> +END $$;
>
> I'd also test an aux process. I think the AV launcher isn't one, because it
> actually does "table" access of shared relations.
Fixed, switched from the AV launcher.
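For reference, something along these lines exercises an aux process (using the
checkpointer, and the same function signature as in the quoted hunk; this is
illustrative only, not necessarily the committed test):

    do $$
    declare
        r record;
        cp_pid int;
    begin
        select pid into cp_pid from pg_stat_activity
            where backend_type = 'checkpointer';
        select type, name, ident
            from pg_get_process_memory_contexts(cp_pid, false, 20)
            where path = '{1}' into r;
        raise notice '%', r;
    end $$;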
--
Daniel Gustafsson
Attachment | Content-Type | Size |
---|---|---|
v27-0001-Add-function-to-get-memory-context-stats-for-pro.patch | application/octet-stream | 65.5 KB |
unknown_filename | text/plain | 1 byte |
From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 00:03:36 |
Message-ID: | tesneyk3z2dtrjgwlmkw2wbr7e3olwkowlpke6kl463hfhxedb@fyyqsnwjcp4l |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
On 2025-04-08 01:17:17 +0200, Daniel Gustafsson wrote:
> > On 7 Apr 2025, at 17:43, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> >> + /*
> >> + * Hold the process lock to protect writes to process specific memory. Two
> >> + * processes publishing statistics do not block each other.
> >> + */
> >
> > s/specific/process specific/
>
> That's what it says though.. isn't it? I might be missing something obvious.
Understandable confusion, not sure what my brain was doing anymore
either...
> >> +} MemoryContextState;
> >
> > IMO that's too generic a name for something in a header.
> >
> >> +} MemoryContextId;
> >
> > This too. Particularly because MemoryContextData->ident exist but is
> > something different.
>
> Renamed both to use MemoryContextReporting* namespace, which leaves
> MemoryContextReportingBackendState at an unwieldly long name. I'm running out
> of ideas on how to improve and it does make purpose quite explicit at least.
How about
MemoryContextReportingBackendState -> MemoryStatsBackendState
MemoryContextReportingId -> MemoryStatsContextId
MemoryContextReportingSharedState -> MemoryStatsCtl
MemoryContextReportingStatsEntry -> MemoryStatsEntry
> >> + /* context id starts with 1 */
> >> + entry->context_id = ++(*stats_count);
> >
> > Given that we don't actually do anything here relating to starting with 1, I
> > find that comment confusing.
>
> Reworded, not sure if it's much better tbh.
I'd probably just remove the comment.
> > Hm. First I thought we'd leak memory if this second (and subsequent)
> > dsa_allocate failed. Then I thought we'd be ok, because the memory would be
> > memory because it'd be reachable from memCtxState[idx].memstats_dsa_pointer.
> >
> > But I think it wouldn't *quite* work, because memCtxState[idx].total_stats is
> > only set *after* we would have failed.
>
> Keeping a running total in .total_stats should make the leak window smaller.
Why not just initialize .total_stats *before* calling any fallible code?
Afaict it's zero-allocated, so the free function should have no problem
dealing with the entries that haven't yet been populated.
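The general pattern, sketched standalone with plain malloc/free instead of
the DSA API (the type and function names here are illustrative only):

    #include <stdlib.h>
    #include <string.h>

    typedef struct StatsEntry { char *name; } StatsEntry;
    typedef struct StatsState { int total; StatsEntry *entries; } StatsState;

    /*
     * Cleanup copes with a partially populated array because 'total' was set
     * before any fallible allocation; free(NULL) is a no-op for entries that
     * never got filled in.
     */
    static void
    cleanup(StatsState *st)
    {
        for (int i = 0; i < st->total; i++)
            free(st->entries[i].name);
        free(st->entries);
        st->entries = NULL;
        st->total = 0;
    }

    static int
    publish(StatsState *st, int n)
    {
        st->entries = calloc(n, sizeof(StatsEntry));   /* zero-initialized */
        if (st->entries == NULL)
            return -1;
        st->total = n;            /* set *before* any fallible allocation */
        for (int i = 0; i < n; i++)
        {
            st->entries[i].name = strdup("context name");
            if (st->entries[i].name == NULL)
                return -1;        /* caller runs cleanup(); nothing leaks */
        }
        return 0;
    }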
Greetings,
Andres Freund
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de>, Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 05:40:34 |
Message-ID: | CAH2L28tp8RMa0CrCgdCJw20vFzeGQMuHkXAoPgYC5JZuXY8_+g@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi Daniel, Andres,
>
> > >> +} MemoryContextState;
> > >
> > > IMO that's too generic a name for something in a header.
> > >
> > >> +} MemoryContextId;
> > >
> > > This too. Particularly because MemoryContextData->ident exist but is
> > > something different.
> >
> > Renamed both to use MemoryContextReporting* namespace, which leaves
> > MemoryContextReportingBackendState at an unwieldly long name. I'm
> running out
> > of ideas on how to improve and it does make purpose quite explicit at
> least.
>
> How about
>
> MemoryContextReportingBackendState -> MemoryStatsBackendState
> MemoryContextReportingId -> MemoryStatsContextId
> MemoryContextReportingSharedState -> MemoryStatsCtl
> MemoryContextReportingStatsEntry -> MemoryStatsEntry
>
>
>
Fixed accordingly.
> > >> + /* context id starts with 1 */
> > >> + entry->context_id = ++(*stats_count);
> > >
> > > Given that we don't actually do anything here relating to starting
> with 1, I
> > > find that comment confusing.
> >
> > Reworded, not sure if it's much better tbh.
>
> I'd probably just remove the comment.
>
>
Reworded to mention that we pre-increment stats_count to make sure
the id starts at 1.
>
> > > Hm. First I thought we'd leak memory if this second (and subsequent)
> > > dsa_allocate failed. Then I thought we'd be ok, because the memory
> would be
> > > memory because it'd be reachable from
> memCtxState[idx].memstats_dsa_pointer.
> > >
> > > But I think it wouldn't *quite* work, because
> memCtxState[idx].total_stats is
> > > only set *after* we would have failed.
> >
> > Keeping a running total in .total_stats should make the leak window
> smaller.
>
> Why not just initialize .total_stats *before* calling any fallible code?
> Afaict it's zero-allocated, so the free function should have no problem
> dealing with the entries that haven't yet been populated/
>
>
Fixed accordingly.
PFA a v28 which passes all local and github CI tests.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
v28-0001-Add-function-to-get-memory-context-stats-for-process.patch | application/octet-stream | 65.2 KB |
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 08:03:09 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 8 Apr 2025, at 07:40, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
>> > Renamed both to use MemoryContextReporting* namespace, which leaves
>> > MemoryContextReportingBackendState at an unwieldly long name. I'm running out
>> > of ideas on how to improve and it does make purpose quite explicit at least.
>>
>> How about
>>
>> MemoryContextReportingBackendState -> MemoryStatsBackendState
>> MemoryContextReportingId -> MemoryStatsContextId
>> MemoryContextReportingSharedState -> MemoryStatsCtl
>> MemoryContextReportingStatsEntry -> MemoryStatsEntry
>
> Fixed accordingly.
That's much better, thanks.
There was a bug in the shmem init function which caused it to fail on Windows,
the attached fixes that.
--
Daniel Gustafsson
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 09:46:59 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 8 Apr 2025, at 10:03, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
> There was a bug in the shmem init function which caused it to fail on Windows,
> the attached fixes that.
With this building green in CI over several re-builds, and another pass over
the docs and code with pgindent etc done, I pushed this earlier today. A few
BF animals have built green so far but I will continue to monitor it.
--
Daniel Gustafsson
From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se>, Rahila Syed <rahilasyed90(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 16:41:49 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2025/04/08 18:46, Daniel Gustafsson wrote:
>> On 8 Apr 2025, at 10:03, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>
>> There was a bug in the shmem init function which caused it to fail on Windows,
>> the attached fixes that.
>
> With this building green in CI over several re-builds, and another pass over
> the docs and code with pgindent etc done, I pushed this earlier today. A few
> BF animals have built green so far but I will continue to monitor it.
Thanks for committing this feature!
I noticed that the third argument of pg_get_process_memory_contexts() is named
"retries" in pg_proc.dat, while the documentation refers to it as "timeout".
Since "retries" is misleading, how about renaming it to "timeout" in pg_proc.dat?
Patch attached.
Also, as I mentioned earlier, I encountered an issue when calling
pg_get_process_memory_contexts() on the PID of a backend that had just
encountered an error but hadn't finished rolling back. It led to
the following situation:
Session 1 (PID=70011):
=# begin;
=# select 1/0;
ERROR: division by zero
Session 2:
=# select * from pg_get_process_memory_contexts(70011, false, 10);
Session 1 terminated with:
ERROR: ResourceOwnerEnlarge called after release started
FATAL: terminating connection because protocol synchronization was lost
Shouldn't this be addressed?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Rename-misleading-argument-in-pg_get_process_memo.patch | text/plain | 1.6 KB |
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 16:44:41 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 8 Apr 2025, at 18:41, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> On 2025/04/08 18:46, Daniel Gustafsson wrote:
>>> On 8 Apr 2025, at 10:03, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>>> There was a bug in the shmem init function which caused it to fail on Windows,
>>> the attached fixes that.
>> With this building green in CI over several re-builds, and another pass over
>> the docs and code with pgindent etc done, I pushed this earlier today. A few
>> BF animals have built green so far but I will continue to monitor it.
>
> Thanks for committing this feature!
>
> I noticed that the third argument of pg_get_process_memory_contexts() is named
> "retries" in pg_proc.dat, while the documentation refers to it as "timeout".
> Since "retries" is misleading, how about renaming it to "timeout" in pg_proc.dat?
> Patch attached.
Ugh, that's my bad. It was changed from using retries to a timeout and I
missed that.
> Also, as I mentioned earlier, I encountered an issue when calling
> pg_get_process_memory_contexts() on the PID of a backend that had just
> encountered an error but hadn't finished rolling back. It led to
> the following situation:
>
> Session 1 (PID=70011):
> =# begin;
> =# select 1/0;
> ERROR: division by zero
>
> Session 2:
> =# select * from pg_get_process_memory_contexts(70011, false, 10);
>
> Session 1 terminated with:
> ERROR: ResourceOwnerEnlarge called after release started
> FATAL: terminating connection because protocol synchronization was lost
>
> Shouldn't this be addressed?
Sorry, this must've been missed in this fairly long thread, will have a look at
it tonight.
--
Daniel Gustafsson
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 21:27:41 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 8 Apr 2025, at 18:41, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> I noticed that the third argument of pg_get_process_memory_contexts() is named
> "retries" in pg_proc.dat, while the documentation refers to it as "timeout".
I've committed this patch as it was obviously correct, thanks!
> Also, as I mentioned earlier, I encountered an issue when calling
> pg_get_process_memory_contexts() on the PID of a backend that had just
> encountered an error but hadn't finished rolling back. It led to
> the following situation:
I reconfirmed that the bugfix that Rahila shared in [0] fixes this issue (and
will fix others like it, as it's not related to this patch in particular but is
a bug in DSM attaching). My plan is to take that for a more thorough review
and test tomorrow and see how far it can be safely backpatched. Thanks for
bringing this up, sorry about it getting a bit lost among all the emails.
--
Daniel Gustafsson
[0] CAH2L28shr0j3JE5V3CXDFmDH-agTSnh2V8pR23X0UhRMbDQD9Q(at)mail(dot)gmail(dot)com
From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-08 23:28:09 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 2025/04/09 6:27, Daniel Gustafsson wrote:
>> On 8 Apr 2025, at 18:41, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
>> I noticed that the third argument of pg_get_process_memory_contexts() is named
>> "retries" in pg_proc.dat, while the documentation refers to it as "timeout".
>
> I've committed this patch as it was obviously correct, thanks!
Thanks a lot!
Since pg_proc.dat was modified, do we need to bump the catalog version?
>> Also, as I mentioned earlier, I encountered an issue when calling
>> pg_get_process_memory_contexts() on the PID of a backend that had just
>> encountered an error but hadn't finished rolling back. It led to
>> the following situation:
>
> I reconfirmed that the bugfix that Rahila shared in [0] fixes this issue (and
> will fix others like it, as it's not related to this patch in particular but is
> a bug in DSM attaching). My plan is to take that for a more thorough review
> and test tomorrow and see how far it can be safely backpatched. Thanks for
> bringing this up, sorry about it getting a bit lost among all the emails.
Appreciate your work on this!
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-29 13:13:07 |
Message-ID: | CAH2L28vt16C9xTuK+K7QZvtA3kCNWXOEiT=gEekUw3Xxp9LVQw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Please find attached a patch with some comments and documentation changes.
Additionally, added a missing '\0' termination to the "Remaining Totals" string.
I think this became necessary after we replaced dsa_allocate0()
with dsa_allocate() in the latest version.
Thank you,
Rahila Syed
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-typos-and-modify-few-comments.patch | application/octet-stream | 3.5 KB |
From: | Peter Eisentraut <peter(at)eisentraut(dot)org> |
---|---|
To: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-30 10:14:26 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 29.04.25 15:13, Rahila Syed wrote:
> Please find attached a patch with some comments and documentation changes.
> Additionaly, added a missing '\0' termination to "Remaining Totals" string.
> I think this became necessary after we replaced dsa_allocate0()
> with dsa_allocate() is the latest version.
> strncpy(nameptr, "Remaining Totals", namelen);
> + nameptr[namelen] = '\0';
Looks like a case for strlcpy()?
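I.e. roughly the following, assuming the destination buffer was allocated with
namelen + 1 bytes as in the earlier hunk (a sketch, not the committed change):

    strlcpy(nameptr, "Remaining Totals", namelen + 1);   /* always NUL-terminates */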
From: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
---|---|
To: | Peter Eisentraut <peter(at)eisentraut(dot)org> |
Cc: | Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-04-30 10:43:24 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
> On 30 Apr 2025, at 12:14, Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
>
> On 29.04.25 15:13, Rahila Syed wrote:
>> Please find attached a patch with some comments and documentation changes.
>> Additionaly, added a missing '\0' termination to "Remaining Totals" string.
>> I think this became necessary after we replaced dsa_allocate0()
>> with dsa_allocate() is the latest version.
>
> > strncpy(nameptr, "Remaining Totals", namelen);
> > + nameptr[namelen] = '\0';
>
> Looks like a case for strlcpy()?
True. I did go ahead with the strncpy and nul terminator assignment, mostly
out of muscle memory, but I agree that this would be a good place for a
strlcpy() instead.
--
Daniel Gustafsson
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-07-11 15:31:12 |
Message-ID: | CAH2L28sc-rEhyntPLoaC2XUa0ZjS5ka6KzEbuSVxQBBnUYu1KQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Please find attached the latest memory context statistics monitoring patch.
It has been redesigned to address several issues highlighted in the thread
[1]
and [2].
Here are some key highlights of the new design:
- All DSA processing has been moved out of the CFI handler function. Now,
all the dynamic shared memory
needed to store the statistics is created and deleted in the client
function. This change addresses concerns
that DSA APIs are too high level to be safely called from interrupt
handlers. There was also a concern that
DSA API calls might not provide re-entrancy, which could cause issues if
CFI is invoked from a DSA function
in the future.
- The static shared memory array has been replaced with a DSHASH table
which now holds metadata such as
pointers to actual statistics for each process.
- dsm_registry.c APIs are used for creating and attaching to DSA and
DSHASH table, which helps prevent code
duplication.
- To address the memory leak concern, we create an exclusive memory context
under the NULL context, which does not fall under the TopMemoryContext tree,
to handle all the memory allocations in ProcessGetMemoryContextInterrupt.
This ensures the memory context created by the function does not affect its
outcome. The memory context is reset at the end of the function, which helps
prevent any memory leaks (a rough sketch of this is included after this list).
- Changes made to the mcxt.c file have been relocated to mcxtfuncs.c, which
now contains all the existing
memory statistics-related functions along with the code for the proposed
function.
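A minimal sketch of the exclusive-context idea mentioned above (the context
name and the exact cleanup call are assumptions, not necessarily what the
patch does):

    /* Not a child of TopMemoryContext, so it cannot perturb the tree
     * whose statistics are being collected. */
    MemoryContext stats_cxt = AllocSetContextCreate(NULL,
                                                    "MemoryStatsLocalContext",
                                                    ALLOCSET_DEFAULT_SIZES);
    MemoryContext oldcxt = MemoryContextSwitchTo(stats_cxt);

    /* ... all palloc()s made while gathering statistics land in stats_cxt ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(stats_cxt);    /* or MemoryContextReset() if reused */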
The overall flow of a request is as follows:
1. A client backend running the pg_get_process_memory_contexts function
creates a DSA and allocates memory
to store statistics, tracked by DSA pointer. This pointer is stored in a
DSHASH entry for each client querying the
statistics of any process.
The client shares its DSHASH table key with the server process using a
static shared array of keys indexed
by the server's procNumber. It notifies the server process to publish
statistics by using SendProcSignal.
2. When a PostgreSQL server process handles the request for memory
statistics, the CFI function accesses the
client hash key stored in its procNumber slot of the shared keys array. The
server process then retrieves the
DSHASH entry to obtain the DSA pointer allocated by the client, for storing
the statistics.
After storing the statistics, it notifies the client through its condition
variable.
3. Although the DSA is created just once, the memory inside the DSA is
allocated and released by the client
process as soon as it finishes reading the statistics.
If it fails to do so, it is deleted by the before_shmem_exit callback when
the client exits. The client's entry in DSHASH
table is also deleted when the client exits.
4. The DSA and DSHASH table are not created
until pg_get_process_memory_context function is called.
Once created, any client backend querying statistics and any PostgreSQL
process publishing statistics will
attach to the same area and table.
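As a rough sketch of the client side of steps 1 and 2 above (identifiers such
as entry, memstats_cv, memstats_dsa_pointer, PROCSIG_GET_MEMORY_CONTEXT and
the wait-event name are illustrative assumptions, not the patch's actual
names):

    /* ask the target process to publish, then wait on the condition variable */
    SendProcSignal(target_pid, PROCSIG_GET_MEMORY_CONTEXT, procNumber);

    ConditionVariablePrepareToSleep(&entry->memstats_cv);
    while (!DsaPointerIsValid(entry->memstats_dsa_pointer))
    {
        /* timeout argument of the SQL function, converted to milliseconds */
        if (ConditionVariableTimedSleep(&entry->memstats_cv, timeout * 1000L,
                                        WAIT_EVENT_MEM_CXT_PUBLISH))
            break;              /* timed out, give up and report what we have */
    }
    ConditionVariableCancelSleep();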
Please let me know your thoughts.
Thank you,
Rahila Syed
[1]. PostgreSQL: Re: pgsql: Add function to get memory context stats for
processes
<https://www.postgresql.org/message-id/CA%2BTgmoaey-kOP1k5FaUnQFd1fR0majVebWcL8ogfLbG_nt-Ytg%40mail.gmail.com>
[2]. PostgreSQL: Re: Prevent an error on attaching/creating a DSM/DSA from
an interrupt handler.
<https://www.postgresql.org/message-id/flat/8B873D49-E0E5-4F9F-B8D6-CA4836B825CD%40yesql.se#7026d2fe4ab0de6dd5decd32eb9c585a>
On Wed, Apr 30, 2025 at 4:13 PM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
> > On 30 Apr 2025, at 12:14, Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> >
> > On 29.04.25 15:13, Rahila Syed wrote:
> >> Please find attached a patch with some comments and documentation
> changes.
> >> Additionaly, added a missing '\0' termination to "Remaining Totals"
> string.
> >> I think this became necessary after we replaced dsa_allocate0()
> >> with dsa_allocate() is the latest version.
> >
> > > strncpy(nameptr, "Remaining Totals", namelen);
> > > + nameptr[namelen] = '\0';
> >
> > Looks like a case for strlcpy()?
>
> True. I did go ahead with the strncpy and nul terminator assignment,
> mostly
> out of muscle memory, but I agree that this would be a good place for a
> strlcpy() instead.
>
> --
> Daniel Gustafsson
>
>
Attachment | Content-Type | Size |
---|---|---|
v30-0001-Add-pg_get_process_memory_context-function.patch | application/octet-stream | 60.0 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-07-29 13:40:26 |
Message-ID: | CAH2L28vCCgye_+kJt22RAFzZfYbO7ytSrp-hR6-SenBcm_cN+w@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Please find attached an updated patch. It contains the following changes.
1. It needed a rebase as highlighted by cfbot
<https://cfbot.cputube.org/patch_5938.log>. The method for adding an
LWLock was updated in commit 2047ad068139f0b8c6da73d0b845ca9ba30fb33d, so
the patch has been adjusted to reflect this change.
2. Updated some comments to align with the latest patch design.
3. Eliminated an unnecessary assertion
Thank you,
Rahila Syed
On Fri, Jul 11, 2025 at 9:01 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hi,
>
> Please find attached the latest memory context statistics monitoring
> patch.
> It has been redesigned to address several issues highlighted in the thread
> [1]
> and [2].
>
> Here are some key highlights of the new design:
>
> - All DSA processing has been moved out of the CFI handler function. Now,
> all the dynamic shared memory
> needed to store the statistics is created and deleted in the client
> function. This change addresses concerns
> that DSA APIs are too high level to be safely called from interrupt
> handlers. There was also a concern that
> DSA API calls might not provide re-entrancy, which could cause issues if
> CFI is invoked from a DSA function
> in the future.
>
> - The static shared memory array has been replaced with a DSHASH table
> which now holds metadata such as
> pointers to actual statistics for each process.
>
> - dsm_registry.c APIs are used for creating and attaching to DSA and
> DSHASH table, which helps prevent code
> duplication.
>
> -To address the memory leak concern, we create an exclusive memory context
> under the NULL context, which
> does not fall under the TopMemoryContext tree, to handle all the memory
> allocations in ProcessGetMemoryContextInterrupt.
> This ensures the memory context created by the function does not affect
> its outcome.
> The memory context is reset at the end of the function, which helps
> prevent any memory leaks.
>
> - Changes made to the mcxt.c file have been relocated to mcxtfuncs.c,
> which now contains all the existing
> memory statistics-related functions along with the code for the proposed
> function.
>
> The overall flow of a request is as follows:
>
> 1. A client backend running the pg_get_process_memory_contexts function
> creates a DSA and allocates memory
> to store statistics, tracked by DSA pointer. This pointer is stored in a
> DSHASH entry for each client querying the
> statistics of any process.
> The client shares its DSHASH table key with the server process using a
> static shared array of keys indexed
> by the server's procNumber. It notifies the server process to publish
> statistics by using SendProcSignal.
>
> 2. When a PostgreSQL server process handles the request for memory
> statistics, the CFI function accesses the
> client hash key stored in its procNumber slot of the shared keys array.
> The server process then retrieves the
> DSHASH entry to obtain the DSA pointer allocated by the client, for
> storing the statistics.
> After storing the statistics, it notifies the client through its
> condition variable.
>
> 3. Although the DSA is created just once, the memory inside the DSA is
> allocated and released by the client
> process as soon as it finishes reading the statistics.
> If it fails to do so, it is deleted by the before_shmem_exit callback when
> the client exits. The client's entry in DSHASH
> table is also deleted when the client exits.
>
> 4. The DSA and DSHASH table are not created
> until pg_get_process_memory_context function is called.
> Once created, any client backend querying statistics and any PostgreSQL
> process publishing statistics will
> attach to the same area and table.
>
> Please let me know your thoughts.
>
> Thank you,
> Rahila Syed
>
> [1]. PostgreSQL: Re: pgsql: Add function to get memory context stats for
> processes
> <https://www.postgresql.org/message-id/CA%2BTgmoaey-kOP1k5FaUnQFd1fR0majVebWcL8ogfLbG_nt-Ytg%40mail.gmail.com>
> [2]. PostgreSQL: Re: Prevent an error on attaching/creating a DSM/DSA
> from an interrupt handler.
> <https://www.postgresql.org/message-id/flat/8B873D49-E0E5-4F9F-B8D6-CA4836B825CD%40yesql.se#7026d2fe4ab0de6dd5decd32eb9c585a>
>
> On Wed, Apr 30, 2025 at 4:13 PM Daniel Gustafsson <daniel(at)yesql(dot)se> wrote:
>
>> > On 30 Apr 2025, at 12:14, Peter Eisentraut <peter(at)eisentraut(dot)org>
>> wrote:
>> >
>> > On 29.04.25 15:13, Rahila Syed wrote:
>> >> Please find attached a patch with some comments and documentation
>> changes.
>> >> Additionaly, added a missing '\0' termination to "Remaining Totals"
>> string.
>> >> I think this became necessary after we replaced dsa_allocate0()
>> >> with dsa_allocate() is the latest version.
>> >
>> > > strncpy(nameptr, "Remaining Totals", namelen);
>> > > + nameptr[namelen] = '\0';
>> >
>> > Looks like a case for strlcpy()?
>>
>> True. I did go ahead with the strncpy and nul terminator assignment,
>> mostly
>> out of muscle memory, but I agree that this would be a good place for a
>> strlcpy() instead.
>>
>> --
>> Daniel Gustafsson
>>
>>
Attachment | Content-Type | Size |
---|---|---|
v31-0001-Add-pg_get_process_memory_context-function.patch | application/octet-stream | 59.6 KB |
From: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Subject: | Re: Enhancing Memory Context Statistics Reporting |
Date: | 2025-08-08 09:26:52 |
Message-ID: | CAH2L28t=O1k+5wdcP88rgty3OLZisTU72WGH8Dp2MxJjwn7=fw@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
CFbot indicated that the patch requires a rebase, so I've attached an
updated version.
The documentation for this feature is now included in the new
func-admin.sgml file,
due to recent changes in the documentation of SQL functions.
The following are results from a performance test:
pgbench is initialized as follows:
pgbench -i -s 100 postgres
Test 1:
pgbench -c 16 -j 16 postgres -T 100
TPS: 745.02 (average of 3 runs)
Test 2:
pgbench -c 16 -j 16 postgres -T 100
while the memory usage of a random postgres process is monitored concurrently
every 0.1 seconds, using the following method:
SELECT * FROM pg_get_process_memory_contexts(
(SELECT pid FROM pg_stat_activity
ORDER BY random() LIMIT 1)
, false, 5);
TPS: 750.66 (average of 3 runs)
I have not observed any performance decline resulting from the concurrent
execution
of the memory monitoring function.
Thank you,
Rahila Syed
On Tue, Jul 29, 2025 at 7:10 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hi,
>
> Please find attached an updated patch. It contains the following changes.
>
> 1. It needed a rebase as highlighted by cfbot
> <https://cfbot.cputube.org/patch_5938.log>. The method for adding an
> LWLock was updated in commit-2047ad068139f0b8c6da73d0b845ca9ba30fb33d, so
> the patch has been adjusted to reflect this change.
> 2. Updated some comments to align with the latest patch design.
> 3. Eliminated an unnecessary assertion
>
> Thank you,
> Rahila Syed
>
> On Fri, Jul 11, 2025 at 9:01 PM Rahila Syed <rahilasyed90(at)gmail(dot)com>
> wrote:
>
>> Hi,
>>
>> Please find attached the latest memory context statistics monitoring
>> patch.
>> It has been redesigned to address several issues highlighted in the
>> thread [1]
>> and [2].
>>
>> Here are some key highlights of the new design:
>>
>> - All DSA processing has been moved out of the CFI handler function. Now,
>> all the dynamic shared memory
>> needed to store the statistics is created and deleted in the client
>> function. This change addresses concerns
>> that DSA APIs are too high level to be safely called from interrupt
>> handlers. There was also a concern that
>> DSA API calls might not provide re-entrancy, which could cause issues if
>> CFI is invoked from a DSA function
>> in the future.
>>
>> - The static shared memory array has been replaced with a DSHASH table
>> which now holds metadata such as
>> pointers to actual statistics for each process.
>>
>> - dsm_registry.c APIs are used for creating and attaching to DSA and
>> DSHASH table, which helps prevent code
>> duplication.
>>
>> -To address the memory leak concern, we create an exclusive memory
>> context under the NULL context, which
>> does not fall under the TopMemoryContext tree, to handle all the memory
>> allocations in ProcessGetMemoryContextInterrupt.
>> This ensures the memory context created by the function does not affect
>> its outcome.
>> The memory context is reset at the end of the function, which helps
>> prevent any memory leaks.
>>
>> - Changes made to the mcxt.c file have been relocated to mcxtfuncs.c,
>> which now contains all the existing
>> memory statistics-related functions along with the code for the proposed
>> function.
>>
>> The overall flow of a request is as follows:
>>
>> 1. A client backend running the pg_get_process_memory_contexts function
>> creates a DSA and allocates memory
>> to store statistics, tracked by DSA pointer. This pointer is stored in a
>> DSHASH entry for each client querying the
>> statistics of any process.
>> The client shares its DSHASH table key with the server process using a
>> static shared array of keys indexed
>> by the server's procNumber. It notifies the server process to publish
>> statistics by using SendProcSignal.
>>
>> 2. When a PostgreSQL server process handles the request for memory
>> statistics, the CFI function accesses the
>> client hash key stored in its procNumber slot of the shared keys array.
>> The server process then retrieves the
>> DSHASH entry to obtain the DSA pointer allocated by the client, for
>> storing the statistics.
>> After storing the statistics, it notifies the client through its
>> condition variable.
>>
>> 3. Although the DSA is created just once, the memory inside the DSA is
>> allocated and released by the client
>> process as soon as it finishes reading the statistics.
>> If it fails to do so, it is deleted by the before_shmem_exit callback
>> when the client exits. The client's entry in DSHASH
>> table is also deleted when the client exits.
>>
>> 4. The DSA and DSHASH table are not created
>> until pg_get_process_memory_context function is called.
>> Once created, any client backend querying statistics and any PostgreSQL
>> process publishing statistics will
>> attach to the same area and table.
>>
>> Please let me know your thoughts.
>>
>> Thank you,
>> Rahila Syed
>>
>> [1]. PostgreSQL: Re: pgsql: Add function to get memory context stats for
>> processes
>> <https://www.postgresql.org/message-id/CA%2BTgmoaey-kOP1k5FaUnQFd1fR0majVebWcL8ogfLbG_nt-Ytg%40mail.gmail.com>
>> [2]. PostgreSQL: Re: Prevent an error on attaching/creating a DSM/DSA
>> from an interrupt handler.
>> <https://www.postgresql.org/message-id/flat/8B873D49-E0E5-4F9F-B8D6-CA4836B825CD%40yesql.se#7026d2fe4ab0de6dd5decd32eb9c585a>
>>
>> On Wed, Apr 30, 2025 at 4:13 PM Daniel Gustafsson <daniel(at)yesql(dot)se>
>> wrote:
>>
>>> > On 30 Apr 2025, at 12:14, Peter Eisentraut <peter(at)eisentraut(dot)org>
>>> wrote:
>>> >
>>> > On 29.04.25 15:13, Rahila Syed wrote:
>>> >> Please find attached a patch with some comments and documentation
>>> changes.
>>> >> Additionaly, added a missing '\0' termination to "Remaining Totals"
>>> string.
>>> >> I think this became necessary after we replaced dsa_allocate0()
>>> >> with dsa_allocate() is the latest version.
>>> >
>>> > > strncpy(nameptr, "Remaining Totals", namelen);
>>> > > + nameptr[namelen] = '\0';
>>> >
>>> > Looks like a case for strlcpy()?
>>>
>>> True. I did go ahead with the strncpy and nul terminator assignment,
>>> mostly
>>> out of muscle memory, but I agree that this would be a good place for a
>>> strlcpy() instead.
>>>
>>> --
>>> Daniel Gustafsson
>>>
>>>
Attachment | Content-Type | Size |
---|---|---|
v32-0001-Add-pg_get_process_memory_context-function.patch | application/octet-stream | 59.6 KB |