Proposal: Progressive explain

From: Rafael Thofehrn Castro <rafaelthca@gmail.com>
To: pgsql-hackers@postgresql.org
Subject: Proposal: Progressive explain
Date: 2024-12-30 01:18:58
Message-ID: CAG0ozMpD27fCyVFoa82G7uV4md4wVYw7DWC-PVA6P5=1v35tsA@mail.gmail.com
Lists: pgsql-hackers

Hello community,

CONTEXT:

Back in October I presented the talk "Debugging active queries with
mid-flight instrumented explain plans" at PGConf EU 2024
(recording: https://www.youtube.com/watch?v=6ahTb-7C05c), presenting
an experimental feature that enables visualization of in-progress
EXPLAIN ANALYZE executions. Given the positive feedback and requests,
I am sending this patch with the feature, which I am calling Progressive
Explain.

PROPOSAL:

This proposal introduces a feature to print execution plans of active
queries in an in-memory shared hash object so that other sessions can
visualize them with a new view: pg_stat_progress_explain.

Plans are only printed if the new GUC parameter progressive_explain is
enabled.

For regular queries or queries started with EXPLAIN (without ANALYZE)
the plan is printed only once at the start.

For instrumented runs (started via EXPLAIN ANALYZE or when the auto_explain
flag log_analyze is enabled), the plan is printed at a fixed interval
controlled by the new GUC parameter progressive_explain_interval. This plan
includes all instrumentation stats computed so far (per-node rows and
execution time).

New view:
- pg_stat_progress_explain
  - pid: PID of the process running the query
  - last_explain: timestamp when the plan was last printed
  - explain_count: number of times the plan was printed
  - total_explain_time: accumulated time spent printing plans (in ms)
  - explain: the actual plan (limited read privileges)

New GUCs:
- progressive_explain: if progressive plans are printed for the local
session.
  - type: bool
  - default: off
  - context: user
- progressive_explain_interval: interval between each explain print.
  - type: int
  - default: 1s
  - min: 10ms
  - context: user
- progressive_explain_sample_rate: fraction of rows processed by the
query at which progressive_explain_interval is evaluated to decide
whether to print a progressive plan.
  - type: floating point
  - default: 0.1
  - range: (0.0 - 1.0)
  - context: user
- progressive_explain_output_size: max output size of the plan
printed in the in-memory hash table.
  - type: int
  - default: 4096
  - min: 100
  - context: postmaster
- progressive_explain_format: format used to print the plans.
  - type: enum
  - default: text
  - context: user
- progressive_explain_settings: controls whether information about
modified configuration is added to the printed plan.
  - type: bool
  - default: off
  - context: user
- progressive_explain_verbose: controls whether verbose details are
added to the printed plan.
  - type: bool
  - default: off
  - context: user
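
For illustration, here is a minimal sketch of how the boolean GUC could be
wired into guc_tables.c; this is not the patch's actual code, and the
config group in particular is a guess:

static bool progressive_explain = false;

/* entry in ConfigureNamesBool[] (sketch) */
{
    {"progressive_explain", PGC_USERSET, STATS_MONITORING,
        gettext_noop("Prints progressive plans of queries run by the local session."),
        NULL
    },
    &progressive_explain,
    false,
    NULL, NULL, NULL
},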

DEMONSTRATION:

postgres=# SET progressive_explain = ON;
SET
postgres=# EXPLAIN ANALYZE SELECT *
FROM test t1
UNION ALL
SELECT *
FROM test t1;

postgres=# select * from pg_stat_progress_explain;
-[ RECORD 1 ]------+---------------------------------------------------------------------------------------------------------------------------------------
pid                | 299663
last_explain       | 2024-12-29 22:40:33.016833+00
explain_count      | 5
total_explain_time | 0.205
explain            | Append  (cost=0.00..466670.40 rows=20000160 width=37) (actual time=0.052..3372.318 rows=14013813 loops=1)
                   |   Buffers: shared hit=4288 read=112501
                   |   ->  Seq Scan on test t1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.052..1177.428 rows=10000000 loops=1)
                   |         Buffers: shared hit=4288 read=79046
                   |   ->  Seq Scan on test t1_1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.072..608.481 rows=4013813 loops=1) (current)
                   |         Buffers: shared read=33455
                   |

IMPLEMENTATION HIGHLIGHTS:

- The initial plan is printed at the end of standard_ExecutorStart
if progressive_explain is enabled, for both regular queries and
instrumented ones (EXPLAIN ANALYZE):

/*
 * Start progressive explain if enabled.
 */
if (progressive_explain)
    ProgressiveExplainBegin(queryDesc);

- The incremental plan print for instrumented runs uses a dedicated
ExecProcNode if progressive_explain is enabled:

if (node->instrument)
    if (progressive_explain)
        node->ExecProcNode = ExecProcNodeInstrExplain;
    else
        node->ExecProcNode = ExecProcNodeInstr;
else
    node->ExecProcNode = node->ExecProcNodeReal;

- ExecProcNodeInstrExplain is identical to ExecProcNodeInstr with an
additional part to print plans based on a sampling logic:

/*
 * Update progressive explain based on sampling.
 */
if (pg_prng_double(&pg_global_prng_state) < progressive_explain_sample_rate)
    ProgressiveExplainUpdate(node);

That logic was added because ExecProcNodeInstrExplain is called once per
row processed (a lot of times), and performing the timestamp interval
check with progressive_explain_interval to decide whether to print
the plan (done inside ProgressiveExplainUpdate) is expensive. Benchmarks
(shared at the end of this email) show that sampling + timestamp check
gives much better results than performing the timestamp check at every
ExecProcNodeInstrExplain call.
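
Presumably the interval check inside ProgressiveExplainUpdate() then looks
roughly like this (a sketch; the last_print bookkeeping is assumed):

void
ProgressiveExplainUpdate(PlanState *node)
{
    TimestampTz now = GetCurrentTimestamp();

    /* the relatively expensive check that the sampling gate amortizes */
    if (!TimestampDifferenceExceeds(last_print, now,
                                    progressive_explain_interval))
        return;

    last_print = now;
    /* ... regenerate the plan output and store it in the shared hash ... */
}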

- The plans are stored in a shared hash object (explainArray) allocated
at database start, similar to procArray. ExplainHashShmemSize() computes
the shared memory needed for it, based on max_connections + max_parallel_workers
for the number of elements in the array and progressive_explain_output_size
for the size per element.
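
A sketch of what that sizing can look like, using the standard
hash_estimate_size() helper (the entry layout is illustrative):

static Size
ExplainHashShmemSize(void)
{
    long    nelements = MaxConnections + max_parallel_workers;
    Size    entrysize;

    /* fixed header plus the configurable plan text buffer */
    entrysize = offsetof(explainHashEntry, plan) +
        progressive_explain_output_size;

    return hash_estimate_size(nelements, entrysize);
}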

- A memory context release callback is configured in the memory context
where the query is running; it is responsible for updating explainArray
even when the query doesn't finish gracefully.
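
This presumably builds on the standard MemoryContextCallback mechanism; a
minimal sketch (the cleanup function and its name are illustrative):

static void
ProgressiveExplainCleanup(void *arg)
{
    /* runs on context reset/delete, including on error aborts */
    ProgressiveExplainRemoveEntry((QueryDesc *) arg);
}

/* at query start, in the memory context the query runs in */
MemoryContextCallback *cb;

cb = MemoryContextAlloc(CurrentMemoryContext,
                        sizeof(MemoryContextCallback));
cb->func = ProgressiveExplainCleanup;
cb->arg = queryDesc;
MemoryContextRegisterResetCallback(CurrentMemoryContext, cb);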

- Instrumented plans being printed incrementally need to clone
instrumentation objects to change them, so each print uses a dedicated
memory context that gets released after the output is constructed. This
avoids prolonged private memory usage:

/* Dedicated memory context for the current plan being printed */
tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
                               "Progressive Explain Temporary Context",
                               ALLOCSET_DEFAULT_SIZES);
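
The per-print lifecycle is then roughly (sketch):

oldCxt = MemoryContextSwitchTo(tmpCxt);

/* clone instrumentation, build the explain output, copy it out */

MemoryContextSwitchTo(oldCxt);
MemoryContextDelete(tmpCxt);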

- A new version of InstrEndLoop (InstrEndLoopForce) was created that can
be called on in-progress instrumentation objects. Those are common
when traversing the plan tree of an active query.

- Column explain from pg_stat_progress_explain can only be visualized by
superusers or the same role that is running the query. If neither
condition is met, users will see "<insufficient privilege>".
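
The check itself is presumably similar to what pg_stat_activity does for
its query column; an illustrative sketch (the entry fields and the
values[] index are made up):

if (superuser() || entry->userid == GetUserId())
    values[4] = CStringGetTextDatum(entry->plan);
else
    values[4] = CStringGetTextDatum("<insufficient privilege>");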

- For instrumented runs, the printed plan includes 2 per-node modifiers when
appropriate:

<current>: the plan node currently being processed.
<never executed>: a plan node not processed yet.

IMPLEMENTATION OVERHEAD:

When not used, the overhead added is:

- One IF at standard_ExecutorStart to check if progressive_explain is
enabled
- For instrumented runs (EXPLAIN ANALYZE), one IF at ExecProcNodeFirst
to define ExecProcNode wrapper

BENCHMARKS:

Performed 3 scenarios of benchmarks:

A) Comparison between unpatched PG18, patched with progressive explain
disabled, and patched with the feature enabled globally (all queries
printing the plan at query start):

- PG18 without patch:

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2173978
latency average = 1.655 ms
tps = 18127.977529 (without initial connection time)

- PG18 with patch:

-- progressive_explain = off

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2198806
latency average = 1.636 ms
tps = 18333.157809 (without initial connection time)

-- progressive_explain = on (prints plan only once per query)

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
number of transactions actually processed: 2047459
latency average = 1.756 ms
tps = 17081.477199 (without initial connection time)

B) EXPLAIN ANALYZE performance with different progressive_explain_interval
settings in patched:

-- progressive_explain = off

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 27
latency average = 4492.845 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 0.01 (default)

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 26
latency average = 4656.067 ms

-- progressive_explain = on
-- progressive_explain_interval = 10ms
-- progressive_explain_sample_rate = 0.01 (default)

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 26
latency average = 4785.608 ms

C) EXPLAIN ANALYZE performance in patched with and without
progressive_explain_sample_rate, i.e., sampling with 2 different values
and also no sampling logic:

-- progressive_explain = off

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 27
latency average = 4492.845 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 0.01 (default)

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 26
latency average = 4656.067 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- progressive_explain_sample_rate = 1

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 19
latency average = 6432.902 ms

-- progressive_explain = on
-- progressive_explain_interval = 1s (default)
-- NO SAMPLING LOGIC

postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
number of transactions actually processed: 21
latency average = 5864.820 ms

BENCHMARK RESULTS:

It definitely needs more testing but preliminary results show that:

- From (A) we see that the patch adds negligible overhead when the
feature is not used. Enabling globally reduces overall TPS as all
queries are spending time printing the plan. The idea is to enable
progressive_explain on a per-need basis, only to a subset of sessions
that need it.

- From (B) we see that using progressive explain slightly increases
total execution time. Difference between using progressive_explain_interval
set to 1s (plan printed 4 times per query in the test) and to
10ms (plan printed ~460 times per query in the test) is very small.
The actual overhead appears when changing progressive_explain_sample_rate.

- From (C) we see that progressive_explain_sample_rate with a low
value (default 0.01) performs better than not using sampling or
using progressive_explain_sample_rate = 1. So the overhead of having
the sampling logic is much lower than not sampling at all.

TESTS:

Currently working on tests for a second version of the patch.

DOCUMENTATION:

Added documentation for the new view pg_stat_progress_explain,
new GUCs and a new item in section 14.1:

14.1. Using EXPLAIN
14.1.1. EXPLAIN Basics
14.1.2. EXPLAIN ANALYZE
14.1.3. Caveats
14.1.4. Progressive EXPLAIN

FURTHER DISCUSSION:

Considering that this patch introduces a new major feature with
several new components (view, GUCs, etc), there is room for
discussion, such as:

- Do the columns in pg_stat_progress_explain make sense? Are we
missing or adding unnecessary columns?

- Do the new GUCs make sense and are their default values appropriate?

- Do we want progressive explain to print plans of regular queries
started without EXPLAIN if progressive_explain is enabled or should
the feature be restricted to instrumented queries (EXPLAIN ANALYZE)?

- Is the size of explainHash based on max_connections + max_parallel_workers
large enough or are there other types of backends that use the
executor and will print plans too?

Regards,

Rafael Castro.

Attachment Content-Type Size
v1-0001-Proposal-for-progressive-explains.patch application/octet-stream 57.8 KB

From: Greg Sabino Mullane <htamfids@gmail.com>
To: Rafael Thofehrn Castro <rafaelthca@gmail.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2024-12-30 16:37:14
Message-ID: CAKAnmm+PrrEewDktzDFD+v-PRTmQbfjOnJksKvCTtrb9rkGOHA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 29, 2024 at 8:19 PM Rafael Thofehrn Castro <rafaelthca@gmail.com>
wrote:

> Plans are only printed if the new GUC parameter progressive_explain is
> enabled.
>

Maybe track_explain instead? In the spirit of track_activity.

> - progressive_explain_output_size: max output size of the plan printed in
> the in-memory hash table.
>   - default: 4096
>   - min: 100
>

4096 seems low, if this means the explain plan is truncated at that size.
Also, the 100 minimum seems arbitrary.

So we can enable verbose and settings - but not wal? I could see that one
being useful. Not so much the rest (timing, summary). And buffers has
recently changed, so no need to worry about that. :)

> - The plans are stored in a shared hash object (explainArray) allocated at
> database start, similar to procArray. ExplainHashShmemSize() computes
> shared memory needed for it, based on max_connections +
> max_parallel_workers for the amount of elements in the array and
> progressive_explain_output_size for the size per element.
>

Hmmm...don't have a solution/suggestion offhand, but using max_connections
would seem to be allocating a chunk of memory that is never used 99% of the
time, as most people don't run active queries near max_connections.

(Actually, on re-reading my draft, I would prefer a rotating pool like
pg_stat_statements does.)

> - Column explain from pg_stat_progress_explain can only be visualized by
> superusers or the same role that is running the query. If neither
> condition is met, users will see "<insufficient privilege>".
>

Or pg_read_all_stats I presume? Are those other columns (e.g.
explain_count) being visible to anyone really useful, or can we throw them
all behind the same permission restriction?

> - From (B) we see that using progressive explain slightly increases total
> execution time.
>

Is this using the default dirt-simple pgbench queries? What about queries
that generate very large explain plans?

> - Do the columns in pg_stat_progress_explain make sense? Are we missing or
> adding unnecessary columns?
>

Perhaps having the interval and sample rate in here as well, since they are
user-level and thus could be different from other rows in the view. It is
tempting to throw in other things as well like the query_start and datname,
but we don't want to copy all of pg_stat_activity...

It's not clear if total_explain_time is now() - query_start or something
else. If not, I would love to see an elapsed time interval column.

Perhaps add a leader_pid column. That's something I would always be joining
with pg_stat_activity to find out.

> - Do we want progressive explain to print plans of regular queries started
> without EXPLAIN if progressive_explain is enabled or should
> the feature be restricted to instrumented queries (EXPLAIN ANALYZE)?
>

The latter, but no strong opinion.

id="guc-progressive-explain"

The new docs should mention the view name here, IMO, in addition to the
existing link that has details.

Random idea: track_explain_min_duration

Looks very cool overall, +1.

Cheers,
Greg


From: Sami Imseih <samimseih@gmail.com>
To: Rafael Thofehrn Castro <rafaelthca@gmail.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2024-12-30 17:43:26
Message-ID: CAA5RZ0uGDKWxqUCMrsWKV425T2f6mqJsXKg6chq+WuyCwNPUGw@mail.gmail.com
Lists: pgsql-hackers

> This proposal introduces a feature to print execution plans of active
> queries in an in-memory shared hash object so that other sessions can
> visualize them with a new view: pg_stat_progress_explain.

Thanks for this thread and for sharing the presentation
material. +1 for the idea of adding instrumentation that
will help users understand the bottlenecks in execution
plans. I want to share my perspective on this topic.

A DBA will want to know:
1/ Where is the bottleneck for a long-running query currently
in flight?
2/ For an OLTP workload with many quick plans that
could be further optimized, what plan and what
part of the plan is contributing to the database load?

Having a view like pg_stat_progress_explain (maybe a more appropriate
name is pg_stat_progress_plan) would be extremely useful, allowing a
user to build monitoring dashboards that can answer such questions.

I do not think, however, that this instrumentation should only be
made available when a user runs EXPLAIN ANALYZE.
In my opinion, that would severely limit the usefulness of this
instrumentation in production. Of course, one can use auto_explain,
but users will be hesitant to enable auto_explain with analyze in
production for all their workloads. Also, there should not be an
auto_explain dependency for this feature.

One approach would be for the view to expose the
explain plan and the current node being executed. I think the
plan_node_id can be exposed for this purpose but I have not looked
into this in much detail yet. The plan_node_id can then be used
to locate the part of the plan that is a potential bottleneck (if that
plan node is the one constantly being called).

This may also be work that is better suited for an extension, but
core would need to add a hook in ExecProcNode so an extension can
have access to PlanState.
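
Something along these lines, hypothetically (the hook does not exist
today; all names are invented for illustration):

/* in executor.h */
typedef void (*ExecProcNode_hook_type) (PlanState *node);
extern PGDLLIMPORT ExecProcNode_hook_type ExecProcNode_hook;

/* wrapper chosen by ExecProcNodeFirst() when the hook is set */
static TupleTableSlot *
ExecProcNodeWithHook(PlanState *node)
{
    /* extension gets access to the PlanState, e.g. plan_node_id */
    (*ExecProcNode_hook) (node);

    return node->ExecProcNodeReal(node);
}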

Regards,

Sami Imseih
Amazon Web Services (AWS)


From: jian he <jian.universality@gmail.com>
To: Rafael Thofehrn Castro <rafaelthca@gmail.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2024-12-31 06:00:18
Message-ID: CACJufxFvU34OvfjFKKkvDDm743KZ6RThtKL+Xmd-mhqwF7fHTg@mail.gmail.com
Lists: pgsql-hackers

hi.

[48/208] Compiling C object
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o
FAILED: contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o
/usr/local/gcc-14.1.0/bin/gcc-14.1.0
-Icontrib/postgres_fdw/postgres_fdw.so.p -Isrc/include
-I../../Desktop/pg_src/src5/postgres/src/include
-Isrc/interfaces/libpq
-I../../Desktop/pg_src/src5/postgres/src/interfaces/libpq
-I/usr/include/libxml2 -fdiagnostics-color=always --coverage
-D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -g
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE
-Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3
-Wcast-function-type -Wshadow=compatible-local -Wformat-security
-Wdeclaration-after-statement -Wmissing-variable-declarations
-Wno-format-truncation -Wno-stringop-truncation -Wunused-variable
-Wuninitialized -Werror=maybe-uninitialized -Wreturn-type
-DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES
-DREALLOCATE_BITMAPSETS -DLOCK_DEBUG -DRELCACHE_FORCE_RELEASE
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
-DRAW_EXPRESSION_COVERAGE_TEST -fno-omit-frame-pointer -fPIC -pthread
-fvisibility=hidden -MD -MQ
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o -MF
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o.d -o
contrib/postgres_fdw/postgres_fdw.so.p/postgres_fdw.c.o -c
../../Desktop/pg_src/src5/postgres/contrib/postgres_fdw/postgres_fdw.c
In file included from
../../Desktop/pg_src/src5/postgres/contrib/postgres_fdw/postgres_fdw.c:22:
../../Desktop/pg_src/src5/postgres/src/include/commands/explain.h:86:9:
error: unknown type name ‘TimestampTz’
86 | TimestampTz last_explain;
| ^~~~~~~~~~~
[58/188] Linking target contrib/sslinfo/sslinfo.so
ninja: build stopped: subcommand failed.

compile failed. the above is the error message.


From: Rafael Thofehrn Castro <rafaelthca@gmail.com>
To: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2024-12-31 16:23:03
Message-ID: CAG0ozMrtK_u8Uf5KNZUmRNuMphV5tnC5DEhRBNRGK+K4L506xw@mail.gmail.com
Lists: pgsql-hackers

Thanks Greg, Sami and Jian for the feedback so far.

> Maybe track_explain instead? In the spirit of track_activity.

That was the original name, and all other GUCs were following the
track_activity_* logic. I changed it to the name of the feature after
discussion with colleagues at EDB. This is definitely open for
discussion.

> 4096 seems low, if this means the explain plan is truncated at
> that size. Also, the 100 minimum seems arbitrary.

Min (100) and max (1048576) are the same as the values for GUC
track_activity_query_size, which has a very similar purpose: it controls
the size of the pg_stat_activity.query column.

> So we can enable verbose and settings - but not wal? I could see
> that one being useful. Not so much the rest (timing, summary). And
> buffers has recently changed, so no need to worry about that. :)

The logic I used for adding GUCs that control explain options is that
none of these settings should change QueryDesc->instrument_options,
which would change the instrumentation options added to the actual
execution. The GUCs available modify only the ExplainState object, which
affects only the output printed to pg_stat_progress_explain.
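
In other words, the GUCs only flow into the ExplainState used for the
progressive output, roughly like this (a sketch over the existing
ExplainState fields):

ExplainState *es = NewExplainState();

es->verbose = progressive_explain_verbose;
es->settings = progressive_explain_settings;
es->format = progressive_explain_format;
/* QueryDesc->instrument_options is deliberately left untouched */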

> Hmmm...don't have a solution/suggestion offhand, but using max_connections
> would seem to be allocating a chunk of memory that is never
> used 99% of the time, as most people don't run active queries
> near max_connections.
> (Actually, on re-reading my draft, I would prefer a rotating pool
> like pg_stat_statements does.)

Agreed. I thought about using logic similar to pg_stat_statements,
but the pool size there is very large by default (5000). The difference
is that pg_stat_statements keeps the data on disk, and I wanted to
avoid that since instrumented runs can print new plans very often,
affecting performance.

Maybe one idea would be to include a new GUC (progressive_explain_max_size)
that controls how many rows explainArray can have. If the limit is reached,
a backend won't print anything in that iteration.

> It's not clear if total_explain_time is now() - query_start or something
> else. If not, I would love to see an elapsed time interval column.

total_explain_time is the time accumulated solely while printing
the plan. It does not include execution time.

> Perhaps add a leader_pid column. That's something I would always
> be joining with pg_stat_activity to find out.

For prints done by parallel workers? That information is available
in pg_stat_activity. The idea is to use the pid column and join with
pg_stat_activity to get all other relevant details.

> I do not think however this instrumentation should only be
> made available if a user runs EXPLAIN ANALYZE.
> In my opinion, this will severely limit the usefulness of this
> instrumentation in production. Of course, one can use auto_explain,
> but users will be hesitant to enable auto_explain with analyze in
> production for all their workloads. Also, there should not be an
> auto_explain dependency for this feature.

> One approach will be for the view to expose the
> explain plan and the current node being executed. I think the
> plan_node_id can be exposed for this purpose but have not looked
> into this in much detail yet. The plan_node_id can then be used
> to locate the part of the plan that is a potential bottleneck ( if that
> plan node is the one constantly being called ).

You mean that we could include the current node being executed even
for non-instrumented runs? In that case it would print the plain
plan + current node? That is a valid point and shouldn't be
difficult to implement. The problem is that this would require
adding overhead to ExecProcNode() (non-instrumented), and that
can be a performance killer.

> This may also be work that is better suited for an extension, but
> core will need to add a hook in ExecProcNode so an extension can
> have access to PlanState.

Are you talking about implementing your proposal (also printing the
plan with the current node for non-instrumented runs) as an extension,
or this whole patch as an extension?

If the whole patch, I thought about that. The thing is that the
proposal also changes ExplainNode() function, the core function
to print a plan. To implement it as an extension we would have to
duplicate 95% of that code.

I do think there is merit in having this feature as part of core
and using existing extensions (auto_explain for example) to
augment it, like adding your suggestion of a hook in
ExecProcNode().

> compile failed. the above is the error message.

Thanks. It was indeed missing an include. It complained only for
a complete build (including contrib), so I failed to catch it.

Sending a second version with the fix.

Rafael.


Attachment Content-Type Size
v2-0001-Proposal-for-progressive-explains.patch application/octet-stream 57.9 KB

From: jian he <jian.universality@gmail.com>
To: Rafael Thofehrn Castro <rafaelthca@gmail.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2025-01-02 08:52:58
Message-ID: CACJufxGwaGq6ScSW9P0812c1BzNWJUcWx4qLbYibqXAW-O-e-A@mail.gmail.com
Lists: pgsql-hackers

hi.

All the newly added GUCs
progressive_explain;
progressive_explain_verbose;
progressive_explain_settings;
progressive_explain_interval;
progressive_explain_output_size;
progressive_explain_format;
progressive_explain_sample_rate;
also need to be added to postgresql.conf.sample?

In doc/src/sgml/monitoring.sgml, we also need to add the
view pg_stat_progress_explain
to the section
<table id="monitoring-stats-dynamic-views-table">
<title>Dynamic Statistics Views</title>
(Table 27.1. Dynamic Statistics Views)

pg_stat_progress_explain.explain will be truncated after 4096 bytes
(the default value of progressive_explain_output_size).
So if progressive_explain_format is json
and the plan is bigger (imagine two partitioned tables joined
together, each having many partitions),
the column "explain" text may not be valid JSON.
Should we be concerned about this?

I don't really understand the actual usage of
pg_stat_progress_explain.explain_count.
The other columns' usage makes sense to me.
Can you share your idea of why we need this column?

select name, category from pg_settings
where category = 'Query Tuning / Planner Method Configuration';
shows that in config_group QUERY_TUNING_METHOD
all the GUC names generally begin with "enable".
Setting config_group to QUERY_TUNING_METHOD for all the GUC names
beginning with "progressive" may therefore not be appropriate;
they are also not related to query tuning.

#include "utils/backend_status.h"
#include "storage/procarray.h"
#include "executor/spi.h"
#include "utils/guc.h"

In src/backend/commands/explain.c,
the headers should generally be sorted in alphabetical ascending order.
Apply the same order to ipci.c, execMain.c, execProcnode.c.

else
/* Node in progress */
if (es->progressive && planstate ==
planstate->state->progressive_explain_current_node)
appendStringInfo(es->str,
" (actual rows=%.0f loops=%.0f) (current)",
rows, nloops);
else
appendStringInfo(es->str,
" (actual rows=%.0f loops=%.0f)",
rows, nloops);

the above part in src/backend/commands/explain.c ExplainNode, the
indentation looks wrong to me.


From: Rafael Thofehrn Castro <rafaelthca@gmail.com>
To: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2025-01-07 22:27:25
Message-ID: CAG0ozMoyzsxW2U__PMbqQkA7ny0fKyxV0MNjM45PDXKvyHg8oA@mail.gmail.com
Lists: pgsql-hackers

Sending rebased version to fix cfbot tests.

Attachment Content-Type Size
v3-0001-Proposal-for-progressive-explains.patch application/octet-stream 57.9 KB

From: Tomas Vondra <tomas@vondra.me>
To: Rafael Thofehrn Castro <rafaelthca@gmail.com>, pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2025-01-08 01:03:25
Message-ID: [email protected]
Lists: pgsql-hackers

Hi Rafael,

This sounds like a great feature, thanks for working on it and sharing
the patch. Let me share some comments / questions after a quick review.

On 12/30/24 02:18, Rafael Thofehrn Castro wrote:
> Hello community,
>
> CONTEXT:
>
> Back in October I presented the talk "Debugging active queries with
> mid-flight instrumented explain plans" at PGConf EU 2024
> (recording: https://www.youtube.com/watch?v=6ahTb-7C05c), presenting
> an experimental feature that enables visualization of in-progress
> EXPLAIN ANALYZE executions. Given the positive feedback and requests,
> I am sending this patch with the feature, which I am calling Progressive
> Explain.
>
>

Really nice talk, I enjoyed watching it a couple weeks back.

> PROPOSAL:
>
> This proposal introduces a feature to print execution plans of active
> queries in an in-memory shared hash object so that other sessions can
> visualize them with a new view: pg_stat_progress_explain.
>
> Plans are only printed if the new GUC parameter progressive_explain is
> enabled.
>

Aren't the names of the view / GUC a bit misleading? Because this is
really about EXPLAIN ANALYZE, not just EXPLAIN. Not sure.

> For regular queries or queries started with EXPLAIN (without ANALYZE)
> the plan is printed only once at the start.
>

Initially I thought this was a bit weird, and also inconsistent with
auto_explain (which prints plans at the end). But I think for "progress"
that makes sense, even if it's not updated after that.

> For instrumented runs (started via EXPLAIN ANALYZE or when the auto_explain
> flag log_analyze is enabled), the plan is printed at a fixed interval
> controlled by the new GUC parameter progressive_explain_interval. This plan
> includes all instrumentation stats computed so far (per-node rows and
> execution time).
>

OK, I understand why it works like this. But I rather dislike that it
relies on auto_explain to enable the progressive updates. If we want to
make this dependent on auto_explain, then most of this could/should be
moved to auto_explain, and called through a hook. Or something similar.

If we want to add this to core, then I think the progressive_explain GUC
should have "off|explain|analyze" values, or a similar way to enable the
instrumentation and regular updates. But that would probably also need
to duplicate a number of auto_explain options (e.g. to allow disabling
timings).

> New view:
> - pg_stat_progress_explain
>   - pid: PID of the process running the query
>   - last_explain: timestamp when the plan was last printed
>   - explain_count: number of times the plan was printed
>   - total_explain_time: accumulated time spent printing plans (in ms)
>   - explain: the actual plan (limited read privileges)
>

I find the "explain_time" a bit misleading. On the one hand - yes, it
does account for time spent generating the EXPLAIN output. But on the
other hand, that's only a part of the overhead - it ignores the overhead
of the extra instrumentation. Also, wouldn't it be better to rename
"explain" to "query plan"? That's what the EXPLAIN result says too.

> New GUCs:
> - progressive_explain: if progressive plans are printed for local
> session.
>   - type: bool
>   - default: off
>   - context: user
> - progressive_explain_interval: interval between each explain print.
>   - type: int
>   - default: 1s
>   - min: 10ms
>   - context: user

Seems reasonable (but I already commented on progressive_explain).

> - progressive_explain_sample_rate: fraction of rows processed by the
> query at which progressive_explain_interval is evaluated to decide
> whether to print a progressive plan.
>   - type: floating point
>   - default: 0.1
>   - range: (0.0 - 1.0)
>   - context: user

I find this pretty weird / unnecessary. More comments later.

> - progressive_explain_output_size: max output size of the plan
> printed in the in-memory hash table.
>   - type: int
>   - default: 4096
>   - min: 100
>   - context: postmaster

Seems far too small.

> - progressive_explain_format: format used to print the plans.
>   - type: enum
>   - default: text
>   - context: user

Good idea. When processing the view automatically, JSON would be much
better than text.

> - progressive_explain_settings: controls whether information about
> modified configuration is added to the printed plan.
>   - type: bool
>   - default: off
>   - context: user
> - progressive_explain_verbose: controls whether verbose details are
> added to the printed plan.
>   - type: bool
>   - default: off
>   - context: user
>

This seems to be the duplication of some auto_explain parameters that I
mentioned above. Except that it only duplicates some of them. I'd like
the ability to disable collecting timing info, which is usually by far
the main overhead.

> DEMONSTRATION:
>
> postgres=# SET progressive_explain = ON;
> SET
> postgres=# EXPLAIN ANALYZE SELECT *
> FROM test t1
> UNION ALL
> SELECT *
> FROM test t1;
>
> postgres=# select * from pg_stat_progress_explain;
> -[ RECORD 1 ]------+---------------------------------------------------------------------------------------------------------------------------------------
> pid                | 299663
> last_explain       | 2024-12-29 22:40:33.016833+00
> explain_count      | 5
> total_explain_time | 0.205
> explain            | Append  (cost=0.00..466670.40 rows=20000160 width=37) (actual time=0.052..3372.318 rows=14013813 loops=1)
>                    |   Buffers: shared hit=4288 read=112501
>                    |   ->  Seq Scan on test t1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.052..1177.428 rows=10000000 loops=1)
>                    |         Buffers: shared hit=4288 read=79046
>                    |   ->  Seq Scan on test t1_1  (cost=0.00..183334.80 rows=10000080 width=37) (actual time=0.072..608.481 rows=4013813 loops=1) (current)
>                    |         Buffers: shared read=33455
>                    |
>
> IMPLEMENTATION HIGHLIGHTS:
>
> - The initial plan is printed at the end of standard_ExecutorStart
> if progressive_explain is enabled, for both regular queries and
> instrumented ones (EXPLAIN ANALYZE):
>
> /*
> * Start progressive explain if enabled.
> */
> if (progressive_explain)
>   ProgressiveExplainBegin(queryDesc);
>
> - The incremental plan print for instrumented runs uses a dedicated
> ExecProcNode if progressive_explain is enabled:
>
> if (node->instrument)
>   if (progressive_explain)
>     node->ExecProcNode = ExecProcNodeInstrExplain;
>   else
>     node->ExecProcNode = ExecProcNodeInstr;
> else
>   node->ExecProcNode = node->ExecProcNodeReal;
>
> - ExecProcNodeInstrExplain is identical to ExecProcNodeInstr with an
> additional part to print plans based on a sampling logic:
>

I haven't looked at the details yet, but this seems mostly reasonable.

> /*
> * Update progressive explain based on sampling.
> */
> if (pg_prng_double(&pg_global_prng_state) < progressive_explain_sample_rate)
>   ProgressiveExplainUpdate(node);
>
> That logic was added because ExecProcNodeInstrExplain is called once per
> row processed (a lot of times) and performing the timestamp interval
> check with progressive_explain_interval to decide whether to print
> the plan (done inside ProgressiveExplainUpdate) is expensive. Benchmarks
> (shared at the end of this email) show that sampling + timestamp check
> gives much better results than performing the timestamp check at every
> ExecProcNodeInstrExplain call.
>

I don't think this is the way to deal with the high cost of collecting
timing information. It just adds unpredictability, because how would you
know what's a good sample rate? What if you pick a far too low value,
and then a plan gets logged much less frequently because for that
particular query it takes much longer to return a tuple?

I think the right solution to deal with the high cost of timing is
timeouts. Set a timeout, with a handler that sets a "pending" flag, and
check the flag before calling ProgressiveExplainUpdate(). See for
example TransactionTimeoutHandler in postinit.c.
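
Roughly this shape (a sketch; the timeout id and names are illustrative):

static volatile sig_atomic_t progressive_explain_pending = false;

static void
ProgressiveExplainTimeoutHandler(void)
{
    progressive_explain_pending = true;
}

/* at query start */
RegisterTimeout(USER_TIMEOUT, ProgressiveExplainTimeoutHandler);
enable_timeout_after(USER_TIMEOUT, progressive_explain_interval);

/* per tuple: a cheap flag test instead of reading the clock */
if (progressive_explain_pending)
{
    progressive_explain_pending = false;
    ProgressiveExplainUpdate(node);
    enable_timeout_after(USER_TIMEOUT, progressive_explain_interval);
}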

> - The plans are stored in a shared hash object (explainArray) allocated
> at database start, similar to procArray. ExplainHashShmemSize() computes
> shared memory needed for it, based on max_connections + max_parallel_workers
> for the amount of elements in the array and progressive_explain_output_size
> for the size per element.
>

Why do we need a shared hash table?

> - A memory context release callback is configured in the memory context
> where the query is running, being responsible for updating explainArray
> even when the query doesn't finish gracefully.
>

OK

> - Instrumented plans being printed incrementally need to clone
> instrumentation objects to change them, so each print uses a dedicated
> memory context that gets released after the output is constructed. This
> avoids prolonged private memory usage:
>

Yeah. I was wondering how you're going to deal with this. Wouldn't it be
cheaper to just use a static variable? I don't think this is called
recursively, and that'd save the palloc/pfree. Haven't tried and not
sure if it's worth it.

>
> /* Dedicated memory context for the current plan being printed */
> tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
>   "Progressive Explain Temporary Context",
>   ALLOCSET_DEFAULT_SIZES);
>

Maybe it'd be better to keep this memory context for the query duration
(e.g. by adding it to queryDesc), and just reset it before the calls?
That'd cache the memory, and it shouldn't really use a lot of it, no?
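
I.e. something like this (a sketch; the context field is hypothetical):

MemoryContextReset(queryDesc->explainCxt);
oldcxt = MemoryContextSwitchTo(queryDesc->explainCxt);
/* ... build the explain output ... */
MemoryContextSwitchTo(oldcxt);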

Also, ProgressiveExplainPrint() does this at the end:

+ /*
+ * Free local explain state before exiting as this function may be
+ * called multiple times in the same memory context.
+ */
+ pfree(es->str);
+ pfree(es);

Isn't this actually pointless with the local memory context?

> - A new version of InstrEndLoop (InstrEndLoopForce) was created that can
> be called on in-progress instrumentation objects. Those are common
> when traversing the plan tree of an active query.
>

No opinion yet.

> - Column explain from pg_stat_progress_explain can only be visualized by
> superusers or the same role that is running the query. If neither
> condition is met, users will see "<insufficient privilege>".
>

I think this is in line with how we restrict access to similar catalogs.

> - For instrumented runs, the printed plan includes 2 per-node modifiers when
> appropriate:
>
>   <current>: the plan node currently being processed.
>   <never executed>: a plan node not processed yet.
>

Not sure the <current> label is all that useful. It seems misleading at
best, because it's simply the last node that generated the explain. But
we might have already moved to a different node. And that node may be
stuck / very expensive, yet we'll see the plan as seemingly waiting in
some other node.

> IMPLEMENTATION OVERHEAD:
>
> When not used, the overhead added is:
>
> - One IF at standard_ExecutorStart to check if progressive_explain is
> enabled
> - For instrumented runs (EXPLAIN ANALYZE), one IF at ExecProcNodeFirst
> to define ExecProcNode wrapper
>
> BENCHMARKS:
>
> Performed 3 scenarios of benchmarks:
>
> A) Comparison between unpatched PG18, patched with progressive explain
> disabled, and patched with the feature enabled globally (all queries
> printing the plan at query start):
>
> - PG18 without patch:
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
> number of transactions actually processed: 2173978
> latency average = 1.655 ms
> tps = 18127.977529 (without initial connection time)
>
> - PG18 with patch:
>
> -- progressive_explain = off
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
> number of transactions actually processed: 2198806
> latency average = 1.636 ms
> tps = 18333.157809 (without initial connection time)
>
> -- progressive_explain = on (prints plan only once per query)
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -S -n -T 120 -c 30
> number of transactions actually processed: 2047459
> latency average = 1.756 ms
> tps = 17081.477199 (without initial connection time)
>
> B) EXPLAIN ANALYZE performance with different progressive_explain_interval
> settings in patched:
>
> -- progressive_explain = off
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
> number of transactions actually processed: 27
> latency average = 4492.845 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- progressive_explain_sample_rate = 0.01 (default)
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
> number of transactions actually processed: 26
> latency average = 4656.067 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 10ms
> -- progressive_explain_sample_rate = 0.01 (default)
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
> number of transactions actually processed: 26
> latency average = 4785.608 ms
>
> C) EXPLAIN ANALYZE performance in patched with and without
> progressive_explain_sample_rate, i.e., sampling with 2 different values
> and also no sampling logic:
>
> -- progressive_explain = off
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
> number of transactions actually processed: 27
> latency average = 4492.845 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- progressive_explain_sample_rate = 0.01 (default)
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
> number of transactions actually processed: 26
> latency average = 4656.067 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- progressive_explain_sample_rate = 1
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
> number of transactions actually processed: 19
> latency average = 6432.902 ms
>
> -- progressive_explain = on
> -- progressive_explain_interval = 1s (default)
> -- NO SAMPLING LOGIC
>
> postgres@ip-172-31-39-191:~$ /usr/local/pgsql/bin/pgbench -n -T 120 -c 1 -f script.sql
> number of transactions actually processed: 21
> latency average = 5864.820 ms
>
> BENCHMARK RESULTS:
>
> It definitely needs more testing but preliminary results show that:
>
> - From (A) we see that the patch adds negligible overhead when the
> feature is not used. Enabling globally reduces overall TPS as all
> queries are spending time printing the plan. The idea is to enable
> progressive_explain on a per-need basis, only to a subset of sessions
> that need it.
>
> - From (B) we see that using progressive explain slightly increases
> total execution time. Difference between using progressive_explain_interval
> set to 1s (plan printed 4 times per query in the test) and to
> 10ms (plan printed ~460 times per query in the test) is very small.
> The actual overhead appears when changing progressive_explain_sample_rate.
>
> - From (C) we see that progressive_explain_sample_rate with a low
> value (default 0.01) performs better than not using sampling or
> using progressive_explain_sample_rate = 1. So the overhead of having
> the sampling logic is much lower than not sampling at all.
>

No opinion. I need to do some testing / benchmarking myself.

> TESTS:
>
> Currently working on tests for a second version of the patch.
>
> DOCUMENTATION:
>
> Added documentation for the new view pg_stat_progress_explain,
> new GUCs and a new item in section 14.1:
>
> 14.1. Using EXPLAIN
>   14.1.1. EXPLAIN Basics
>   14.1.2. EXPLAIN ANALYZE
>   14.1.3. Caveats
>   14.1.4. Progressive EXPLAIN
>
> FURTHER DISCUSSION:
>
> Considering that this patch introduces a new major feature with
> several new components (view, GUCs, etc), there is open room for
> discussion such as:
>
> - Do the columns in pg_stat_progress_explain make sense? Are we
> missing or adding unnecessary columns?
>
> - Do the new GUCs make sense and are their default values appropriate?
>
> - Do we want progressive explain to print plans of regular queries
> started without EXPLAIN if progressive_explain is enabled or should
> the feature be restricted to instrumented queries (EXPLAIN ANALYZE)?
>
> - Is the size of explainHash based on max_connections + max_parallel_workers
> large enough or are there other types of backends that use the
> executor and will print plans too?
>

I've commented on some of these items earlier.

regards

--
Tomas Vondra


From: Adrien Nayrat <adrien.nayrat@anayrat.info>
To: Rafael Thofehrn Castro <rafaelthca@gmail.com>, pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2025-01-14 21:55:15
Message-ID: [email protected]
Lists: pgsql-hackers

On 12/30/24 2:18 AM, Rafael Thofehrn Castro wrote:
> Hello community,
>
> CONTEXT:
>
> Back in October I presented the talk "Debugging active queries with
> mid-flight instrumented explain plans" at PGConf EU 2024
> (recording: https://www.youtube.com/watch?v=6ahTb-7C05c), presenting
> an experimental feature that enables visualization of in-progress
> EXPLAIN ANALYZE executions. Given the positive feedback and requests,
> I am sending this patch with the feature, which I am calling Progressive
> Explain.
>
> PROPOSAL:
>
> This proposal introduces a feature to print execution plans of active
> queries in an in-memory shared hash object so that other sessions can
> visualize them with a new view: pg_stat_progress_explain.

Hello,
Thanks for your contribution! Such a feature would be very useful.
I did not look carefully at the implementation. I just wanted to let
you know someone already posted a similar feature in this thread:
https://www.postgresql.org/message-id/CADdR5ny_0dFwnD%2BsuBnV1Vz6NDKbFHeWoV1EDv9buhDCtc3aAA%40mail.gmail.com

Maybe it could give you some ideas.

Regards,

--
Adrien NAYRAT


From: Rafael Thofehrn Castro <rafaelthca@gmail.com>
To: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2025-01-29 17:41:15
Message-ID: CAG0ozMoVOAGZgXAFaD2ceM2mFk2ip03sWS1oJ9Eownzfhz4kUg@mail.gmail.com
Lists: pgsql-hackers

Hi,

Thanks for the valuable feedback Tomas. I am sending a new version of the
patch that includes:

- Changed instrumented plan printing to be based on timeouts instead of
sampling. This works perfectly and benchmarks are promising. So the new
GUC progressive_explain_sample_rate is removed.
- Removed all parts of the code where allocated memory for instrumentation
and explain objects was released manually, as this is done automatically
when the custom memory context is released. More comments on this below.
- Adjusted regression tests' expected output so tests pass.

> OK, I understand why it works like this. But I rather dislike that it
> relies on auto_explain to enable the progressive updates. If we want to
> make this dependent on auto_explain, then most of this could/should be
> moved to auto_explain, and called through a hook. Or something similar.

> If we want to add this to core, then I think the progressive_explain GUC
> should have "off|explain|analyze" values, or a similar way to enable the
> instrumentation and regular updates. But that would probably also need
> to duplicate a number of auto_explain options (e.g. to allow disabling
> timings).

> This seems to be the duplication of some auto_explain parameters that I
> mentioned above. Except that it only duplicates some of them. I'd like
> the ability to disable collecting timing info, which is usually by far
> the main overhead.

I implemented it using the current logic with the premise that this new
feature shouldn't change how queries are executed. If a query is not run
under EXPLAIN ANALYZE then the feature shouldn't transparently enable
instrumentation, which would cause additional overhead. If someone wants
an instrumented run, they explicitly call EXPLAIN ANALYZE.

That is the same reasoning behind the GUCs I came up with. The only ones
available are progressive_explain_format, progressive_explain_settings
and progressive_explain_verbose, which only affect the explain
output without changing how the query is executed.

But if we come to an agreement that "off|explain|analyze" values for
the progressive_explain GUC are a better approach, changing the logic
should be easy. Now that you mention the idea, it does make sense.

> I think the right solution to deal with high cost of timing are
> timeouts. Set a timeout, with a handler that sets a "pending" flag, and
> check the flag before calling ProgressiveExplainUpdate(). See for
> example TransactionTimeoutHandler in postinit.c.

Thanks for that. The new patch sent in this message uses a timeout instead.

> Why do we need a shared hash table?

This is where I am storing the printed plans so other backends can read
them. I saw that as the only option apart from printing to files. Do
you recommend something else?

> Yeah. I was wondering how you're going to deal with this. Wouldn't it be
> cheaper to just use a static variable? I don't think this is called
> recursively, and that'd save the palloc/pfree. Haven't tried and not
> sure if it's worth it.

You mean an instrumentation object allocated only once that gets updated
with the counters of the currently processed node? That is also an option.
Maybe it is worth testing, considering that it may save millions of
palloc/pfree calls.
Will give it a go.

> Maybe it'd be better to keep this memory context for the query duration
> (e.g. by adding it to queryDesc), and just reset it before the calls?
> That'd cache the memory, and it shouldn't really use a lot of it, no?

Instead of having a dedicated memory context, use the same context as the
one running the query? This would then require manually freeing the
ExplainState allocated in each ProgressiveExplainPrint() call (in this
patch I removed that manual allocation, which wasn't currently needed)
and also the allocated instrumentation objects (which will no longer be
needed if we use a static variable). This would save memory context
creation/release and is definitely a better option. Will test it.

> Not sure the <current> label is all that useful. It seems misleading at
> best, because it's simply the last node that generated the explain. But
> we might have already moved to a different node. And that node may be
> stuck / very expensive, yet we'll see the plan as seemingly waiting in
> some other node.

Right. The idea here is that, for some specific cases, execution is
momentarily limited to a subpart of the plan. For example, if a HashJoin
is present, the whole set of nodes involved in the hash creation needs to
be complete before other operations are done. If we see a <current> as part
of that hash creation we know that the query is currently in that region,
regardless of which specific node it is.

This could become a feature implemented in an extension with the help of
a new hook inside ExplainNode(), if we come to the agreement that it is
not really needed out of the box.

> Thanks for your contribution! Such feature would be very useful.
> I did not look carefully about the implementation. I just wanted to let
> you know someone already posted a similar feature in this thread :
> Maybe it could give you some ideas.

Thanks Adrien! Will have a look.

Regards,

Rafael Castro.

Attachment Content-Type Size
v4-0001-Proposal-for-progressive-explains.patch application/octet-stream 59.0 KB

From: Rafael Thofehrn Castro <rafaelthca@gmail.com>
To: pgsql-hackers@postgresql.org
Subject: Re: Proposal: Progressive explain
Date: 2025-02-18 18:31:18
Message-ID: CAG0ozMqfQjO=xK-zwfvNGvpZnWtynB6iXxCShcH5t5vsYV7_LA@mail.gmail.com
Lists: pgsql-hackers

Hello all,

Sending a new version of the patch that includes important changes
addressing feedback provided by Greg and Tomas. So, including the previous
version (v4) sent on Jan 29, these are the highlights of what has changed:

- The progressive plan printed on a regular interval defined by
progressive_explain_timeout now uses timeouts. GUC
progressive_explain_sample_rate is removed.

- Objects allocated per plan print (Instrument and ExplainState) were
replaced by reusable objects allocated at query start (during the
progressive explain setup phase). So currently we are allocating only
2 objects for the complete duration of the feature. With that, removed
the temporary memory context that was being allocated per iteration.

- The progressive_explain GUC was changed from boolean to enum, accepting
values 'off', 'explain' and 'analyze'. This allows using instrumented
progressive explains for any query and not only the ones started via
EXPLAIN ANALYZE. If the GUC is set to 'explain' the plan will be printed
only once at query start. If set to 'analyze', instrumentation will be
enabled in QueryDesc and the detailed plan will be printed iteratively.
Considering that now we can enable instrumentation for regular queries,
added the following GUCs to control which instruments are enabled:
progressive_explain_buffers, progressive_explain_timing and
progressive_explain_wals.
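
At executor start this presumably translates into something like the
following (a sketch; the enum value spelling is assumed):

if (progressive_explain == PROGRESSIVE_EXPLAIN_ANALYZE)
{
    queryDesc->instrument_options |= INSTRUMENT_ROWS;
    if (progressive_explain_timing)
        queryDesc->instrument_options |= INSTRUMENT_TIMER;
    if (progressive_explain_buffers)
        queryDesc->instrument_options |= INSTRUMENT_BUFFERS;
    if (progressive_explain_wals)
        queryDesc->instrument_options |= INSTRUMENT_WAL;
}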

- Better handling of the shared memory space where plans are printed and
shared with other backends. In the previous version we had a shared hash
with elements holding all data related to progressive explains, including
the complete plan string:

typedef struct explainHashEntry
{
    explainHashKey key;         /* hash key of entry - MUST BE FIRST */
    int         pid;
    TimestampTz last_explain;
    int         explain_count;
    float       explain_duration;
    char        plan[];
} explainHashEntry;

The allocated size per element used to be defined by
progressive_explain_output_size, which would essentially control the
space available for plan[].

Greg raised the concern of PG having to allocate too much shared memory
at database start, considering that we need enough space for
max_connections + max_parallel_workers entries, and that is a totally
valid point.

So the new version takes advantage of DSAs. Each backend creates its own
DSA at query start (if progressive explain is enabled) where the shared
data is stored. That DSA is shared with other backends via the hash
structure through dsa_handle and dsa_pointer pointers:

typedef struct progressiveExplainHashEntry
{
    progressiveExplainHashKey key;  /* hash key of entry - MUST BE FIRST */
    dsa_handle  h;
    dsa_pointer p;
} progressiveExplainHashEntry;

typedef struct progressiveExplainData
{
    int         pid;
    TimestampTz last_print;
    char        plan[];
} progressiveExplainData;

That allows us to allocate areas of custom sizes for plan[]. The strategy
currently used is to allocate an initial space of the size of the initial
plan output + PROGRESSIVE_EXPLAIN_ALLOC_SIZE (currently 4096), which gives
PG enough room for subsequent iterations where the new string may increase
a bit, without having to reallocate space. The code checks sizes and will
reallocate if needed. With that, GUC progressive_explain_output_size was
removed.
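
To illustrate the read path, another backend would fetch a published plan
roughly as below (the hash table and lock names are illustrative; the DSA
calls are the standard API):

progressiveExplainHashKey key;
progressiveExplainHashEntry *entry;
bool        found;

key.pid = target_pid;           /* backend being inspected */

LWLockAcquire(ProgressiveExplainHashLock, LW_SHARED);
entry = hash_search(progressiveExplainHash, &key, HASH_FIND, &found);
if (found)
{
    /* map the owner's DSA and resolve the pointer to its payload */
    dsa_area   *area = dsa_attach(entry->h);
    progressiveExplainData *pe_data = dsa_get_address(area, entry->p);

    elog(LOG, "pid %d plan:\n%s", pe_data->pid, pe_data->plan);
    dsa_detach(area);
}
LWLockRelease(ProgressiveExplainHashLock);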

- Adjusted columns of pg_stat_progress_explain. Columns explain_count and
total_explain_time were removed. Column last_explain was renamed to
last_print. Column explain was renamed to query_plan, as this is the name
used by PG when a plan is printed with EXPLAIN.

Rafael.

Attachment Content-Type Size
v5-0001-Proposal-for-progressive-explains.patch application/octet-stream 68.3 KB

From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: Progressive explain
Date: 2025-02-18 20:47:58
Message-ID: CAG0ozMoaV_C1_Sf7QMjfCjUODL8U959gvYWPrVSjgvc2gcRyxQ@mail.gmail.com
Lists: pgsql-hackers

Fixed a corner case where pg_stat_progress_explain looks at its own plan.

The previous message in this thread contains all relevant implementation
details of the latest patch.


Attachment Content-Type Size
v6-0001-Proposal-for-progressive-explains.patch application/octet-stream 68.4 KB

From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: Progressive explain
Date: 2025-03-06 21:43:07
Message-ID: CAG0ozMo30smtXXOR8bSCbhaZAQHo4=ezerLitpERk85Q0ga+Fw@mail.gmail.com
Lists: pgsql-hackers

Hello,

I was recently benchmarking the last version of the patch and found room
for improvement when GUC progressive_explain is enabled globally.

Results with the last published version of the patch:

- progressive_explain = off:

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18249.363540 (without initial connection time)

- progressive_explain = 'explain':

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 3536.635125 (without initial connection time)

This is because progressive explains are being printed for every query,
including the ones that finish instantly.

If we think about it, those printed plans for instant queries are useless,
as other backends won't have a chance to look at the plans before they get
removed from pg_stat_progress_explain.

So this new version of the patch implements a new GUC,
progressive_explain_min_duration, to be used as a threshold for the plan to
be printed for the first time:

- progressive_explain_min_duration: min query duration until progressive
explain starts.
- type: int
- default: 1s
- min: 0
- context: user

Results with the new version:

- progressive_explain = off:

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18871.800242 (without initial connection time)

- progressive_explain = 'explain' and progressive_explain_min_duration = '5s':

/usr/local/pgsql/bin/pgbench -S -n -T 10 -c 30
tps = 18896.266538 (without initial connection time)

Implementation of the new GUC progressive_explain_min_duration was done with
timeouts. The timeout callback function is used to initialize the
progressive explain.
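
In sketch form (the handler name comes from the patch; the call sites are
simplified assumptions):

/* registered once per backend */
static TimeoutId ProgressiveExplainStartupTimeoutId;

ProgressiveExplainStartupTimeoutId =
    RegisterTimeout(USER_TIMEOUT, ProgressiveExplainStartupTimeoutHandler);

/* armed at executor start: the handler performs the progressive explain
 * setup once the query has been running this long */
enable_timeout_after(ProgressiveExplainStartupTimeoutId,
                     progressive_explain_min_duration);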

There is a catch to this implementation. In the thread
https://www.postgresql.org/message-id/flat/d68c3ae31672664876b22d2dcbb526d2%40postgrespro.ru
where torikoshia proposes logging of query plans, concerns were raised about
logging plans in the CFI, a sensitive part of the code. So torikoshia
implemented a smart workaround consisting of adjusting the execProcNode
wrapper of all nodes so that the plan printing can be done there.

I'm not sure if this same concern applies to timeout callbacks, so I also
implemented a second version of the latest patch that uses that execProcNode
wrapper strategy.

The wrapper code was implemented by torikoshia (torikoshia(at)oss(dot)nttdata(dot)com),
so I am adding the credits here.

Rafael.

Attachment Content-Type Size
v7-0001-Proposal-for-progressive-explains.patch application/octet-stream 75.0 KB
v7-0001-Proposal-for-progressive-explains-with-execprocnode-wrapper.patch.nocfbot application/octet-stream 80.4 KB

From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: torikoshia(at)oss(dot)nttdata(dot)com
Subject: Re: Proposal: Progressive explain
Date: 2025-03-07 20:34:32
Message-ID: CAG0ozMo4eaJJqqiHP7dHXm=u8vK-qqxo5Q1DVu=amHisZiX5_g@mail.gmail.com
Lists: pgsql-hackers

Did additional benchmarks and found issues with the patch that doesn't do
execProcNode wrapping. There are sporadic crashes with *double free or
corruption (top)*.

So I am making the patch that uses the wrapper the current one. Again,
giving credit to torikoshia as the owner of that section of the code.

Rafael.

Attachment Content-Type Size
v8-0001-Proposal-for-progressive-explains.patch application/octet-stream 80.0 KB

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "Rafael Thofehrn Castro" <rafaelthca(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: torikoshia(at)oss(dot)nttdata(dot)com
Subject: Re: Proposal: Progressive explain
Date: 2025-03-12 14:25:27
Message-ID: [email protected]
Lists: pgsql-hackers

On Fri, Mar 7, 2025, at 5:34 PM, Rafael Thofehrn Castro wrote:
> Did additional benchmarks and found issues with the patch that doesn't do execProcNode
> wrapping. There are sporadic crashes with *double free or corruption (top)*.
> So making the patch that uses the wrapper the current one. Again, giving the credits to
> torikoshia as being the owner of that section of the code.
>

Rafael, thanks for working on it. It is a step forward in observability. I
started with some performance tests and the last improvements seem to fix
the overhead imposed by the initial patch version. I didn't notice any of
these new functions in the perf report while executing fast queries.

I found a crash. It is simple to reproduce.

Session A:

select * from pg_stat_progress_explain;
\watch 2

Session B:

explain select pg_sleep(30);
\watch 2

8<--------------------------------------------------------------------8<

Backtrace:

Core was generated by `postgres: euler postgres [lo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
5401 if (ps->ExecProcNodeOriginal != NULL)
#0 WrapExecProcNodeWithExplain (ps=0x7f7f7f7f7f7f7f7f) at explain.c:5401
#1 0x00005624173829aa in handle_sig_alarm (postgres_signal_arg=<optimized out>) at timeout.c:414
#2 0x00005624173ba02c in wrapper_handler (postgres_signal_arg=14) at pqsignal.c:110
#3 <signal handler called>
#4 0x00007f20fa529e63 in epoll_wait (epfd=6, events=0x56244ef37e58, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#5 0x00005624171fb02f in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdd9e62080, cur_timeout=-1, set=0x56244ef37dd8) at waiteventset.c:1190
#6 WaitEventSetWait (set=0x56244ef37dd8, timeout=timeout(at)entry=-1, occurred_events=occurred_events(at)entry=0x7ffdd9e62080, nevents=nevents(at)entry=1, wait_event_info=wait_event_info(at)entry=100663296) at waiteventset.c:1138
#7 0x000056241709513c in secure_read (port=0x56244eeb90e0, ptr=0x56241775a9a0 <PqRecvBuffer>, len=8192) at be-secure.c:218
#8 0x000056241709bf2e in pq_recvbuf () at pqcomm.c:924
#9 0x000056241709ceb5 in pq_getbyte () at pqcomm.c:970
#10 0x000056241721b617 in SocketBackend (inBuf=0x7ffdd9e622a0) at postgres.c:361
#11 ReadCommand (inBuf=0x7ffdd9e622a0) at postgres.c:484
#12 PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4625
#13 0x00005624172167ed in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:107
#14 0x000056241717519b in postmaster_child_launch (child_type=<optimized out>, child_slot=2, startup_data=startup_data(at)entry=0x7ffdd9e6253c, startup_data_len=startup_data_len(at)entry=4, client_sock=client_sock(at)entry=0x7ffdd9e62540) at launch_backend.c:274
#15 0x0000562417178c32 in BackendStartup (client_sock=0x7ffdd9e62540) at postmaster.c:3519
#16 ServerLoop () at postmaster.c:1688
#17 0x000056241717a6da in PostmasterMain (argc=argc(at)entry=1, argv=argv(at)entry=0x56244eeb81b0) at postmaster.c:1386
#18 0x0000562416e64f9a in main (argc=1, argv=0x56244eeb81b0) at main.c:230

8<--------------------------------------------------------------------8<

You call this feature "progressive explain". My first impression is that it
will only provide the execution plans for EXPLAIN commands. Instead of
"progressive explain", I would suggest "query progress" that is a general
database terminology. It seems natural to use "progressive explain" since you
are using the explain infrastructure (including the same options -- format,
settings, wal, ...) -- to print the execution plan.

+CREATE VIEW pg_stat_progress_explain AS
+ SELECT
+ *
+ FROM pg_stat_progress_explain(true);
+

There is no use for the function argument. If you decide to keep this
function, remove the argument.

Why don't you use the pgstat_progress_XXX() API? Since you are using a
pg_stat_progress_XXX view name I would expect using the command progress
reporting infrastructure (see backend_progress.c).
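
For reference, that infrastructure is used like this in existing commands
(a generic sketch borrowing ANALYZE's constants; progressive explain would
need its own command type and parameter slots):

/* at command start */
pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE, InvalidOid);

/* whenever a counter advances */
pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE, blocks_done);

/* at command end */
pgstat_progress_end_command();

Those are fixed int64 slots per backend, though, which may be why a full
plan string doesn't fit there naturally.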

Maybe you could include datid and datname as the other progress reporting
views. It would avoid a join to figure out what the database is.

+static const struct config_enum_entry explain_format_options[] = {
+ {"text", EXPLAIN_FORMAT_TEXT, false},
+ {"xml", EXPLAIN_FORMAT_XML, false},
+ {"json", EXPLAIN_FORMAT_JSON, false},
+ {"yaml", EXPLAIN_FORMAT_YAML, false},
+ {NULL, 0, false}
+};

Isn't it the same definition as in auto_explain.c? Use only one definition
for auto_explain and this feature. You can put this struct into explain.c,
use an extern definition for guc_tables.c and put an extern PGDLLIMPORT
definition into guc.h. See wal_level_options, for an example.
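
A sketch of that refactoring:

/* explain.c: single shared definition */
const struct config_enum_entry explain_format_options[] = {
    {"text", EXPLAIN_FORMAT_TEXT, false},
    {"xml", EXPLAIN_FORMAT_XML, false},
    {"json", EXPLAIN_FORMAT_JSON, false},
    {"yaml", EXPLAIN_FORMAT_YAML, false},
    {NULL, 0, false}
};

/* guc.h: declaration shared by guc_tables.c and auto_explain.c */
extern PGDLLIMPORT const struct config_enum_entry explain_format_options[];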

+static const struct config_enum_entry progressive_explain_options[] = {
+ {"off", PROGRESSIVE_EXPLAIN_NONE, false},
+ {"explain", PROGRESSIVE_EXPLAIN_EXPLAIN, false},
+ {"analyze", PROGRESSIVE_EXPLAIN_ANALYZE, false},
+ {"false", PROGRESSIVE_EXPLAIN_NONE, true},
+ {NULL, 0, false}
+};

The "analyze" is a separate option in auto_explain. Should we have 2 options?
One that enable/disable this feature and another one that enable/disable
analyze option.

Don't the other EXPLAIN options make sense here? Like serialize and summary.

TupleDesc tupDesc; /* descriptor for result tuples */
EState *estate; /* executor's query-wide state */
PlanState *planstate; /* tree of per-plan-node state */
+ struct ExplainState *pe_es; /* progressive explain state if enabled */

Should you use the same name pattern here? pestate, for example.

PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ExplainHash)

Could you use a specific name? Even if you keep the proposed name, you should
use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.

+$node->init;
+# Configure progressive explain to be logged immediatelly
+$node->append_conf('postgresql.conf', 'progressive_explain_min_duration = 0');
+$node->start;

s/immediatelly/immediately/

+typedef struct progressiveExplainHashKey
+{
+ int pid; /* PID */
+} progressiveExplainHashKey;
+
+typedef struct progressiveExplainHashEntry
+{
+ progressiveExplainHashKey key; /* hash key of entry - MUST BE FIRST */
+ dsa_handle h;
+ dsa_pointer p;
+} progressiveExplainHashEntry;

You don't need progressiveExplainHashKey. Use pid as key directly.

+ /* Update shared memory with new data */
+ strcpy(pe_data->plan, es->str->data);
+ pe_data->last_print = GetCurrentTimestamp();

I don't think last_print is accurate because it is not the time the execution plan
is printed but the time it was updated. I suggest last_updated_time.

+/* Flag set by timeouts to control when to print progressive explains */
+bool ProgressiveExplainPending = false;

s/print/update/

There are other points where you use "print", but it is better to replace it
with "update".

+ progressiveExplainArray = ShmemInitHash("progressive explain hash",
+ 50, 50,
+ &info,
+ HASH_ELEM | HASH_BLOBS);

I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
better name.

+ entry->p = dsa_allocate(es->pe_a,
+ add_size(sizeof(progressiveExplainData),
+ add_size(strlen(es->str->data),
+ PROGRESSIVE_EXPLAIN_ALLOC_SIZE)));

I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because it
doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
PROGRESSIVE_EXPLAIN_AVAIL_SIZE?

Maybe you can use dshash.

There are some comments that still refer to the wrong function name.

+/*
+ * ExecProcNodeWithExplain
+ * ExecProcNode wrapper that initializes progressive explain
+ * and uwraps ExecProdNode to the original function.
+ */
+static TupleTableSlot *
+ExecProcNodeExplain(PlanState *node)

and

+/*
+ * ExecProcNodeWithExplain
+ * Responsible for initialization of all structures related to progressive
+ * explains.
+ *

+ /* state related to progressive explains */
+ struct PlanState *pe_curr_node;
+ struct Instrumentation *pe_local_instr;
+ dsa_area *pe_a;

Could you add some comments saying what each of these variables is for?

I didn't experiment but I was wondering if there is a way to avoid the
duplicates that you added to avoid the overhead.

--
Euler Taveira
EDB https://www.enterprisedb.com/


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Euler Taveira <euler(at)eulerto(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, torikoshia(at)oss(dot)nttdata(dot)com
Subject: Re: Proposal: Progressive explain
Date: 2025-03-14 15:36:34
Message-ID: CAG0ozMpwRQFNraWLGVa0zAFBYuY1HasZ6NX4BAM4H3OOrQHtDg@mail.gmail.com
Lists: pgsql-hackers

Thanks for the valuable input, Euler. Adjusted most of your recommendations.

> I found a crash. It is simple to reproduce.

Indeed, I failed to test plain EXPLAIN after the addition of the new GUC
progressive_explain_min_duration. This is fixed.

> You call this feature "progressive explain". My first impression is that it
> will only provide the execution plans for EXPLAIN commands. Instead of
> "progressive explain", I would suggest "query progress", which is general
> database terminology. It seems natural to use "progressive explain" since
> you are using the explain infrastructure (including the same options --
> format, settings, wal, ...) to print the execution plan.

Makes sense. Kept progressive explain for now but this is still open for
discussion.

> There is no use for the function argument. If you decide to keep this
> function, remove the argument.

Done.

> Why don't you use the pgstat_progress_XXX() API? Since you are using a
> pg_stat_progress_XXX view name I would expect using the command progress
> reporting infrastructure (see backend_progress.c).

I haven't changed that part as of now. My implementation and underlying data
structure may not work well with that API, but I am investigating.

> Maybe you could include datid and datname as the other progress reporting
> views. It would avoid a join to figure out what the database is.

Done.

> Isn't it the same definition as in auto_explain.c? Use only one definition
> for auto_explain and this feature. You can put this struct into explain.c,
> use an extern definition for guc_tables.c and put an extern PGDLLIMPORT
> definition into guc.h. See wal_level_options, for an example.

Done.

> The "analyze" is a separate option in auto_explain. Should we have 2
options?
> One that enable/disable this feature and another one that enable/disable
> analyze option.

Tomas Vondra proposed the current logic and I think it makes sense. Having a
single GUC to control the execution behavior keeps the feature simpler IMO.

> Don't the other EXPLAIN options make sense here? Like serialize and summary.

I added a missing GUC for option COSTS (progressive_explain_costs). Adding
the other ones doesn't make much sense IMO. SUMMARY, SERIALIZE and MEMORY are
information added at the end of the query execution (or plan creation, for
plain EXPLAIN) in the summary section, but at that point the progressive
explain will already be finished, with no more information in
pg_stat_progress_explain.

> TupleDesc tupDesc; /* descriptor for result tuples */
> EState *estate; /* executor's query-wide state */
> PlanState *planstate; /* tree of per-plan-node state */
> + struct ExplainState *pe_es; /* progressive explain state if enabled */

> Should you use the same name pattern here? pestate, for example.

Done.

> PG_LWLOCK(52, SerialControl)
> +PG_LWLOCK(53, ExplainHash)

> Could you use a specific name? Even if you keep the proposed name, you should
> use ProgressiveExplainHash, ProgressiveExplain or QueryProgress.

Changed to ProgressiveExplainHash.

> You don't need progressiveExplainHashKey. Use pid as key directly.

Done.

> I don't think last_print is accurate because it is not the time the execution
> plan is printed but the time it was updated. I suggest last_updated_time.

Changed from last_print to last_update. This is still open for discussion.

> I'm wondering why you use "array" in the name. ProgressiveExplainHash is a
> better name.

The name was chosen to be consistent with ProcArray (which is also a hash).
But what you proposed is indeed better. Changed.

> I think you need a better name for PROGRESSIVE_EXPLAIN_ALLOC_SIZE because it
> doesn't reflect what it is. PROGRESSIVE_EXPLAIN_FREE_SIZE or
> PROGRESSIVE_EXPLAIN_AVAIL_SIZE?

Changed to PROGRESSIVE_EXPLAIN_FREE_SIZE.

Fixed the wrong function names in the comments and changed the format of
those comments in function headers to be compatible with other functions in
explain.c.

> + /* state related to progressive explains */
> + struct PlanState *pe_curr_node;
> + struct Instrumentation *pe_local_instr;
> + dsa_area *pe_a;

> Could you add some comments saying what each of these variables are for?

Done.

> I didn't experiment but I was wondering if there is a way to avoid the
> duplicates that you added to avoid the overhead.

You mean the local instrumentation object reused for each node?

Rafael.

Attachment Content-Type Size
v9-0001-Proposal-for-progressive-explains.patch application/octet-stream 84.7 KB

From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-19 17:46:56
Message-ID: CAG0ozMrZ+dG9TA+X54Tjwxm76O2_u97_Yvhdf3ixx+0caGF4Mg@mail.gmail.com
Lists: pgsql-hackers

Sending a new version as rebase was required.

Rafael.

Attachment Content-Type Size
v10-0001-Proposal-for-progressive-explains.patch application/octet-stream 84.7 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-19 20:35:24
Message-ID: CA+TgmoaD985+VLwR93c8PjSaoBqxw72Eu7pfBJcArzhjJ71aRw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 19, 2025 at 1:47 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> Sending a new version as rebase was required.

Reading this thread, it seems to me that there has been a good deal of
discussion about things like the name of the feature and what the UI
ought to be, but not much discussion of whether the feature is
actually safe, and not much detailed review of the code. I'm willing
to bet that everybody wants some version of this feature if we can
convince ourselves that it's not going to do horrible things like
cause server crashes, and that even people who don't get their first
choice in terms of how the feature is named or how the GUCs work will
still be pretty happy to have it overall. However, if it breaks stuff
and there's no easy way to fix the breakage, that's going to be a
problem.

In broad strokes, the danger here is that doing stuff in the middle of
query execution that we currently only do at the end of query
execution will turn out to be problematic. The biggest problem, I
think, is whether it's safe to do all of the things that EXPLAIN does
while we're at some random point in query execution. It looks to me
like the wrap/unwrap stuff is more-or-less consistent with previous
discussions of how a feature of this kind should work, though I don't
recall the previous discussion and I think the patch should contain
some comments about why it works the way that it works. I do notice
that WrapExecProcNodeWithExplain does not walk the ps->initPlan list,
which I think is an oversight.

Without having the prior discussion near to hand, I *think* that the
reason we wanted to do this wrap/unwrap stuff is to make it so that
the progressive EXPLAIN code could only execute when entering a new
plan node rather than at any random point inside of that plan node,
and that does seem a lot safer than the alternative. For example, I
think it means that we won't start trying to do progressive EXPLAIN
while already holding some random LWLock, which is very good. However,
this still means we could do a bunch of catalog access (and thus
potentially receive invalidations) at places where that can't happen
today. I'm not sure that's a problem -- the same thing could happen at
any place in the executor where we evaluate a user-supplied
expression, and there are many such places -- but it's certainly worth
a few senior people thinking real carefully about it and trying to
imagine whether there's any scenario in which it might break something
that works today.

One way in which this proposal seems safer than previous proposals is
that previous proposals have involved session A poking session B and
trying to get session B to emit an EXPLAIN on the fly with no prior
setup. That would be very useful, but I think it's more difficult and
more risky than this proposal, where all the configuration happens in
the session that is going to emit the EXPLAIN output. It knows from
the beginning that it's going to maybe be doing this, and so it can do
whatever setup it likes to accomplish that goal. So I think this
design looks pretty good from that point of view.

I don't understand how this would be safe against interrupts or
errors. If a running query is interrupted, what would cause
ProgressiveExplainCleanup() to be called? If the answer is "nothing,"
why isn't that unsafe?

ExecProcNodeOriginal looks like it's basically the same thing as
ExecProcNodeReal, except that it's for a different purpose. But you
would be hard-pressed to guess which one is used for what purpose
based on the field names or the comments. Maybe it's also worth
worrying about whether this is a scalable design. Can we find a way to
use the existing fields here instead of having to add a new one?

The documentation for the progressive_explain = { off | explain |
analyze } option seems like it should go into more detail about how
the "explain" and "analyze" values are different. I'm not 100% sure I
know the answer, and I'm not the least-experienced person who will
ever read this documentation.

WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It
invites confusion with MultiExecProcNode, to which it is unrelated.

I just went to some trouble to start breaking up the monolith that is
explain.c, so I'm not that happy about seeing this patch dump another
800+ lines of source code into that file. Probably we should have a
new source file for some of this, or maybe even more than one.

The changes to explain.h add three new data types. Two of them begin
with an uppercase letter and one with a lowercase letter. That seems
inconsistent. I also don't think that the declaration of char plan[]
is per PostgreSQL coding style. I believe we always write char
plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something
other than plan, since it's really a string-ified explain-y kind of
thing, not literally a Plan. Also, can we please not have structure
members with single letter names? "h" and "p" are going to be
completely ungreppable, and I like grepping.

It looks very strange to me that ProgressiveExplainPrint() seems to
have a theoretical possibility of generating the output and then
throwing it away if we end up with entry == NULL. I guess maybe that
case is not supposed to happen because ProgressiveExplainInit() is
supposed to create the entry, but then why isn't this an elog(ERROR)
or something instead of a no-op?

It seems like when we replace a longer entry with a shorter one, we
forget that it was originally longer. Hence, if the length of a
progressive EXPLAIN is alternately 2922 characters and 2923
characters, we'll reallocate on every other progressive EXPLAIN
instead of at most once.

I'll try to look at this some more tomorrow. It seems like a very
interesting patch, but time is very short for this release and it
doesn't look to me like we have all the kinks sorted out here just
yet.

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-19 22:53:05
Message-ID: CAG0ozMqcsZFmaeodUA9tYbbZsYJFYgJVYxTpa=ggee+ShvWQCA@mail.gmail.com
Lists: pgsql-hackers

Hi Robert,

Thanks for sparing part of your precious time to look at the patch.
I acknowledge it is a very complex one. Since you're going to take
another look, I'm providing some preliminary comments related to some
of the implementation concerns.

> I don't understand how this would be safe against interrupts or
> errors. If a running query is interrupted, what would cause
> ProgressiveExplainCleanup() to be called? If the answer is "nothing,"
> why isn't that unsafe?

The strategy I used here is to use a MemoryContextCallback
(ProgressiveExplainReleaseFunc), configured in the memory context
where the query is being executed, which is responsible for calling
ProgressiveExplainCleanup() if the query doesn't end gracefully.
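
Roughly, the registration looks like this (a simplified sketch;
MemoryContextRegisterResetCallback is the standard API):

static void
ProgressiveExplainReleaseFunc(void *arg)
{
    /* runs when the query's memory context is reset or deleted,
     * which includes error cleanup paths */
    ProgressiveExplainCleanup((QueryDesc *) arg);
}

/* during progressive explain setup: allocate the callback in the query
 * context so it lives exactly as long as the query does */
MemoryContextCallback *cb =
    MemoryContextAlloc(queryDesc->estate->es_query_cxt,
                       sizeof(MemoryContextCallback));

cb->func = ProgressiveExplainReleaseFunc;
cb->arg = queryDesc;
MemoryContextRegisterResetCallback(queryDesc->estate->es_query_cxt, cb);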

> It looks very strange to me that ProgressiveExplainPrint() seems to
> have a theoretical possibility of generating the output and then
> throwing it away if we end up with entry == NULL. I guess maybe that
> case is not supposed to happen because ProgressiveExplainInit() is
> supposed to create the entry, but then why isn't this an elog(ERROR)
> or something instead of a no-op?

Agreed. Will fix this.

> It seems like when we replace a longer entry with a shorter one, we
> forget that it was originally longer. Hence, if the length of a
> progressive EXPLAIN is alternately 2922 characters and 2923
> characters, we'll reallocate on every other progressive EXPLAIN
> instead of at most once.

Are you talking about re-printing the plan in the same query execution?
The logic for the code, using your example, would be to allocate 2922 +
PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If next plans
alternate between 2922 and 2923 no additional allocation will be done.
A reallocation will be needed only if the plan length ends up exceeding
2922+4096. At the end of query execution (or cancellation) that DSA will
be freed and a next query execution will have to allocate again using the
same logic.

Regarding the execProcNode wrapper strategy: I used it precisely because
of the discussion in that other patch. I actually tried not using it here,
calling ProgressiveExplainPrint() in the timeout callback instead. This
resulted in sporadic crashes, confirming the suspicion that it is not a
good idea.

Regarding all other comments related to variable/function names and having
the feature in a separate file, I agree with all of them. Will send a new
version with the fixes.

Rafael.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-20 20:06:59
Message-ID: CA+TgmoZw+xy3WyrPD4X5ChP+=kJ4ncRZY8xQu0XfChDa4fa7yg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 19, 2025 at 6:53 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> The strategy I used here is to use a MemoryContextCallback
> (ProgressiveExplainReleaseFunc), configured in the memory context
> where the query is being executed, being responsible for calling
> ProgressiveExplainCleanup() if the query doesn't end gracefully.

Thanks for the pointer. I'm a bit skeptical about what's going on here
in ProgressiveExplainReleaseFunc(). It seems like we're depending on
shared memory to tell us whether we need to do purely backend-local
cleanup, like calling disable_timeout() and resetting
ProgressiveExplainPending and activeQueryDesc. I would have expected
us to keep track in local memory of whether this kind of work needs to
be done. It seems roundabout to take an LWLock, do a hash table lookup
to see if there's an entry there, release the LWLock, and then very
shortly thereafter take the lock a second time to release the entry
that we now know is there.

The comment in ProgressiveExplainCleanup about only detaching from the
DSA if the query ended gracefully is not ideal from my point of view
because it says what the code does instead of why the code does that
thing. Also, the function is seemingly called with queryDesc as an
argument not because you need it for anything but because you're going
to test whether it's null. In that case, you could just pass a
Boolean. Even then, something seems odd about this: why do we have to
be holding ProgressiveExplainHashLock to dsa_detach the
somewhat-inscrutably named area pe_a? And why are we not detaching it
in case of error?

I am wondering why you chose this relatively unusual error cleanup
strategy. What I would have done is invent AtEOXact_ProgressiveExplain
and AtSubEOXact_ProgressiveExplain. In some sense this looks simpler,
because it doesn't need separate handling for transactions and
subtransactions, but it's so unlike what we do in most places that
it's hard for me to tell whether it's correct. I feel like the
approach you've chosen here would make sense if what we wanted to do
was basically release memory or some memory-like resource associated
closely with the context -- e.g. expandedrecord.c releases a
TupleDesc, but this is doing more than that.

I think the effect of this choice is that cleanup of the
progressive-EXPLAIN stuff happens much later than it normally would.
Most of the time, in the case of an abort, we would want
AbortTransaction() to release as many resources as possible, leaving
basically no work to do at CleanupTransaction() time. This is so that
if a user types "SELECT 1/0;" we release resources, as far as
possible, right away, and don't wait for them to subsequently type
"ROLLBACK;". The transaction lives on only as a shell. But these
resources, if I'm reading this right, are going to stick around until
the transaction is actually rolled back, because memory is not freed
until CleanupTransaction() time. I wonder what happens if a query
inside of an explicit transaction aborts after putting an entry in the
progressive-explain view. My guess is that the entry will stick around
until the actual rollback happens.

In fact, now that I think about it, this is probably why we don't
dsa_detach() in ProgressiveExplainCleanup() in cases of error -- the
resource owner cleanup will have already released the DSA segments
long before the memory context is deleted.

I'm sort of inclined to say that this should be rewritten to do error
cleanup in a more normal way. It's probably more code to do it that
way, but I think having it be more similar to other subsystems is
probably worth quite a bit.

> > It seems like when we replace a longer entry with a shorter one, we
> > forget that it was originally longer. Hence, if the length of a
> > progressive EXPLAIN is alternately 2922 characters and 2923
> > characters, we'll reallocate on every other progressive EXPLAIN
> > instead of at most once.
>
> Are you talking about re-printing the plan in the same query execution?

Yes.

> The logic for the code, using your example, would be to allocate 2922 +
> PROGRESSIVE_EXPLAIN_FREE_SIZE (4096, currently) initially. If next plans
> alternate between 2922 and 2923 no additional allocation will be done.
> A reallocation will be needed only if the plan length ends up exceeding
> 2922+4096. At the end of query execution (or cancellation) that DSA will
> be freed and a next query execution will have to allocate again using the
> same logic.

It seems to me that if ProgressiveExplainPrint() reaches /* Update
shared memory with new data */ without reallocating,
strlen(pe_data->plan) can be reduced. On the next trip through the
function, we don't know whether the string we're seeing is the
original string -- for which strlen()+PROGRESSIVE_EXPLAIN_FREE_SIZE)
gives us the original allocation size -- or whether the string we're
seeing is a shorter one that was copied over the original, longer
string. PROGRESSIVE_EXPLAIN_FREE_SIZE is big enough that this probably
isn't much of a problem in practice, because consecutive EXPLAIN
outputs for the same query probably won't vary in size by enough to
cause any reallocation. Looking at this more carefully, I think that
the query plan would have to shrink in size by >4kB and then expand
again in order to trigger reallocation, which seems unlikely. But it
still seems unclean to me. Normally, we track how much space we have
actually allocated explicitly, instead of reconstructing that number
from something else, especially something that isn't guaranteed to
produce an accurate result. I think you should just store the number
of available payload bytes at the beginning of the chunk and then
reallocate if it isn't big enough to hold the payload that you have
got.
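
In sketch form (the length field is invented for illustration):

typedef struct ProgressiveExplainData
{
    int         pid;
    TimestampTz last_print;
    Size        plan_alloc_len; /* payload bytes actually allocated */
    char        plan[FLEXIBLE_ARRAY_MEMBER];
} ProgressiveExplainData;

/* then, on each update: */
if (strlen(es->str->data) + 1 > pe_data->plan_alloc_len)
{
    /* dsa_free() the old chunk and dsa_allocate() a larger one,
     * recording the new payload size in plan_alloc_len */
}
else
    strcpy(pe_data->plan, es->str->data);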

> Regarding the execProcNode wrapper strategy. It used it precisely because
> of the discussion in that other patch. I actually tried not using it here,
> and call ProgressiveExplainPrint() in the timeout callback. This resulted
> in sporadic crashes, confirming the suspicion that it is not a good
> idea.

Makes sense, but we need adequate documentation of what we did and why
it works (or at least why we think it works).

Another thing I just noticed is that pg_stat_progress_explain() uses
beentry->st_procpid == ped->pid as the permissions check, but a more
typical test is HAS_PGSTAT_PERMISSIONS(beentry->st_userid). I know
that's only in pgstatfuncs.c, but I think it would be OK to duplicate
the associated test here (i.e. has_privs_of_role(GetUserId(),
ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role)). I
don't really see a reason why this shouldn't use the same permission
rules as other pg_stat_ things, in particular pg_stat_get_activity().
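
Spelled out, that would be something like:

/* mirror pg_stat_get_activity()'s visibility rule */
if (!has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS) &&
    !has_privs_of_role(GetUserId(), beentry->st_userid))
    continue;           /* caller may not see this entry */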

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-26 00:52:03
Message-ID: CAG0ozMpjEFThDGNqSoFTj1v_cpEs0rRQw5moVKU6Oy88oQ=Yhw@mail.gmail.com
Lists: pgsql-hackers

Hello Robert,

Fixed most of the recommendations. Going over one at a time.

> The documentation for the progressive_explain = { off | explain |
> analyze } option seems like it should go into more detail about how
> the "explain" and "analyze" values are different. I'm not 100% sure I
> know the answer, and I'm not the least-experienced person who will
> ever read this documentation.

Added details about behavior of each option. In the doc for that GUC
there is a link to another section that explains in detail how progressive
explains work.

> WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It
> invites confusion with MultiExecProcNode, to which it is unrelated.

Renamed that function (and the unwrap equivalent) to
WrapMemberNodesExecProcNodesWithExplain, as MemberNodes is the name used by
explain when parsing those types of nodes.

> I do notice that WrapExecProcNodeWithExplain does not walk
> the ps->initPlan list, which I think is an oversight.

Fixed.

> I just went to some trouble to start breaking up the monolith that is
> explain.c, so I'm not that happy about seeing this patch dump another
> 800+ lines of source code into that file. Probably we should have a
> new source file for some or this, or maybe even more than one.

The whole progressive explain code was moved to a new set of files
explain_progressive.h and explain_progressive.c.

> The changes to explain.h add three new data types. Two of them begin
> with an uppercase letter and one with a lowercase letter. That seems
> inconsistent. I also don't think that the declaration of char plan[]
> is per PostgreSQL coding style. I believe we always write char
> plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something
> other than plan, since it's really a string-ified explain-y kind of
> thing, not literally a Plan. Also, can we please not have structure
> members with single letter names? "h" and "p" are going to be
> completely ungreppable, and I like grepping

Done. Adjusted all data types to be uppercase. Added FLEXIBLE_ARRAY_MEMBER
and renamed "h" and "p".

> It looks very strange to me that ProgressiveExplainPrint() seems to
> have a theoretical possibility of generating the output and then
> throwing it away if we end up with entry == NULL. I guess maybe that
> case is not supposed to happen because ProgressiveExplainInit() is
> supposed to create the entry, but then why isn't this an elog(ERROR)
> or something instead of a no-op?

Changed to throw ERROR, which shouldn't happen.

> It seems like when we replace a longer entry with a shorter one, we
> forget that it was originally longer. Hence, if the length of a
> progressive EXPLAIN is alternately 2922 characters and 2923
> characters, we'll reallocate on every other progressive EXPLAIN
> instead of at most once.

Adjusted this logic. Structure ProgressiveExplainHashEntry now contains
a field to store the allocated size, which is used to compare with
the new plan being printed.

> Thanks for the pointer. I'm a bit skeptical about what's going on here
> in ProgressiveExplainReleaseFunc(). It seems like we're depending on
> shared memory to tell us whether we need to do purely backend-local
> cleanup, like calling disable_timeout() and resetting
> ProgressiveExplainPending and activeQueryDesc. I would have expected
> us to keep track in local memory of whether this kind of work needs to
> be done. It seems roundabout to take an LWLock, do a hash table lookup
> to see if there's an entry there, release the LWLock, and then very
> shortly thereafter take the lock a second time to release the entry
> that we now know is there.

> The comment in ProgressiveExplainCleanup about only detaching from the
> DSA if the query ended gracefully is not ideal from my point of view
> because it says what the code does instead of why the code does that
> thing. Also, the function is seemingly called with queryDesc as an
> argument not because you need it for anything but because you're going
> to test whether it's null. In that case, you could just pass a
> Boolean. Even then, something seems odd about this: why do we have to
> be holding ProgressiveExplainHashLock to dsa_detach the
> somewhat-inscrutably named area pe_a? And why are we not detaching it
> in case of error?

> I am wondering why you chose this relatively unusual error cleanup
> strategy. What I would have done is invent AtEOXact_ProgressiveExplain
> and AtSubEOXact_ProgressiveExplain. In some sense this looks simpler,
> because it doesn't need separate handling for transactions and
> subtransactions, but it's so unlike what we do in most places that
> it's hard for me to tell whether it's correct. I feel like the
> approach you've chosen here would make sense if what we wanted to do
> was basically release memory or some memory-like resource associated
> closely with the context -- e.g. expandedrecord.c releases a
> TupleDesc, but this is doing more than that.

> I think the effect of this choice is that cleanup of the
> progressive-EXPLAIN stuff happens much later than it normally would.
> Most of the time, in the case of an abort, we would want
> AbortTransaction() to release as many resources as possible, leaving
> basically no work to do at CleanupTransaction() time. This is so that
> if a user types "SELECT 1/0;" we release resources, as far as
> possible, right away, and don't wait for them to subsequently type
> "ROLLBACK;". The transaction lives on only as a shell. But these
> resources, if I'm reading this right, are going to stick around until
> the transaction is actually rolled back, because memory is not freed
> until CleanupTransaction() time. I wonder what happens if a query
> inside of an explicit transaction aborts after putting an entry in the
> progressive-explain view. My guess is that the entry will stick around
> until the actual rollback happens.

> In fact, now that I think about it, this is probably why we don't
> dsa_detach() in ProgressiveExplainCleanup() in cases of error -- the
> resource owner cleanup will have already released the DSA segments
> long before the memory context is deleted.

> I'm sort of inclined to say that this should be rewritten to do error
> cleanup in a more normal way. It's probably more code to do it that
> way, but I think having it be more similar to other subsystems is
> probably worth quite a bit.

Right. I use dsa_detach() only when the query finishes gracefully. This
is needed, otherwise PG will complain with:

WARNING: resource was not closed: dynamic shared memory segment 32349774

That WARNING doesn't appear in case of error, and according to my tests
shared memory is correctly being released.

I completely removed the MemoryContextCallback strategy and switched to
using the new function AtEOXact_ProgressiveExplain() called from
AbortTransaction().

This first version of the progressive explain feature was designed to only
keep track of the initial query called by the backend, ignoring all subquery
calls. So I believe we don't need to worry about having to add custom logic
in AbortSubTransaction(). If the query errors out, AbortTransaction() will
be called and everything related to progressive explains will be cleaned up.
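
As a simplified sketch, the new cleanup entry point is essentially:

/* called from AbortTransaction() to drop any published plan */
void
AtEOXact_ProgressiveExplain(void)
{
    if (activeQueryDesc != NULL)
        ProgressiveExplainCleanup(NULL);
}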

> Another thing I just noticed is that pg_stat_progress_explain() uses
> beentry->st_procpid == ped->pid as the permissions check, but a more
> typical test is HAS_PGSTAT_PERMISSIONS(beentry->st_userid). I know
> that's only in pgstatfuncs.c, but I think it would be OK to duplicate
> the associated test here (i.e. has_privs_of_role(GetUserId(),
> ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role)). I
> don't really see a reason why this shouldn't use the same permission
> rules as other pg_stat_ things, in particular pg_stat_get_activity().

Adjusted the logic to use has_privs_of_role(GetUserId(),
ROLE_PG_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role) when
checking the privileges.

Rafael.

Attachment Content-Type Size
v11-0001-Proposal-for-progressive-explains.patch application/octet-stream 88.9 KB

From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-26 13:10:39
Message-ID: CAG0ozMq=v-dMes4Jjy1GtJ3oLZNq5axVexn3sDGirDTG-dkjgQ@mail.gmail.com
Lists: pgsql-hackers

Sending a new version that includes the new explain_progressive.c file in
meson.build.

Rafael.

Attachment Content-Type Size
v12-0001-Proposal-for-progressive-explains.patch application/octet-stream 89.4 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-26 13:44:43
Message-ID: CA+TgmoZ51Hc7o_vFo8rbPqkLpSVzH6i=ccPrt2mHBPyk=zMxwQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 25, 2025 at 8:52 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> This first version of the progressive explain feature was designed to only keep
> track of initial query called by the backend, ignoring all subquery calls. So
> I believe we don't need to worry about having to add custom logic in
> AbortSubTransaction(). In case query errors out AbortTransaction() will be called
> and everything related to progressive explains will be cleaned.

Suppose:

BEGIN;
SELECT 1;
SAVEPOINT bob;
progressively explain something that aborts

I think in this case we will call AbortSubTransaction(), not AbortTransaction().

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-26 21:57:54
Message-ID: CAG0ozMq48+ODFoLW4HHexN8BKHfh_4c-n2b=yH9oaUnCZNZj6Q@mail.gmail.com
Lists: pgsql-hackers

> Suppose:
> BEGIN;
> SELECT 1;
> SAVEPOINT bob;
> progressively explain something that aborts
> I think in this case we will call AbortSubTransaction(), not
> AbortTransaction().

Indeed. We need special treatment for subtransactions. There are 2 scenarios
where AbortSubTransaction() will be called alone and can affect progressive
explains: savepoints and PL/pgSQL exception blocks.

We don't want subxact aborts in PL/pgSQL exception blocks to perform cleanup,
as the original query will continue to run and progressive explain tracking
is still applicable there.

So the implementation was done based on transaction nesting level. Cleanup
is only performed when AbortSubTransaction() is called at the same
transaction nesting level as the one where the query is running. This covers
both PL/pgSQL exception blocks and savepoints.

Rafael.

Attachment Content-Type Size
v13-0001-Proposal-for-progressive-explains.patch application/octet-stream 90.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-27 18:26:40
Message-ID: CA+TgmobrzeDep+Z1BPQqGNsCqTQ8M58wNNKJB_8Lwpwbqbz3GQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 26, 2025 at 5:58 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> So implementation was done based on transaction nested level. Cleanup is only
> performed when AbortSubTransaction() is called in the same transaction nested
> level as the one where the query is running. This covers both PL/pgSQL exception
> blocks and savepoints.

Right. I think this is much closer to being correct. However, I
actually think it should look more like this:

void
AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
{
    if (activeQueryDesc != NULL &&
        activeQueryXactNestLevel >= nestDepth)
    {
        if (isCommit)
            elog(WARNING, "leaked progressive explain query descriptor");
        ProgressiveExplainCleanup(NULL);
    }
}

By including the is-commit case in there, we can catch any bugs where
things aren't cleaned up properly before a transaction is committed.
We generally want to test >= nestDepth instead of == nestDepth in case
multiple subtransaction levels abort all at once; I'm not sure it
matters here, but even if it isn't, it's best to be consistent with
the practice elsewhere. Having {Commit,Abort}SubTransaction pass the
nestDepth instead of calling GetCurrentTransactionNestLevel() also has
precedent e.g. see AtEOSubXact_HashTables.

As a further refinement, consider initializing
activeQueryXactNestLevel to -1 and resetting it to that value each
time you end a progressive EXPLAIN, so that activeQueryDesc != NULL if
and only if activeQueryXactNestLevel >= 0. Then the test above can be
simplified to just if (activeQueryXactNestLevel >= nestDepth).

The comment for ProgressiveExplainIsActive() is a copy-paste from
another function that you forgot to update. Also, the function body
could be reduced to a one-liner: return queryDesc == activeQueryDesc;
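
That is:

bool
ProgressiveExplainIsActive(QueryDesc *queryDesc)
{
    return queryDesc == activeQueryDesc;
}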

There is a comment in ExplainOnePlan() that says "Handle progressive
explain cleanup manually if enabled as it is not called as part of
ExecutorFinish," but standard_ExecutorFinish does indeed call
ProgressiveExplainFinish(), so either the comment is misleading or the
code is wrong.

standard_ExecutorFinish() makes its call to ProgressiveExplainFinish()
dependent on the value of the progressive_explain GUC, but that GUC
could be changed in mid-query via set_config() or a plpgsql function
that calls SET, which could result in skipping the cleanup even when
it's needed. I think you should make the call unconditional and make
it entirely the job of ProgressiveExplainFinish() to decide whether
any cleanup is needed.

ProgressiveExplainFinish() calls ProgressiveExplainCleanup() in most
cases, but sometimes just disables the timeout instead. I think this
is weird. I think you should just always call
ProgressiveExplainCleanup() and make sure it's robust and cleans up
however much or little is appropriate in all cases.

On the flip side, I can't really see why
dsa_detach(queryDesc->pestate->pe_a) needs to be done while holding
ProgressiveExplainHashLock. Why not just have
ProgressiveExplainFinish() call ProgressiveExplainCleanup(), and then
afterwards it can do the dsa_detach() in the caller? Then
ProgressiveExplainCleanup() no longer needs an argument.

ProgressiveExplainPrint() can save a level of indentation in a large
chunk of the function by understanding that elog(ERROR) does not
return. You don't need to wrap everything that follows in else {}.

In the documentation table called pg-stat-progress-explain-view, some
descriptions end with a period and others do not.

Calling ProgressiveExplainTrigger() directly from
ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
we've tried hard to make our signal handlers do only very simple
things like setting a flag, and this one runs around the entire
PlanState tree modifying Very Important Things. I fear that's going to
be hard to make robust. Like, what happens if we're going around
trying to change ExecProcNode pointers while the calling code was also
going around trying to change ExecProcNode pointers? I can't quite see
how this won't result in chaos.

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 01:37:56
Message-ID: CAG0ozMr=2_1Q1SDvzErxKczNxhz-0XpXamvohywB5Pwt=+gb7A@mail.gmail.com
Lists: pgsql-hackers

Thanks Robert. Very thorough analysis there.

Things I don't comment on will be fixed without further discussion.

> There is a comment in ExplainOnePlan() that says "Handle progressive
> explain cleanup manually if enabled as it is not called as part of
> ExecutorFinish," but standard_ExecutorFinish does indeed call
> ProgressiveExplainFinish(), so either the comment is misleading or the
> code is wrong.

The comment is misleading. What I meant to say is that queries executed via
EXPLAIN (without ANALYZE) don't call ExecutorFinish, so
ProgressiveExplainFinish isn't called in that context. I will fix the
comment.

> Calling ProgressiveExplainTrigger() directly from
> ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
> we've tried hard to make our signal handlers do only very simple
> things like setting a flag, and this one runs around the entire
> PlanState tree modifying Very Important Things. I fear that's going to
> be hard to make robust. Like, what happens if we're going around
> trying to change ExecProcNode pointers while the calling code was also
> going around trying to change ExecProcNode pointers? I can't quite see
> how this won't result in chaos.

Agreed. In that other similar patch to log query plans after a signal is
sent from another session there was the same discussion and concerns.

I don't see another way of doing it here. This patch became 100x more
complex after I added GUC progressive_explain_min_duration, which required
changing the execProcNode wrapper on the fly.

I can see some alternatives here:

A) Use a temporary execProcNode wrapper set at query start here:

/*
 * If instrumentation is required, change the wrapper to one that just
 * does instrumentation. Otherwise we can dispense with all wrappers and
 * have ExecProcNode() directly call the relevant function from now on.
 */
if (node->instrument)
{
    /*
     * Use wrapper for progressive explains if enabled and the node
     * belongs to the currently tracked query descriptor.
     */
    if (progressive_explain == PROGRESSIVE_EXPLAIN_ANALYZE &&
        ProgressiveExplainIsActive(node->state->query_desc))
        node->ExecProcNode = ExecProcNodeInstrExplain;
    else
        node->ExecProcNode = ExecProcNodeInstr;

This wrapper will have additional logic that checks whether a boolean set
by the timeout function has changed. When that happens, the initial
progressive explain setup is done and execProcNode is unwrapped in place.
This should be safe.
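
Rough sketch of that temporary wrapper (names are illustrative, not final):

    static TupleTableSlot *
    ExecProcNodeInstrExplainPending(PlanState *node)
    {
        TupleTableSlot *result;

        /*
         * Cheap check of the boolean set by the timeout handler. Once it
         * fires, do the one-time progressive explain setup and swap the
         * wrapper in place, outside of any signal handler.
         */
        if (progressive_explain_pending)
        {
            if (!ProgressiveExplainIsActive(node->state->query_desc))
                ProgressiveExplainStart(node->state->query_desc);
            node->ExecProcNode = ExecProcNodeInstrExplain;
            return node->ExecProcNode(node);
        }

        /* otherwise behave exactly like ExecProcNodeInstr */
        InstrStartNode(node->instrument);
        result = node->ExecProcNodeReal(node);
        InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
        return result;
    }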

The problem here is that all queries would always be using the custom
wrapper until timeout is triggered, and that can add unnecessary overhead.

B) We get rid of the idea of applying progressive explains to
non-instrumented runs (which was my original idea). My main use-case here
is to see progress of instrumented runs anyway. For that idea we have 2
possibilities:

B1) Keep implementation as is, with all the existing GUCs to control what
is included in the printed plan. We just change GUC progressive_explain to
be a boolean again. If true, instrumentation will be enabled for the query
being executed and progressive explain will be printed consecutively.

B2) Get rid of all new GUCs I added and change the progressive explain
feature to become an option of the EXPLAIN command. Something like:

EXPLAIN (ANALYZE, PROGRESSIVE) SELECT * FROM ...

(B1) would allow progressive explains in regular queries, but I'm not sure
people would be willing to enable it globally as it adds instrumentation
overhead.

What do you think of the options?

Rafael.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 14:30:02
Message-ID: CA+TgmoYKovPZTiqhz2=X_Yp7QDm8FOwtkxvtTaO3NhGdgNo1Ng@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 27, 2025 at 9:38 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> > Calling ProgressiveExplainTrigger() directly from
> > ProgressiveExplainStartupTimeoutHandler() seems extremely scary --
>
> Agreed. In that other similar patch to log query plans after a signal is sent
> from another session there was the same discussion and concerns.
>
> I don't see another way of doing it here. This patch became 100x more complex
> after I added GUC progressive_explain_min_duration, which required changing the
> execProcNode wrapper on the fly.
>
> I can see some alternatives here:
>
> A) Use a temporary execProcNode wrapper set at query start here:
>
> The problem here is that all queries would always be using the custom
> wrapper until timeout is triggered, and that can add unnecessary overhead.
>
> B) We get rid of the idea of applying progressive explains to non-instrumented
> runs (which was my original idea). My main use-case here is to see progress
> of instrumented runs anyway. For that idea we have 2 possibilities:
>
> B1) Keep implementation as is, with all the existing GUCs to control what
> is included in the printed plan. We just change GUC progressive_explain to be
> a boolean again. If true, instrumentation will be enabled for the query being
> executed and progressive explain will be printed consecutively.
>
> B2) Get rid of all new GUCs I added and change the progressive explain feature
> to become an option of the EXPLAIN command. Something like:
>
> EXPLAIN (ANALYZE, PROGRESSIVE) SELECT * FROM ...
>
> (B1) would allow progressive explains in regular queries, but I'm not sure people
> would be willing to enable it globally as it adds instrumentation overhead.

I'm inclined to think that there isn't much value to this feature with
progressive_explain=explain. If you just want to print plans after a
certain amount of elapsed runtime, you can already use auto_explain to
do that. Unless there's some pretty large benefit to doing it with
this feature over what that already does, I think we shouldn't invent
a second method. Granted, auto_explain is a contrib module and this is
proposed as a core feature, but I feel like that use case of printing
the plan once is so different from what I see as the core value
proposition of this feature that I think it might just be confusing to
include it in scope. There's nothing actually "progressive" about
that, because you're just doing it once.

But having said that, I'm not quite sure I understand why you're
proposing (A) and (B1) as separate alternatives. Changing
progressive_explain to be a Boolean doesn't seem like it solves the
problem of needing to wrap ExecProcNode from a signal handler. The
only thing that seems to solve that problem is to instead do the
wrapping at the start of the query, which AIUI is (A). So I feel like
you should do (A) to solve the problem I thought we were talking about
and (B1) to make things simpler. Am I misunderstanding the trade-offs
here?

I did consider proposing (B2) yesterday, not so much because of this
issue but because I wondered whether it might just be a better
interface. But on reflection I decided it wasn't, because it removes
the option to just turn this feature on for all of your queries, which
somebody might want to do. Also, it would actually be kind of a weird
interface, because every other form of EXPLAIN returns the plan as
output and throws away the original query output; but this form would
store the plan ephemerally in a view and return the original output.
I'm sure we could make that work and find a way to document it but it
seems odd.

In general, I think we should err on the side of making this feature
safe even at some performance cost. If we put in place a version of
this feature that imposes a great performance overhead but does not
crash the server, some people will be unhappy, complain, and/or be
unable to use the feature. That will be sad. But, if we put in place a
version of this feature that performs great and occasionally crashes
the server, that will be much sadder. So let's do something we're very
confident is safe. There is always the opportunity to change things in
the future if we're more confident about some questionable choice then
than we are now -- performance optimization can be done after the
fact. But if it's not stable, the only thing we're likely to change in
the future is to revert the whole thing.

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 15:09:39
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 3/26/25 22:57, Rafael Thofehrn Castro wrote:
> So implementation was done based on transaction nested level. Cleanup is
> only
> performed when AbortSubTransaction() is called in the same transaction
> nested
> level as the one where the query is running. This covers both PL/pgSQL
> exception
> blocks and savepoints.
Thanks for your efforts!
Let me provide an alternative view of your code.
Postgres has a massive gap between the start of the execution and the
end. All the stuff that happens in between can't be observed. That's
sad. If you add a new hook into ExecProcNode (it may be designed as a
pointer in the PlanState to affect only the necessary nodes), you may push
the code out of the core and give other extensions additional tools.
I see your reasoning here [1], but I think with the commit 4fd02bf7cf9,
it should be revised.
As I see after reviewing your code, to design it as an extension, some
parallel query execution stuff needs to be exported, and Instrumentation
needs to be changed a little. For example, with progressive explain we
have a new node state, 'not visited yet', which is different from 'not
executed'.
What is the purpose of a new hook from a broad perspective? I designed
at least two features with such a hook in a Postgres fork: 1)
real-time execution statistics and 2) query interruption on an
arbitrary trigger (planning error detected, a signal arrived, too much
temp memory allocated, etc.). I am not sure if it would be interesting
for the community, but when a query has been executing for more than one
minute, I certainly want to control what is happening and have some
tools other than aborting the query.
The benefit of such an approach is that it is doable to change the
Instrumentation, add a hook now, and develop this extension without a
rush until it is stable - I think at least the case of parallel
execution may be enhanced.

[1]
https://www.postgresql.org/message-id/CAG0ozMrtK_u8Uf5KNZUmRNuMphV5tnC5DEhRBNRGK%2BK4L506xw%40mail.gmail.com

--
regards, Andrei Lepikhov


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 16:09:21
Message-ID: CAG0ozMpPhK1L5eUBq-FJU+uYTPoJ-EPfhPAHpyJvtJpBJHTcvQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> But having said that, I'm not quite sure I understand why you're
> proposing (A) and (B1) as separate alternatives. Changing
> progressive_explain to be a Boolean doesn't seem like it solves the
> problem of needing to wrap ExecProcNode from a signal handler. The
> only thing that seems to solve that problem is to instead do the
> wrapping at the start of the query, which AIUI is (A). So I feel like
> you should do (A) to solve the problem I thought we were talking about
> and (B1) to make things simpler. Am I misunderstanding the trade-offs
> here?

Both (A) and (B1) use the same strategy, which is to wrap at query start.
What changes is that (A), where we keep the 'explain' option, allows users
to still see the plan without having to incur the instrumentation overhead.

But (A) is doomed from the start, as the custom wrapper will have custom
logic that adds some overhead. That is what I was able to avoid with the
current patch, which does the wrapping in the timeout handler function.

In (B1) it is ok to have a custom wrapper that does the progressive explain
at fixed intervals, as that overhead is far less noticeable than the overhead
of the already existing ExecProcNodeInstr wrapper. I tested that.

Notice that the gist of (B1) is already part of the existing patch. If
progressive_explain is set to 'analyze' I set a wrapper
ExecProcNodeInstrExplain at query start.

So I think the way to go is with (B1), where the ExecProcNodeInstrExplain
wrapper will continue to do what it does, but only after
progressive_explain_min_duration has passed (a boolean flag set by the
timeout handler function). And I get rid of the 'explain' option.

As you said, visibility on the non-instrumented query plan is already a
feature of auto_explain.

> The benefit of such an approach is that it is doable to change the
> Instrumentation, add a hook now, and develop this extension without a
> rush until it is stable - I think at least the case of parallel
> execution may be enhanced.

Thanks for the input, Andrei! The main issue is that progressive explain
touches a bunch of different parts of the code that would require additional
hooks: for example, to adjust the execProcNode wrapper at query start. Also
a hook for xact/subxact cleanups, which I don't think exists yet.

The biggest challenge is what lies inside ExplainNode(), an already complex
function that calls several other helper functions. Progressive explain had
to change several parts of it.

The only solution I found back then while thinking about this idea was to
make ExplainNode itself a hook, and the extension would duplicate the whole
code with the additional custom logic. And that is definitely not a good
idea.

With the new hooks Robert added it is indeed a big step ahead, but we
still need to deal with instrumentation and other nuances, as you said.

I am definitely not the authority here to talk about the best way forward.
If there is an agreement in turning this into an extension, it can be a new
feature in auto_explain.

Rafael.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 16:42:48
Message-ID: CA+TgmoYqJCqDPPbu3XNiE5irELuFqp2pn0aT=A5g6KxNNf27sg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 28, 2025 at 12:09 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> I am definitely not the authority here to talk about the best way forward.
> If there is an agreement in turning this into an extension, it can be a new
> feature in auto_explain.

I'm not against adding some more hooks to explain.c, but I don't
really see how you could add hooks to explain.c that would allow
somebody to implement this feature out of core, unless either (a) you
are thinking that they will copy and paste a lot of core code or (b)
you make the hooks extremely fine-grained and special-purpose,
designed to accomplish the extremely precise thing that this feature
would need them to do. But that just seems like bad hook design. I
don't think hooks have to be amazingly general in order to be
accepted, but if there's no use of the hook that doesn't involve a big
cut-and-paste operation, that's poor, and if the hook can only ever be
used to implement one thing, you might as well put the thing in core.

I feel like there are just two separate things here. There is Rafael's
progressive EXPLAIN patch which, as far as I can see currently, is
really only going to work as an in-core feature, and then there is
Andrei's desire for more hooks, which may very well also be a good
idea but not the same idea.

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 19:04:40
Message-ID: CA+TgmoY0hj+P68iB_JrVMoEV4mbAKthiY2i36ksgFPfU7i80SQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 28, 2025 at 12:09 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> As you said, visibility on the non instrumented query plan is already a feature
> of auto_explain.

True, although thinking about it more, they're not being sent to the
same place. auto_explain goes to the log, and this goes to a view.
What about something like this:

progressive_explain = on | off
progressive_explain_interval = 5s

If progressive_explain is off, then we don't populate the view. If
it's on, then we populate the view during query startup. Every time
the interval elapses, we update it again. If you want to update the
view just once at query startup and not thereafter, you can set
progressive_explain_interval = 0.
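
So, concretely:

    SET progressive_explain = on;
    SET progressive_explain_interval = '5s'; -- populate at startup, refresh every 5s
    SET progressive_explain_interval = 0;    -- populate at startup only
    SET progressive_explain = off;           -- don't populate the view at all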

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 19:58:51
Message-ID: CAG0ozMo=quwiEfU40n94NGqxB4DydKxNWyeD6U2OX68Z9RhSbg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> True, although thinking about it more, they're not being sent to the
> same place. auto_explain goes to the log, and this goes to a view.
> What about something like this:
> progressive_explain = on | off
> progressive_explain_interval = 5s
> If progressive_explain is off, then we don't populate the view. If
> it's on, then we populate the view during query startup. Every time
> the interval elapses, we update it again. If you want to update the
> view just once at query startup and not thereafter, you can set
> progressive_explain_interval = 0.

That also works. My concern is that with that approach there is a huge
"hidden" change of execution behavior between setting
progressive_explain_interval to 0 and something greater than that.

Setting it to 0 doesn't do anything other than dumping the plain plan at
query start. Setting any other value will enable instrumentation under the
hood. This would have to be very well documented.

I'm still more inclined to use:

progressive_explain = off | explain | analyze
progressive_explain_interval = any millisecond greater than a min threshold
(currently 10ms). It doesn't make sense to be dumping the instrumented plan
every 1ms for example, IMHO.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 20:01:55
Message-ID: CA+TgmobGZseEpTeyhzstFs6rVKaB-_CfiC7gTuvz=QhWc=aR=w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 28, 2025 at 3:59 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> > True, although thinking about it more, they're not being sent to the
> > same place. auto_explain goes to the log, and this goes to a view.
> > What about something like this:
> > progressive_explain = on | off
> > progressive_explain_interval = 5s
>
> I'm still more inclined to use:
>
> progressive_explain = off | explain | analyze
> progressive_explain_interval = any millisecond greater than a min threshold
> (currently 10ms). It doesn't make sense to be dumping the instrumented plan
> every 1ms for example, IMHO.

I still have trouble understanding what that means. Is the interval
irrelevant except when progressive_explain = analyze?

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 20:04:04
Message-ID: CAG0ozMovsrnnLGEM6h0puBr-ihLjYDp11kvnCrnym3D28BZ7eg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> I still have trouble understanding what that means. Is the interval
> irrelevant except when progressive_explain = analyze?

That is correct. Interval currently is only used when instrumentation
is enabled through progressive_explain = analyze.


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 20:12:40
Message-ID: CAG0ozMo2HQjmpn9AHGXOo726-PgH=+XmkoPUVxAqvcOS6puq8g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> That is correct. Interval currently is only used when instrumentation
> is enabled through progressive_explain = analyze.

I guess there would still be a point in printing the plan on a regular
interval when instrumentation is disabled. In that case the only thing we
would see changing in the plan is the node currently being executed. But
with that we add more overhead to progressive_explain = 'explain', which in
that case would also require a custom execProcNode wrapper.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-28 20:29:22
Message-ID: CA+Tgmobjnj0FJj-0GUSXCga3-mUe819Wj=EeZqY=QydMWGOguA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 28, 2025 at 4:12 PM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> > That is correct. Interval currently is only used when instrumentation
> > is enabled through progressive_explain = analyze.
>
> I guess there would still be a point in printing the plan on a regular interval
> when instrumentation is disabled. In that case the only thing we would see
> changing in the plan is the node currently being executed. But with that we
> add more overhead to progressive_explain = 'explain', which in that case
> would also require a custom execProcNode wrapper.

I don't think that's at all worth it. I had even considered asking for
the current node stuff to be removed altogether. Doing regular updates
to only change that seems like a real waste. It's barely ever going to
be useful to see the node being executed right this exact instant -
what you care about is which nodes suck up resources over time.

As far as option naming is concerned, I'm open to input from others,
but personally I feel like ANALYZE is a bit opaque as an EXPLAIN
option in general. We've had previous discussions about the fact that
we might have wanted to name it EXECUTE if EXPLAIN EXECUTE didn't
already mean something else. I think for most people, what ANALYZE
means - somewhat counterintuitively - is that we're actually going to
run the query. But here we're always going to run the query, so my
brain just goes into a tailspin trying to figure out what ANALYZE
means. So, personally, I would like to see progressive_explain = off |
explain | analyze go away in favor of progressive_explain = off | on.

I think we have two ideas on how to get there so far:

(1) Just get rid of what you call progressive_explain = explain. After
all, that amounts to just serializing the plan once, and maybe that's
not actually such a great feature. With that change, then we only have
two remaining values for the progressive_explain, and we can call them
"off" and "on".

(2) Merge progressive_explain = explain and progressive_explain =
analyze into a single value, progressive_explain = on. To tell which
is intended, just check progressive_explain_interval. If it's zero,
then there's no repeat, so we mean what you currently call
progressive_explain = explain i.e. serialize once. Otherwise we mean
progressive_explain = analyze.

I'm open to idea #3, but I'm pretty resistant to the idea of
progressive_explain remaining three-valued as it is today. If 27
people show up to say that they understand what that means perfectly
and Robert's a big fat idiot, I shall of course defer to the
consensus, but I kind of think I might not be alone in finding this
naming confusing.

For what it's worth, I have trouble seeing how anyone gets confused if
we do what I propose as #2. I mean, I agree with your point that they
could misunderstand how much overhead different settings will cause,
so we would need to document that carefully. But if you're just
wondering about the behavior, then I feel like people have a pretty
good chance of guessing that progressive_explain=on,
progressive_explain_interval=0 means do it once. You could think that
means do it at maximum speed, but that seems like a somewhat naive
interpretation. You could also think it disables the feature,
effectively negating progressive_explain=on, but then you should start
to wonder why there are two GUCs. If you don't think either of those
things, then I think "do it once" is the only other reasonable
interpretation.

Of course, sometimes what I think turns out to be completely wrong!

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-29 17:51:20
Message-ID: CAG0ozMpug5icxk8Mh_uiGc3OX3zsx67RsRA9Q-o+3ZXn-7M9mQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> (2) Merge progressive_explain = explain and progressive_explain =
> analyze into a single value, progressive_explain = on. To tell which
> is intended, just check progressive_explain_interval. If it's zero,
> then there's no repeat, so we mean what you currently call
> progressive_explain = explain i.e. serialize once. Otherwise we mean
> progressive_explain = analyze.

Implemented this version. New patch has the following characteristics:

- progressive_explain is a boolean

- GUC progressive_explain_min_duration is removed

- if progressive_explain_interval = 0, update plan in memory only at query
start

- if progressive_explain_interval > 0, share the plan in memory at query
start and update it every progressive_explain_interval (instrumentation is
enabled automatically)

- The plan shared in pg_stat_progress_explain at query start does not
contain any instrumentation detail ((never executed), (actual time), etc)
if progressive_explain_interval = 0, even if the query is started with
EXPLAIN ANALYZE. My reasoning here is that if the plan will be shared only
once we are actually interested in seeing the plan itself and not
instrumentation progress.

Fixed other comments you shared in previous emails too:

> void
> AtEOSubXact_ProgressiveExplain(bool isCommit, int nestDepth)
> {
>     if (activeQueryDesc != NULL &&
>         activeQueryXactNestLevel >= nestDepth)
>     {
>         if (isCommit)
>             elog(WARNING, "leaked progressive explain query descriptor");
>         ProgressiveExplainCleanup(NULL);
>     }
> }
> By including the is-commit case in there, we can catch any bugs where
> things aren't cleaned up properly before a transaction is committed.

Added the isCommit logic to both transaction and subtransaction commits
and aborts, which will warn about leakage if cleanup was not properly
done at commit. Changed function names back to AtEOXact and AtEOSubXact,
as opposed to AtAbort and AtSubAbort.

> We generally want to test >= nestDepth instead of == nestDepth in case
> multiple subtransaction levels abort all at once; I'm not sure it
> matters here, but even if it doesn't, it's best to be consistent with
> the practice elsewhere. Having {Commit,Abort}SubTransaction pass the
> nestDepth instead of calling GetCurrentTransactionNestLevel() also has
> precedent e.g. see AtEOSubXact_HashTables.

Done.

> As a further refinement, consider initializing
> activeQueryXactNestLevel to -1 and resetting it to that value each
> time you end a progressive EXPLAIN, so that activeQueryDesc != NULL if
> and only if activeQueryXactNestLevel >= 0. Then the test above can be
> simplified to just if (activeQueryXactNestLevel >= nestDepth).

Done.

> standard_ExecutorFinish() makes its call to ProgressiveExplainFinish()
> dependent on the value of the progressive_explain GUC, but that GUC
> could be changed in mid-query via set_config() or a plpgsql function
> that calls SET, which could result in skipping the cleanup even when
> it's needed. I think you should make the call unconditional and make
> it entirely the job of ProgressiveExplainFinish() to decide whether
> any cleanup is needed.

Done.

> ProgressiveExplainFinish() calls ProgressiveExplainCleanup() in most
> cases, but sometimes just disables the timeout instead. I think this
> is weird. I think you should just always call
> ProgressiveExplainCleanup() and make sure it's robust and cleans up
> however much or little is appropriate in all cases.

Now that I removed all the execProcNode wrapping and the conditional cleanup
based on progressive_explain_min_duration (which got removed), this
part became much simpler. ProgressiveExplainFinish() always calls
ProgressiveExplainCleanup().

> On the flip side, I can't really see why
> dsa_detach(queryDesc->pestate->pe_a) needs to be done while holding
> ProgressiveExplainHashLock. Why not just have
> ProgressiveExplainFinish() call ProgressiveExplainCleanup(), and then
> afterwards it can do the dsa_detach() in the caller? Then
> ProgressiveExplainCleanup() no longer needs an argument.

That is perfect. Implemented.

> ProgressiveExplainPrint() can save a level of indentation in a large
> chunk of the function by understanding that elog(ERROR) does not
> return. You don't need to wrap everything that follows in else {}.

Fixed.

I will not fix documentation for now as we are not done with
implementation changes yet. Once we agree that the code logic is
safe and sound we can discuss cosmetics (feature name, GUCs, view, etc).

Rafael.

Attachment Content-Type Size
v14-0001-Proposal-for-progressive-explains.patch application/octet-stream 83.9 KB

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com
Cc: Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-31 00:23:38
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro
<rafaelthca(at)gmail(dot)com> wrote:
> The wrapper code was implemented by torikoshia
> (torikoshia(at)oss(dot)nttdata(dot)com),
> so adding the credits here.

On Thu, Mar 20, 2025 at 5:35 AM Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
> Without having the prior discussion near to hand, I *think* that the
> reason we wanted to do this wrap/unwrap stuff is to make it so that
> the progressive EXPLAIN code could only execute when entering a new
> plan node rather than at any random point inside of that plan node,
> and that does seem a lot safer than the alternative.

Your assumption is correct. Various approaches were suggested, such as
picking a small number of safe and sufficient places for this feature or
classifying CFI() calls into safe and unsafe ones. However, in the end,
the wrapping approach [1] was the only one that remained.

On 2025-03-30 02:51, Rafael Thofehrn Castro wrote:

> Implemented this version. New patch has the following characteristics:

I haven't looked into the code yet, but when I ran the commands below
during make installcheck, there was an error and an assertion failure:

=# select * from pg_stat_progress_explain;
=# \watch 0.1

ERROR: could not attach to dynamic shared area

WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 73180
TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 73181
TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 73182
TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 73183

(The stack traces of the failing parallel workers were interleaved in the
log; they are identical, so the trace is shown once below:)

0  postgres  0x000000010365f5c4  ExceptionalCondition + 236
1  postgres  0x00000001035a7830  get_parameter + 1076
2  postgres  0x000000010359ff2c  get_rule_expr + 276
3  postgres  0x00000001035a841c  get_rule_expr_paren + 168
4  postgres  0x00000001035a82c4  get_oper_expr + 292
5  postgres  0x00000001035a01f8  get_rule_expr + 992
6  postgres  0x0000000103598520  deparse_expression_pretty + 176
7  postgres  0x0000000103598e78  deparse_expression + 76
8  postgres  0x0000000102f94198  show_expression + 100
9  postgres  0x0000000102f97690  show_qual + 112
10 postgres  0x0000000102f93734  show_scan_qual + 132
11 postgres  0x0000000102f90680  ExplainNode + 6828
12 postgres  0x0000000102f8d398  ExplainPrintPlan + 540
13 postgres  0x0000000102f9b974  ProgressiveExplainPrint + 72
14 postgres  0x0000000102f9b920  ProgressiveExplainStart + 660
15 postgres  0x00000001030771e0  standard_ExecutorStart + 984
16 postgres  0x0000000103076de8  ExecutorStart + 112
17 postgres  0x0000000103080d90  ParallelQueryMain + 292
18 postgres  0x0000000102df7ef8  ParallelWorkerMain + 1712
19 postgres  0x00000001032a5d60  BackgroundWorkerMain + 824
20 postgres  0x00000001032a9ee8  postmaster_child_launch + 492
21 postgres  0x00000001032b4c10  StartBackgroundWorker + 416
22 postgres  0x00000001032af9d8  maybe_start_bgworkers + 552
23 postgres  0x00000001032b26cc  LaunchMissingBackgroundProcesses + 1316
24 postgres  0x00000001032afcb0  ServerLoop + 616
25 postgres  0x00000001032ae55c  PostmasterMain + 6632
26 postgres  0x0000000103121160  main + 952
27 dyld      0x000000019cdc0274  start + 2840

[1]
https://www.postgresql.org/message-id/ac6c51071316279bf903078cf264c37a%40oss.nttdata.com

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.


From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-31 06:24:26
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 3/31/25 02:23, torikoshia wrote:
> On Fri, Mar 7, 2025 at 6:43 AM Rafael Thofehrn Castro
>> Implemented this version. New patch has the following characteristics:
>
>
> I haven't looked into the code yet, but when I ran below commands during
> make installcheck, there was an error and an assertion failure
>
>   =# select * from pg_stat_progress_explain;
>   =# \watch 0.1
Yeah, that's to be expected.
I think many corner cases may be found: a hash table in the middle of
being filled, open file descriptors, an incorrect combination of variables,
'not yet visited' subtrees - who knows what else? So, last time, I
just ended up with the idea that using the explain code is a bit
dangerous - in the middle of execution, it is enough to expose only
basic data - rows, numbers and timings. That seems safe to gather.

--
regards, Andrei Lepikhov


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-31 13:30:34
Message-ID: CA+TgmobUpRO4j-cYzOHnDMuksZiRy87nPER2bkFamtpthmxoaw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for this valuable testing. I think this is actually a really
good idea for how to test something like this, because the regression
tests contain lots of different queries that do lots of weird things.

On Sun, Mar 30, 2025 at 8:23 PM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> I haven't looked into the code yet, but when I ran below commands during
> make installcheck, there was an error and an assertion failure
>
> =# select * from pg_stat_progress_explain;
> =# \watch 0.1
>
> ERROR: could not attach to dynamic shared area

This seems like a race condition. Probably some backend's dsa_area
went away between the time we got a pointer to it and the time we
actually attached to it. We should be able to find some way of
handling this without an error, like treating the case where the DSA
area is missing the same as the case where there was no DSA pointer in
the first place. However, this is also making me wonder if we
shouldn't be using one DSA shared by all backends rather than a
separate DSA area for every backend. That would require more care to
avoid leaks, but I'm not sure that it's a good idea to be creating and
destroying a DSA area for every single query. But I'm not 100% sure
that's a problem.

> TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
> "ruleutils.c", Line: 8802, PID: 73180
> TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
> "ruleutils.c", Line: 8802, PID: 73181

I wonder what is happening here. One systemic danger of the patch is
that there can be a difference between what must be true at the *end*
of a query and what must be true *during* a query. Anything that can't
happen at the end but can happen in the middle is something that the
patch will need to do something about in order to work properly. But I
don't see how that can explain this failure, because AFAIR the patch
just prints the same things that would have been printed by any other
EXPLAIN, with the same stack of namespaces. It seems possible that
this is a pre-existing bug: the regression tests might contain a query
that would cause EXPLAIN to fail, but because the regression tests
don't actually EXPLAIN that query, no failure occurs. But it could
also be something else; for example, maybe this patch is trying to
EXPLAIN something that couldn't be used with a regular EXPLAIN for
some reason. Or maybe the patch doesn't actually succeed in doing the
EXPLAIN with the correct namespace stack in all cases.

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-03-31 14:43:21
Message-ID: CAG0ozMqDBp7ubg96ukY8nbAGb6aTsLxDVQ-q-1Gv6UAe-PW-AA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> I haven't looked into the code yet, but when I ran below commands during
> make installcheck, there was an error and an assertion failure

Thanks for the report. I actually made a nasty mistake in the last
patch after code refactoring, which was to not properly check that
a QueryDesc is already being tracked. So every subquery call
in the same query was allocating DSAs, and those segments were not
being properly cleared. So the patch is broken, which probably explains
your crashes.

Just made a one-line fix, and make installcheck looks clean now. Will
do more tests before sending another version.


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-04-01 06:23:43
Message-ID: CAG0ozMrqMKcbcU1L9ON44himeEd=6qZ2_jH1fvF0yYD0zSgDYQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hello again,

> ERROR: could not attach to dynamic shared area

In addition to that refactoring issue, the current patch had a race
condition in pg_stat_progress_explain to access the DSA of a process
running a query that gets aborted.

While discussing with Robert we agreed that it would be wiser to take
a step back and change the strategy used to share progressive explain
data in shared memory.

Instead of using per-backend DSAs shared via a hash structure, I now
define a dsa_pointer and an LWLock in each backend's PGPROC.

A global DSA is created by the first backend that attempts to use
the progressive explain feature. After the DSA is created, subsequent
uses of the feature will just allocate memory there and reference it
via PGPROC's dsa_pointer.

This solves the race condition reported by Torikoshi and improves
concurrency performance, as we no longer have a global LWLock
controlling shared memory access, but a per-backend LWLock.
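
Roughly like this (field and variable names here are illustrative, see the
patch for the real ones):

    /* new per-backend fields in struct PGPROC */
    dsa_pointer progressiveExplain;     /* plan text, or InvalidDsaPointer */
    LWLock      progressiveExplainLock; /* protects the field above */

    /* reader side (pg_stat_progress_explain), for each backend: */
    LWLockAcquire(&proc->progressiveExplainLock, LW_SHARED);
    if (DsaPointerIsValid(proc->progressiveExplain))
        plan = (char *) dsa_get_address(progressive_explain_area,
                                        proc->progressiveExplain);
    LWLockRelease(&proc->progressiveExplainLock);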

Performed the same tests done by Torikoshi and it looks like we are
good now, even with more frequent inspections of pg_stat_progress_explain
(\watch 0.01).

Rafael.

Attachment Content-Type Size
v15-0001-Proposal-for-progressive-explains.patch application/octet-stream 86.1 KB

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-04-01 13:38:23
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 2025-04-01 15:23, Rafael Thofehrn Castro wrote:
> Hello again,
>
>> ERROR: could not attach to dynamic shared area
>
> In addition to that refactoring issue, the current patch had a race
> condition in pg_stat_progress_explain to access the DSA of a process
> running a query that gets aborted.
>
> While discussing with Robert we agreed that it would be wiser to take
> a step back and change the strategy used to share progressive explain
> data in shared memory.
>
> Instead of using per-backend DSAs shared via a hash structure, I now
> define a dsa_pointer and an LWLock in each backend's PGPROC.
>
> A global DSA is created by the first backend that attempts to use
> the progressive explain feature. After the DSA is created, subsequent
> uses of the feature will just allocate memory there and reference it
> via PGPROC's dsa_pointer.
>
> This solves the race condition reported by Torikoshi and improves
> concurrency performance, as we no longer have a global LWLock
> controlling shared memory access, but a per-backend LWLock.
>
> Performed the same tests done by Torikoshi and it looks like we are
> good now, even with more frequent inspections of pg_stat_progress_explain
> (\watch 0.01).

Thanks for updating the patch!
Have you tested enabling progressive_explain?

When I ran the 'make installcheck' test again setting
progressive_explain to on, there was the same assertion failure:

TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 116832
postgres: parallel worker for PID 116822
(ExceptionalCondition+0x98)[0xb7311ea8bf80]
postgres: parallel worker for PID 116822 (+0x89de6c)[0xb7311e9ede6c]
postgres: parallel worker for PID 116822 (+0x89eb68)[0xb7311e9eeb68]
postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
postgres: parallel worker for PID 116822 (+0x8a1d10)[0xb7311e9f1d10]
postgres: parallel worker for PID 116822 (+0x89ed80)[0xb7311e9eed80]
postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
postgres: parallel worker for PID 116822 (+0x89f174)[0xb7311e9ef174]
postgres: parallel worker for PID 116822 (+0x89e78c)[0xb7311e9ee78c]
postgres: parallel worker for PID 116822 (+0x89f0b8)[0xb7311e9ef0b8]
postgres: parallel worker for PID 116822 (+0x8928dc)[0xb7311e9e28dc]
postgres: parallel worker for PID 116822
(deparse_expression+0x34)[0xb7311e9e2834]
postgres: parallel worker for PID 116822 (+0x347870)[0xb7311e497870]
postgres: parallel worker for PID 116822 (+0x3478e4)[0xb7311e4978e4]
postgres: parallel worker for PID 116822 (+0x347970)[0xb7311e497970]
...
TRAP: failed Assert("param->paramkind == PARAM_EXTERN"), File:
"ruleutils.c", Line: 8802, PID: 116831
[115650] LOG: 00000: background worker "parallel worker" (PID 116831)
was terminated by signal 6: Aborted
[115650] DETAIL: Failed process was running: explain (analyze, costs
off, summary off, timing off, buffers off) select count(*) from ab where
(a = (select 1) or a = (select 3)) and b = 2
[115650] LOCATION: LogChildExit, postmaster.c:2846

We can reproduce it as follows:

show progressive_explain;
 progressive_explain
---------------------
 on

create table ab (a int not null, b int not null) partition by list (a);
create table ab_a2 partition of ab for values in (2) partition by list (b);
create table ab_a2_b1 partition of ab_a2 for values in (1);
create table ab_a2_b2 partition of ab_a2 for values in (2);
create table ab_a2_b3 partition of ab_a2 for values in (3);
create table ab_a1 partition of ab for values in (1) partition by list (b);
create table ab_a1_b1 partition of ab_a1 for values in (1);
create table ab_a1_b2 partition of ab_a1 for values in (2);
create table ab_a1_b3 partition of ab_a1 for values in (3);
create table ab_a3 partition of ab for values in (3) partition by list (b);
create table ab_a3_b1 partition of ab_a3 for values in (1);
create table ab_a3_b2 partition of ab_a3 for values in (2);
create table ab_a3_b3 partition of ab_a3 for values in (3);

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set max_parallel_workers_per_gather = 2;

explain (analyze, costs off, summary off, timing off, buffers off)
select count(*) from ab where (a = (select 1) or a = (select 3)) and b = 2;
WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Note that there is no need to access pg_stat_progress_explain.

Could you please check if you can reproduce this?

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-04-01 14:37:58
Message-ID: CA+Tgmobd5RH156_QEmTqHtgbtqDy=G2WJ1aJ0m-BOW4qqWVAfA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 1, 2025 at 9:38 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> Could you please check if you can reproduce this?

I can, and I now see that this patch has a pretty big design problem.
The assertion failure occurs when a background worker tries to call
ruleutils.c's get_parameter(), which tries to find the expression from
which the parameter was computed. To do that, it essentially looks
upward in the plan tree until it finds the source of the parameter.
For example, if the parameter came from the bar_id column of relation
foo, we would then deparse the parameter as foo.bar_id, rather than as
$1 or similar. However, this doesn't work when the deparsing is
attempted from within a parallel worker, because the parallel worker
only gets the portion of the plan it's attempting to execute, not the
whole thing. In your test case, the whole plan looks like this:

Aggregate
  InitPlan 1
    -> Result
  InitPlan 2
    -> Result
  -> Gather
       Workers Planned: 2
       -> Parallel Append
            -> Parallel Seq Scan on ab_a1_b2 ab_1
                 Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))
            -> Parallel Seq Scan on ab_a2_b2 ab_2
                 Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))
            -> Parallel Seq Scan on ab_a3_b2 ab_3
                 Filter: ((b = 2) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))

Those references to (InitPlan 1).col1 are actually Params. I think
what's happening here is approximately that the worker tries to find
the source of those Params, but (InitPlan 1) is above the Gather node
and thus not available to the worker, and so the worker can't find it
and the assertion fails.

In one sense, it is not hard to fix this: the workers shouldn't really
be doing progressive_explain at all, because then we get a
progressive_explain from each process individually instead of one for
the query as a whole, so we could just think of having the workers
ignore the progressive_explain GUC. However, one thing I realized
earlier this morning is that the progressive EXPLAIN can't show any of
the instrumentation that is relayed back from workers to the leader
only at the end of the execution. See the code in ParallelQueryMain()
just after ExecutorFinish().
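
That is, roughly this part (paraphrasing execParallel.c):

    /* Shut down the executor */
    ExecutorFinish(queryDesc);

    /* Instrumentation is reported back to the leader only here, at the end */
    if (instrumentation != NULL)
        ExecParallelReportInstrumentation(queryDesc->planstate,
                                          instrumentation);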

What I think this means is that this patch needs significant
rethinking to cope with parallel query. I don't think users are going
to be happy with a progressive EXPLAIN that just ignores what the
workers are doing, and I don't think they're going to be happy with N
separate progressive explains that they have to merge together to get
an overall picture of what the query is doing, and I'm absolutely
positive that they're not going to be happy with something that
crashes. I think there may be a way to come up with a good design here
that avoids these problems, but we definitely don't have time before
feature freeze (not that we were necessarily in a great place to think
of committing this before feature freeze anyway, but it definitely
doesn't make sense now that I understand this problem).

--
Robert Haas
EDB: http://www.enterprisedb.com


From: Rafael Thofehrn Castro <rafaelthca(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Progressive explain
Date: 2025-04-01 17:17:58
Message-ID: CAG0ozMpr67NqfsktboC9o=JJ7jSZpMWXpNn45uJ7aZOuT4vhYQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

My bad, I mistakenly did the tests without assertions enabled in the last
2 days, so I couldn't catch that assertion failure. I was able to reproduce
it, thanks.

I guess when the code was designed we were not expecting to be doing
explains in parallel workers.

One comment is that this has nothing to do with instrumentation, so the
hacks done with instrument objects are not to blame here.

What is interesting is that removing that Assert (not saying we should do
that) fixes it, and there doesn't seem to be any other Assert complaining
anywhere. The feature works as expected and there are no crashes.

> What I think this means is that this patch needs significant
> rethinking to cope with parallel query. I don't think users are going
> to be happy with a progressive EXPLAIN that just ignores what the
> workers are doing, and I don't think they're going to be happy with N
> separate progressive explains that they have to merge together to get
> an overall picture of what the query is doing, and I'm absolutely
> positive that they're not going to be happy with something that
> crashes. I think there may be a way to come up with a good design here
> that avoids these problems, but we definitely don't have time before
> feature freeze (not that we were necessarily in a great place to think
> of committing this before feature freeze anyway, but it definitely
> doesn't make sense now that I understand this problem).

Yeah, that is a fair point. But I actually envisioned this feature to also
target parallel workers, from the start. I see a lot of value in being able
to debug what each parallel worker is doing inside its black box. It could
be that a worker is lagging behind the others because it is dealing with
non-cached data blocks, for example.

According to my tests the parallel workers can actually push instrumentation
to the parent while still running. It all depends on the types of operations
being performed in the plan.

For example, if the parallel worker is doing a parallel seq scan, it will
continuously send chunks of rows to the parent, and instrumentation goes
with the data. So we don't actually need to wait for the workers to finish
for instrumentation on the parent to get updated. Here is what I got from
Torikoshi's test after removing that Assert and enabling instrumented
progressive explain (progressive_explain_interval = '10ms'):

-[ RECORD 1 ]---------------------------------------------------------------
datid       | 5
datname     | postgres
pid         | 169555
last_update | 2025-04-01 12:44:07.36775-03
query_plan  | Gather (cost=0.02..137998.03 rows=10000002 width=8) (actual time=0.715..868.619 rows=9232106.00 loops=1) (current)
            |   Workers Planned: 2
            |   Workers Launched: 2
            |   InitPlan 1
            |     -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1.00 loops=1)
            |   InitPlan 2
            |     -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.000..0.000 rows=1.00 loops=1)
            |   -> Parallel Append (cost=0.00..137998.01 rows=4166669 width=8) (actual time=0.364..264.804 rows=1533011.00 loops=1)
            |        -> Seq Scan on ab_a2_b1 ab_2 (cost=0.00..0.00 rows=1 width=8) (never executed)
            |             Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))
            |        -> Seq Scan on ab_a3_b1 ab_3 (cost=0.00..0.00 rows=1 width=8) (never executed)
            |             Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))
            |        -> Parallel Seq Scan on ab_a1_b1 ab_1 (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.361..192.896 rows=1533011.00 loops=1)
            |             Filter: ((b = 1) AND ((a = (InitPlan 1).col1) OR (a = (InitPlan 2).col1)))
-[ RECORD 2 ]---------------------------------------------------------------
datid       | 5
datname     | postgres
pid         | 169596
last_update | 2025-04-01 12:44:07.361846-03
query_plan  | Parallel Append (cost=0.00..137998.01 rows=4166669 width=8) (actual time=0.990..666.845 rows=3839515.00 loops=1) (current)
            |   -> Seq Scan on ab_a2_b1 ab_2 (cost=0.00..0.00 rows=1 width=8) (never executed)
            |        Filter: ((b = 1) AND ((a = $0) OR (a = $1)))
            |   -> Seq Scan on ab_a3_b1 ab_3 (cost=0.00..0.00 rows=1 width=8) (never executed)
            |        Filter: ((b = 1) AND ((a = $0) OR (a = $1)))
            |   -> Parallel Seq Scan on ab_a1_b1 ab_1 (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.985..480.531 rows=3839515.00 loops=1)
            |        Filter: ((b = 1) AND ((a = $0) OR (a = $1)))
-[ RECORD 3 ]---------------------------------------------------------------
datid       | 5
datname     | postgres
pid         | 169597
last_update | 2025-04-01 12:44:07.36181-03
query_plan  | Parallel Append (cost=0.00..137998.01 rows=4166669 width=8) (actual time=1.003..669.398 rows=3830293.00 loops=1) (current)
            |   -> Seq Scan on ab_a2_b1 ab_2 (cost=0.00..0.00 rows=1 width=8) (never executed)
            |        Filter: ((b = 1) AND ((a = $0) OR (a = $1)))
            |   -> Seq Scan on ab_a3_b1 ab_3 (cost=0.00..0.00 rows=1 width=8) (actual time=0.019..0.019 rows=0.00 loops=1)
            |        Filter: ((b = 1) AND ((a = $0) OR (a = $1)))
            |   -> Parallel Seq Scan on ab_a1_b1 ab_1 (cost=0.00..117164.67 rows=4166667 width=8) (actual time=0.979..482.420 rows=3830293.00 loops=1)
            |        Filter: ((b = 1) AND ((a = $0) OR (a = $1)))

But yeah, if the parallel worker does a hash join and the HASH node is a
huge block of sub-operations, then yes, upstream will not see anything until
the HASH is computed. That is why, IMO, having visibility into background
workers has value too.

I even designed a query visualizer that combines plans of the parent and
parallel workers in side-by-side windows and includes other stats (per-process
wait events, CPU, memory consumption, temp file generation), allowing us to
correlate them with specific operations being done in the plans.

But Robert is right, the engineering required to make this happen isn't
easy. Changing that Assert could open a Pandora's box of other issues, and
we have no time to do that for PG18.

Rafael.