Adding skip scan (including MDAM style range skip scan) to nbtree

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-06-26 19:16:07
Message-ID: CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
Lists: pgsql-hackers

Attached is a POC patch that adds skip scan to nbtree. The patch
teaches nbtree index scans to efficiently use a composite index on
'(a, b)' for queries with a predicate such as "WHERE b = 5". This is
feasible in cases where the total number of distinct values in the
column 'a' is reasonably small (think tens or hundreds, perhaps even
thousands for very large composite indexes).

In effect, a skip scan treats this composite index on '(a, b)' as if
it were a series of subindexes -- one subindex per distinct value in
'a'. We can exhaustively "search every subindex" using an index qual
that behaves just like "WHERE a = ANY(<every possible 'a' value>) AND
b = 5" would behave.

This approach might be much less efficient than an index scan that can
use an index on 'b' alone, but skip scanning can still be orders of
magnitude faster than a sequential scan. The user may very well not
have a dedicated index on 'b' alone, for whatever reason.

Note that the patch doesn't just target these simpler "leading index
column omitted from the predicate" cases. It's far more general
than that -- skipping attributes (or what the patch refers to as skip
arrays) can be freely mixed with SAOPs/conventional arrays, in any
order you can think of. They can also be combined with inequalities to
form range skip arrays.

This patch is a direct follow-up to the Postgres 17 work that became
commit 5bf748b8. Making everything work well together is an important
design goal here. I'll talk more about that further down, and will
show a benchmark example query that'll give a good general sense of
the value of the patch with these more complicated cases.

A note on terminology
=====================

The terminology in this area has certain baggage. Many of us will
recall the patch that implemented loose index scan. That patch also
dubbed itself "skip scan", but that doesn't seem right to me (it's at
odds with how other RDBMSs describe features in this area). I would
like to address the issues with the terminology in this area now, to
avoid creating further confusion.

When I use the term "skip scan", I'm referring to a feature that's
comparable to the skip scan features from Oracle and MySQL 8.0+. This
*isn't* at all comparable to the feature that MySQL calls "loose index
scan" -- don't confuse the two features.

Loose index scan is a far more specialized technique than skip scan.
It only applies within special scans that feed into a DISTINCT group
aggregate. Whereas my skip scan patch isn't like that at all -- it's
much more general. With my patch, nbtree has exactly the same contract
with the executor/core code as before. There are no new index paths
generated by the optimizer to make skip scan work, even. Skip scan
isn't particularly aimed at improving group aggregates (though the
benchmark I'll show happens to involve a group aggregate, simply
because the technique works best with large and expensive index
scans).

My patch is an additive thing that speeds up what we'd currently
refer to as full index scans (as well as range index scans that
currently do a "full scan" of a range/subset of an index). These index
paths/index scans can no longer really be called "full index scans",
of course, but they're still logically the same index paths as before.

MDAM and skip scan
==================

As I touched on already, the patch actually implements a couple of
related optimizations. "Skip scan" might be considered one out of the
several optimizations from the 1995 paper "Efficient Search of
Multidimensional B-Trees" [1] -- the paper describes skip scan under
its "Missing Key Predicate" subsection. I collectively refer to the
optimizations from the paper as the "MDAM techniques".

Alternatively, you could define these MDAM techniques as each
implementing some particular flavor of skip scan, since they all do
rather similar things under the hood. In fact, that's how I've chosen
to describe things in my patch: it talks about skip scan, and about
range skip scan, which is considered a minor variant of skip scan.
(Note that the term "skip scan" is never used in the MDAM paper.)

MDAM is short for "multidimensional access method". In the context of
the paper, "dimension" refers to dimensions in a decision support
system. These dimensions are represented by low cardinality columns,
each of which appear in a large composite B-Tree index. The emphasis
in the paper (and for my patch) is DSS and data warehousing; OLTP apps
typically won't benefit as much.

Note: Loose index scan *isn't* described by the paper at all. I also
wouldn't classify loose index scan as one of the MDAM techniques. I
think of it as being in a totally different category, due to the way
that it applies semantic information. No MDAM technique will ever
apply high-level semantic information about what is truly required by
the plan tree, one level up. And so my patch simply teaches nbtree to
find the most efficient way of navigating through an index, based
solely on information that is readily available to the scan. The same
principles apply to all of the other MDAM techniques; they're
basically all just another flavor of skip scan (that do some kind of
clever preprocessing/transformation that enables reducing the scan to
a series of disjunctive accesses, and that could be implemented using
the new abstraction I'm calling skip arrays).

The paper more or less just applies one core idea, again and again.
It's surprising how far that one idea can take you. But it is still
just one core idea (don't overlook that point).

Range skip scan
---------------

To me, the most interesting MDAM technique is probably one that I
refer to as "range skip scan" in the patch. This is the technique that
the paper introduces first, in its "Intervening Range Predicates"
subsection. The best way of explaining it is through an example (you
could also just read the paper, which has an example of its own).

Imagine a table with just one index: a composite index on "(pdate,
customer_id)". Further suppose we have a query such as:

SELECT * FROM payments
WHERE pdate BETWEEN '2024-01-01' AND '2024-01-30'
AND customer_id = 5;
-- both index columns (pdate and customer_id) appear in the predicate

The patch effectively makes the nbtree code execute the index scan as
if the query had been written like this instead:

SELECT * FROM payments
WHERE pdate = ANY ('2024-01-01', '2024-01-02', ..., '2024-01-30')
AND customer_id = 5;
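
(For concreteness, the schema this example assumes would look something like
the following -- the exact column types are my guess:)

```
CREATE TABLE payments (pdate date, customer_id bigint);
CREATE INDEX payments_idx ON payments USING btree (pdate, customer_id);
```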

The use of a "range skip array" within nbtree allows the scan to skip
when that makes sense, locating the next date with customer_id = 5
each time (we might skip over many irrelevant leaf pages each time).
The scan must also *avoid* skipping when it *doesn't* make sense.

As always (since commit 5bf748b8 went in), whether and to what extent
we skip using array keys depends in large part on the physical
characteristics of the index at runtime. If the tuples that we need to
return are all clustered closely together, across only a handful of
leaf pages, then we shouldn't be skipping at all. When skipping makes
sense, we should skip constantly.

I'll discuss the trade-offs in this area a little more below, under "Design".

Using multiple MDAM techniques within the same index scan (includes benchmark)
------------------------------------------------------------------------------

I recreated the data in the MDAM paper's "sales" table by making
inferences from the paper. It's very roughly the same data set as the
paper (close enough to get the general idea across). The table size is
about 51GB, and the index is about 25GB (most of the attributes from
the table are used as index columns). There is nothing special about
this data set -- I just thought it would be cool to "recreate" the
queries from the paper, as best I could. I thought that this approach
might make my points about the design easier to follow.

The index we'll be using for this can be created via: "create index
mdam_idx on sales_mdam_paper(dept, sdate, item_class, store)". Again,
this is per the paper. It's also the order that the columns appear in
every WHERE clause in every query from the paper.

(That said, the particular column order from the index definition
mostly doesn't matter. Every index column is a low cardinality column,
so unless the order used completely obviates the need to skip a column
that would otherwise need to be skipped, such as "dept", the effect on
query execution time from varying column order is in the noise.
Obviously that's very much not how users are used to thinking about
composite indexes.)

The MDAM paper has numerous example queries, each of which builds on
the last, adding one more complication each time -- each of which is
addressed by another MDAM technique. The query I'll focus on here is
an example query that's towards the end of the paper, and so combines
multiple techniques together -- it's the query that appears in the "IN
Lists" subsection:

select
  dept,
  sdate,
  item_class,
  store,
  sum(total_sales)
from
  sales_mdam_paper
where
  -- omitted: leading "dept" column from composite index
  sdate between '1995-06-01' and '1995-06-30'
  and item_class in (20, 35, 50)
  and store in (200, 250)
group by dept, sdate, item_class, store
order by dept, sdate, item_class, store;

On HEAD, when we run this query we either get a sequential scan (which
is very slow) or a full index scan (which is almost as slow). Whereas
with the patch, nbtree will execute the query as a succession of a few
thousand very selective primitive index scans (which usually only scan
one leaf page, though some may scan two neighboring leaf pages).

Results: The full index scan on HEAD takes about 32 seconds. With the
patch, the query takes just under 52ms to execute. That works out to
be about 630x faster with the patch.

See the attached SQL file for full details. It provides all you'll
need to recreate this test result with the patch.

Nobody would put up with such an inefficient full index scan in the
first place, so the behavior on HEAD is not really a sensible baseline
-- 630x isn't very meaningful. I could have come up with a case that
showed an even larger improvement if I'd felt like it, but that
wouldn't have proven anything.

The important point is that the patch makes a huge composite index
like the one I've built for this actually make sense, for the first
time. So we're not so much making something faster as enabling a whole
new approach to indexing -- particularly for data warehousing use
cases. The way that Postgres DBAs choose which indexes they'll need to
create is likely to be significantly changed by this optimization.

I'll break down how this is possible. This query makes use of 3
separate MDAM techniques:

1. A "simple" skip scan (on "dept").

2. A "range" skip scan (on "sdate").

3. The pair of IN() lists/SAOPs on item_class and on store. (Nothing
new here, except that nbtree needs these regular SAOP arrays to roll
over the higher-order skip arrays to trigger moving on to the next
dept/date.)

Internally, we're just doing a series of several thousand distinct
non-overlapping accesses, in index key space order (so as to preserve
the appearance of one continuous index scan). These accesses start
out like this:

dept=INT_MIN, date='1995-06-01', item_class=20, store=200
(Here _bt_advance_array_keys discovers that the actual lowest dept
is 1, not INT_MIN)
dept=1, date='1995-06-01', item_class=20, store=200
dept=1, date='1995-06-01', item_class=20, store=250
dept=1, date='1995-06-01', item_class=35, store=200
dept=1, date='1995-06-01', item_class=35, store=250
...

(Side note: as I mentioned, each of the two "store" values usually
appears together on the same leaf page in practice. Arguably I should
have shown 2 lines/accesses here (for "dept=1"), rather than showing
4. The 4 "dept=1" lines shown required only 2 primitive index
scans/index descents/leaf page reads. Disjunctive accesses don't
necessarily map 1:1 with primitive/physical index scans.)

About another ten thousand similar accesses occur (omitted for
brevity). Execution of the scan within nbtree finally ends with these
primitive index scans/accesses:
...
dept=100, date='1995-06-30', item_class=50, store=250
dept=101, date='1995-06-01', item_class=20, store=200
STOP

There is no "dept=101" entry in the index (the highest department in
the index happens to be 100). The index scan therefore terminates at
this point, having run out of leaf pages to scan (we've reached the
rightmost point of the rightmost leaf page, as the scan attempts to
locate non-existent dept=101 tuples).

Design
======

Since index scans with skip arrays work just like index scans with
regular arrays (as of Postgres 17), naturally, there are no special
restrictions. Associated optimizer index paths have path keys, and so
could (just for example) appear in a merge join, or feed into a group
aggregate, while avoiding a sort node. Index scans that skip could
also feed into a relocatable cursor.

As I mentioned already, the patch adds a skipping mechanism that is
purely an additive thing. I think that this will turn out to be an
important enabler of using the optimizations, even when there's much
uncertainty about how much they'll actually help at runtime.

Optimizer
---------

We make a broad assumption that skipping is always to our advantage
during nbtree preprocessing -- preprocessing generates as many skip
arrays as could possibly make sense based on static rules (rules that
don't apply any kind of information about data distribution). Of
course, skipping isn't necessarily the best approach in all cases, but
that's okay. We only actually skip when physical index characteristics
show that it makes sense. The real decisions about skipping are all
made dynamically.

That approach seems far more practicable than preempting the problem
during planning or during nbtree preprocessing. It seems like it'd be
very hard to model the costs statistically. We need revisions to
btcostestimate, of course, but the less we can rely on btcostestimate
the better. As I said, there are no new index paths generated by the
optimizer for any of this.

What do you call an index scan where 90% of all index tuples are 1 of
only 3 distinct values, while the remaining 10% of index tuples are
all perfectly unique with respect to a leading column? Clearly the
best strategy when skipping using the leading column is to "use skip
scan for 90% of the index, and use a conventional range scan for the
remaining 10%". Skipping generally makes sense, but we legitimately need to vary
our strategy *during the same index scan*. It makes no sense to think
of skip scan as a discrete sort of index scan.
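
(Such a distribution is easy to construct for experimentation -- this is just
my sketch, not something from the patch:)

```
CREATE TABLE skewed (a int, b int);
-- 90% of rows share one of only 3 distinct leading values
INSERT INTO skewed SELECT i % 3, i FROM generate_series(1, 900_000) i;
-- the remaining 10% have perfectly unique leading values
INSERT INTO skewed SELECT 1_000_000 + i, i FROM generate_series(1, 100_000) i;
CREATE INDEX skewed_idx ON skewed USING btree (a, b);
```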

I have yet to prove that always having the option of skipping (even
when it's very unlikely to help) really does "come for free" -- for
now I'm just asserting that that's possible. I'll need proof. I expect
to hear some principled skepticism on this point. It's probably not
quite there in this v1 of the patch -- there'll be some regressions (I
haven't looked very carefully just yet). However, we seem to already
be quite close to avoiding regressions from excessive/useless
skipping.

Extensible infrastructure/support functions
-------------------------------------------

Currently, the patch only supports skip scan for a subset of all
opclasses -- those that have the required support function #6, or
"skip support" function. This provides the opclass with (among other
things) a way to increment the current skip array value (or to
decrement it, in the case of backward scans). In practice we only have
this for a handful of discrete integer (and integer-ish) types. Note
that the patch currently cannot skip over an index column that happens
to be text. Even this v1 does support scans whose quals use
unsupported types, though, provided that the input opclasses of the
specific columns we'll need to skip have support.

The patch should be able to support every type/opclass as a target for
skipping, regardless of whether an opclass support function happened
to be available. That could work by teaching the nbtree code to have
explicit probes for the next skip array value in the index, only then
combining that new value with the qual from the input scan keys/query.
I've put that off for now because it seems less important -- it
doesn't really affect anything I've said about the core design, which
is what I'm focussing on for now.

It makes sense to use increment/decrement whenever feasible, even
though it isn't strictly necessary (or won't be, once the patch has
the required explicit probe support). The only reason to not apply
increment/decrement opclass skip support (that I can see) is that
it just isn't practical (this is generally the case for continuous
types). While it's slightly onerous to have to invent all this new
opclass infrastructure, it definitely makes sense.

There is a performance advantage to having skip arrays that can
increment through each distinct possible indexable value (this
increment/decrement stuff comes from the MDAM paper). The MDAM
techniques inherently work best when "skipping" columns of discrete
types like integer and date, which is why the paper has examples that
all look like that. If you look at my example query and its individual
accesses, you'll realize why this is so.
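
(To illustrate: a discrete type has a cheap, well-defined successor, which is
exactly what a skip support routine exploits. A trivial sketch:)

```
-- Incrementing a date is trivial...
SELECT '1995-06-01'::date + 1;  -- 1995-06-02
-- ...but there is no equivalent "next value" for collated text
```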

Thoughts?

[1] https://vldb.org/conf/1995/P710.PDF
--
Peter Geoghegan

Attachment Content-Type Size
mdam_paper_in_query.sql application/octet-stream 23.7 KB
v1-0001-Add-skip-scan-to-nbtree.patch application/octet-stream 109.9 KB

From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-02 12:52:58
Message-ID: CAJ7c6TMx43RT7qDHWtqtfXtmd0iZOWyUOd8M2Ch-gvXNCOpXGQ@mail.gmail.com
Lists: pgsql-hackers

Hi Peter,

> Attached is a POC patch that adds skip scan to nbtree. The patch
> teaches nbtree index scans to efficiently use a composite index on
> '(a, b)' for queries with a predicate such as "WHERE b = 5". This is
> feasible in cases where the total number of distinct values in the
> column 'a' is reasonably small (think tens or hundreds, perhaps even
> thousands for very large composite indexes).
>
> [...]
>
> Thoughts?

Many thanks for working on this. I believe it is an important feature
and it would be great to deliver it during the PG18 cycle.

I experimented with the patch and here are the results I got so far.

Firstly, it was compiled on Intel macOS and ARM Linux. All the tests
pass just fine.

Secondly, I tested the patch manually using a release build on my
Raspberry Pi 5 and the GUCs that can be seen in [1].

Test 1 - simple one.

```
CREATE TABLE test1(c char, n bigint);
CREATE INDEX test1_idx ON test1 USING btree(c,n);

INSERT INTO test1
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
```

Test 2 - a more complicated one.

```
CREATE TABLE test2(c1 char, c2 char, n bigint);
CREATE INDEX test2_idx ON test2 USING btree(c1,c2,n);

INSERT INTO test2
SELECT chr(ascii('a') + random(0,2)) AS c1,
chr(ascii('a') + random(0,2)) AS c2,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test2 WHERE n > 900_000_000;
```

Test 3 - to see how it works with covering indexes.

```
CREATE TABLE test3(c char, n bigint, s text DEFAULT 'text_value' || n);
CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s);

INSERT INTO test3
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n,
'text_value_' || random(0, 1_000_000_000) AS s
FROM generate_series(0, 1_000_000);

EXPLAIN [ANALYZE] SELECT s FROM test3 WHERE n < 1000;
```

In all the cases the patch worked as expected.

I noticed that with the patch we choose Index Only Scans for Test 1
and without the patch - Parallel Seq Scan. However the Parallel Seq
Scan is 2.4 times faster. Before the patch the query takes 53 ms,
after the patch, 127 ms. I realize this could be just something
specific to my hardware and/or amount of data.

Do you think this is something that was expected or something worth
investigating further?

I haven't looked at the code yet.

[1]: https://github.com/afiskon/pgscripts/blob/master/single-install-meson.sh

--
Best regards,
Aleksander Alekseev


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-02 13:30:28
Message-ID: CAH2-WzmTZrntbtXMOEAhvb+4EDHTwJLcOxnUe0qFVHir-ZUHew@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 2, 2024 at 8:53 AM Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
> CREATE TABLE test1(c char, n bigint);
> CREATE INDEX test1_idx ON test1 USING btree(c,n);

The type "char" (note the quotes) is different from char(1). It just
so happens that v1 has support for skipping attributes that use the
default opclass for "char", without support for char(1).

If you change your table definition to CREATE TABLE test1(c "char", n
bigint), then your example queries can use the optimization. This
makes a huge difference.
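
Concretely (just your Test 1 setup, restated with the supported type):

```
CREATE TABLE test1(c "char", n bigint);  -- "char", not char(1)
CREATE INDEX test1_idx ON test1 USING btree(c, n);
```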

> EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;

For example, this first test query goes from needing a full index scan
that has 5056 buffer hits to a skip scan that requires only 12 buffer
hits.

> I noticed that with the patch we choose Index Only Scans for Test 1
> and without the patch - Parallel Seq Scan. However the Parallel Seq
> Scan is 2.4 times faster. Before the patch the query takes 53 ms,
> after the patch - 127 ms.

I'm guessing that it's actually much faster once you change the
leading column to the "char" type/default opclass.

> I realize this could be just something
> specific to my hardware and/or amount of data.

The selfuncs.c costing currently has a number of problems.

One problem is that it doesn't know that some opclasses/types don't
support skipping at all. That particular problem should be fixed on
the nbtree side; nbtree should support skipping regardless of the
opclass that the skipped attribute uses (while still retaining the new
opclass support functions for a subset of types where we expect it to
make skip scans somewhat faster).

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-02 13:40:24
Message-ID: CAH2-Wzk4_60s18U2wA1GfimtF4yz1QwhBtZ4FTYG1Ta_iOvTJQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
>
> For example, this first test query goes from needing a full index scan
> that has 5056 buffer hits to a skip scan that requires only 12 buffer
> hits.

Actually, looks like that's an invalid result. The "char" opclass
support function appears to have bugs.

My testing totally focussed on types like integer, date, and UUID. The
"char" opclass was somewhat of an afterthought. Will fix "char" skip
support for v2.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-02 16:25:51
Message-ID: CAH2-WzkfVQ57EtWGidSrQLK4sDAYT2K=kvqKbMDnj46n-A5HSQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 2, 2024 at 9:40 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Tue, Jul 2, 2024 at 9:30 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > > EXPLAIN [ANALYZE] SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
> >
> > For example, this first test query goes from needing a full index scan
> > that has 5056 buffer hits to a skip scan that requires only 12 buffer
> > hits.
>
> Actually, looks like that's an invalid result. The "char" opclass
> support function appears to have bugs.

Attached v2 fixes this bug. The problem was that the skip support
function used by the "char" opclass assumed signed char comparisons,
even though the authoritative B-Tree comparator (support function 1)
uses signed comparisons (via uint8 casting). A simple oversight. Your
test cases will work with this v2, provided you use "char" (instead of
unadorned char) in the create table statements.

Another small change in v2: I added a DEBUG2 message to nbtree
preprocessing, indicating the number of attributes that we're going to
skip. This provides an intuitive way to see whether the optimizations
are being applied in the first place. That should help to avoid
further confusion like this as the patch continues to evolve.

Support for char(1) doesn't seem feasible within the confines of a
skip support routine. Just like with text (which I touched on in the
introductory email), this will require teaching nbtree to perform
explicit next-key probes. An approach based on explicit probes is
somewhat less efficient in some cases, but it should always work. It's
impractical to write opclass support that (say) increments a char
value 'a' to 'b'. Making that approach work would require extensive
cooperation from the collation provider, and some knowledge of
encoding, which just doesn't make sense (if it's possible at all). I
don't have the problem with "char" because it isn't a collatable type
(it is essentially the same thing as a uint8 integer type, except
that it outputs printable ASCII characters).

FWIW, your test cases don't seem like particularly good showcases for
the patch. The queries you came up with require a relatively large
amount of random I/O when accessing the heap, which skip scan will
never help with -- so skip scan is a small win (at least relative to
an unoptimized full index scan). Obviously, no skip scan can ever
avoid any required heap accesses compared to a naive full index scan
(loose index scan *does* have that capability, which is possible only
because it applies semantic information in a way that's very
different).

FWIW, a more sympathetic version of your test queries would have
involved something like "WHERE n = 900_500_000". That would allow the
implementation to perform a series of *selective* primitive index
scans (one primitive index scan per "c" column/char grouping). That
change has the effect of allowing the scan to skip over many
irrelevant leaf pages, which is of course the whole point of skip
scan. It also makes the scan will require far fewer heap accesses, so
heap related costs no longer drown out the nbtree improvements.
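
That is, something like:

```
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test1 WHERE n = 900_500_000;
```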

--
Peter Geoghegan

Attachment Content-Type Size
v2-0001-Add-skip-scan-to-nbtree.patch application/octet-stream 112.6 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-02 16:55:59
Message-ID: CAH2-WznFUHMnkHXb6SEC2KgHoQmnpJQWp2PibPbsP4o5bH8y0A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 2, 2024 at 12:25 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached v2 fixes this bug. The problem was that the skip support
> function used by the "char" opclass assumed signed char comparisons,
> even though the authoritative B-Tree comparator (support function 1)
> uses unsigned comparisons (via uint8 casting). A simple oversight.

Although v2 gives correct answers to the queries, the scan itself
performs an excessive amount of leaf page accesses. In short, it
behaves just like a full index scan would, even though we should
expect it to skip over significant runs of the index. So that's
another bug.

It looks like the queries you posted have a kind of adversarial
quality to them, as if they were designed to confuse the
implementation. Was it intentional? Did you take them from an existing
test suite somewhere?

The custom instrumentation I use to debug these issues shows:

_bt_readpage: 🍀 1981 with 175 offsets/tuples (leftsib 4032, rightsib 3991) ➡️
_bt_readpage first: (c, n)=(b, 998982285), TID='(1236,173)',
0x7f1464fe9fc0, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173,
nmatching: 174 ✅
_bt_readpage: 🍀 3991 with 175 offsets/tuples (leftsib 1981, rightsib 9) ➡️
_bt_readpage first: (c, n)=(b, 999474517), TID='(4210,9)',
0x7f1464febfc8, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: 173,
nmatching: 174 ✅
_bt_readpage: 🍀 9 with 229 offsets/tuples (leftsib 3991, rightsib 3104) ➡️
_bt_readpage first: (c, n)=(c, 1606), TID='(882,68)', 0x7f1464fedfc0,
from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌
_bt_readpage: 🍀 3104 with 258 offsets/tuples (leftsib 9, rightsib 1685) ➡️
_bt_readpage first: (c, n)=(c, 706836), TID='(3213,4)',
0x7f1464feffc0, from non-pivot offnum 2 started page
_bt_readpage final: , (nil), continuescan high key check did not set
so->currPos.moreRight=false ➡️ 🟢
_bt_readpage stats: currPos.firstItem: 0, currPos.lastItem: -1, nmatching: 0 ❌
*** SNIP, many more "nmatching: 0" pages appear after these two ***

The final _bt_advance_array_keys call for leaf page 3991 should be
scheduling a new primitive index scan (i.e. skipping), but that never
happens. Not entirely sure why that is, but it probably has something
to do with _bt_advance_array_keys failing to hit the
"has_required_opposite_direction_only" path for determining if another
primitive scan is required. You're using an inequality required in the
opposite-to-scan-direction here, so that path is likely to be
relevant.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-02 17:09:27
Message-ID: CAH2-WzkijWY9axmHvpFjxV-6A6o7=OXuDk5gDDpdqCjTvuRpXA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 2, 2024 at 12:55 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Although v2 gives correct answers to the queries, the scan itself
> performs an excessive amount of leaf page accesses. In short, it
> behaves just like a full index scan would, even though we should
> expect it to skip over significant runs of the index. So that's
> another bug.

Hit "send" too soon. I simply forgot to run "alter table test1 alter
column c type "char";" before running the query. So, I was mistaken
about there still being a bug in v2. The issue here is that we don't
have support for the underlying type, char(1) -- nothing more.

v2 of the patch with your query 1 (when changed to use the "char"
type/opclass instead of the currently unsupported char(1)
type/opclass) performs 395 index related buffer hits, and 5406 heap
block accesses. Whereas it's 3833 index buffer hits with master
(naturally, the same 5406 heap accesses are required with master). In
short, this query isn't particularly sympathetic to the patch. Nor is
it unsympathetic.

--
Peter Geoghegan


From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-05 11:03:51
Message-ID: CAJ7c6TP2N9Dv06NkbFXYKOTaZuQcdiSf6PSKjbAPizzV3p1pfA@mail.gmail.com
Lists: pgsql-hackers

Hi Peter,

> It looks like the queries you posted have a kind of adversarial
> quality to them, as if they were designed to confuse the
> implementation. Was it intentional?

To some extent. I merely wrote several queries that I would expect to
benefit from skip scans. Since I didn't look at the queries you used,
there was a chance that I would hit something interesting.

> Attached v2 fixes this bug. The problem was that the skip support
> function used by the "char" opclass assumed signed char comparisons,
> even though the authoritative B-Tree comparator (support function 1)
> uses unsigned comparisons (via uint8 casting). A simple oversight. Your
> test cases will work with this v2, provided you use "char" (instead of
> unadorned char) in the create table statements.

Thanks for v2.

> If you change your table definition to CREATE TABLE test1(c "char", n
> bigint), then your example queries can use the optimization. This
> makes a huge difference.

You are right, it does.

Test1 takes 33.7 ms now (53 ms before the patch, x1.57)

Test3 I showed before contained an error in the table definition
(Postgres can't do `n bigint, s text DEFAULT 'text_value' || n`). Here
is the corrected test:

```
CREATE TABLE test3(c "char", n bigint, s text);
CREATE INDEX test3_idx ON test3 USING btree(c,n) INCLUDE(s);

INSERT INTO test3
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n,
'text_value_' || random(0, 1_000_000_000) AS s
FROM generate_series(0, 1_000_000);

EXPLAIN ANALYZE SELECT s FROM test3 WHERE n < 10_000;
```

It runs fast (< 1 ms) and uses the index, as expected.

Test2 with "char" doesn't seem to benefit from the patch anymore
(pretty sure it did in v1). It always chooses Parallel Seq Scans even
if I change the condition to `WHERE n > 999_995_000` or `WHERE n =
999_997_362`. Is this expected behavior?

I also tried Test4 and Test5.

In Test4 I was curious whether skip scans work properly with functional indexes:

```
CREATE TABLE test4(d date, n bigint);
CREATE INDEX test4_idx ON test4 USING btree(extract(year from d),n);

INSERT INTO test4
SELECT ('2024-' || random(1,12) || '-' || random(1,28)) :: date AS d,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM test4 WHERE n > 900_000_000;
```

The query uses an Index Scan; however, the performance is worse than with
the Seq Scan chosen before the patch. It doesn't matter whether I choose
the '>' or '=' condition.

Test5 checks how skip scans work with partial indexes:

```
CREATE TABLE test5(c "char", n bigint);
CREATE INDEX test5_idx ON test5 USING btree(c, n) WHERE n > 900_000_000;

INSERT INTO test5
SELECT chr(ascii('a') + random(0,2)) AS c,
random(0, 1_000_000_000) AS n
FROM generate_series(0, 1_000_000);

EXPLAIN ANALYZE SELECT COUNT(*) FROM test5 WHERE n > 950_000_000;
```

It runs fast and chooses Index Only Scan. But then I discovered that
without the patch Postgres also uses Index Only Scan for this query. I
didn't know it could do this - what is the name of this technique? The
query takes 17.6 ms with the patch, 21 ms without the patch. Not a
huge win but still.

That's all I have for now.

--
Best regards,
Aleksander Alekseev


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-06 00:44:50
Message-ID: CAH2-Wzk6Ah4YZgA+AW7fZMRzT=RqHL9SO=r4nzaWapPKr1OXnw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 5, 2024 at 7:04 AM Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
> Test2 with "char" doesn't seem to benefit from the patch anymore
> (pretty sure it did in v1). It always chooses Parallel Seq Scans even
> if I change the condition to `WHERE n > 999_995_000` or `WHERE n =
> 999_997_362`. Is it an expected behavior?

The "char" opclass's skip support routine was totally broken in v1, so
its performance isn't really relevant. In any case v2 didn't make any
changes to the costing, so I'd expect it to use exactly the same query
plan as v1.

> The query uses Index Scan, however the performance is worse than with
> Seq Scan chosen before the patch. It doesn't matter if I choose '>' or
> '=' condition.

That's because the index has a leading/skipped column of type
"numeric", which isn't a supported type just yet (a supported B-Tree
opclass, actually).

The optimization is effective if you create the expression index with
a cast to integer:

CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n);

This performs much better. Now I see "DEBUG: skipping 1 index
attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT
COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the
optimization has in fact been used as expected. There are far fewer
buffers hit with this version of your test4, which also indicates that
the optimization has been effective.

Note that the original numeric expression index test4 showed "DEBUG:
skipping 0 index attributes" when the test query ran, which indicated
that the optimization couldn't be used. I suggest that you look out
for that, by running "set client_min_messages to debug2;" from psql
when testing the patch.

> It runs fast and choses Index Only Scan. But then I discovered that
> without the patch Postgres also uses Index Only Scan for this query. I
> didn't know it could do this - what is the name of this technique?

It is a full index scan. These have been possible for many years now
(possibly well over 20 years).

Arguably, the numeric case that didn't use the optimization (your
test4) should have been costed as a full index scan, but it wasn't --
that's why you didn't get a faster sequential scan, which would have
made a little bit more sense. In general, the costing changes in the
patch are very rough.

That said, this particular problem (the test4 numeric issue) should be
fixed by inventing a way for nbtree to use skip scan with types that
lack skip support. It's not primarily a problem with the costing. At
least not in my mind.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-06 00:57:58
Message-ID: CAH2-WzmBX+4PUhK76BbSKCBgsAdnw25-weYjJ48-_ZwEnfAA3Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 5, 2024 at 8:44 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> CREATE INDEX test4_idx ON test4 USING btree(((extract(year from d))::int4),n);
>
> This performs much better. Now I see "DEBUG: skipping 1 index
> attributes" when I run the query "EXPLAIN (ANALYZE, BUFFERS) SELECT
> COUNT(*) FROM test4 WHERE n > 900_000_000", which indicates that the
> optimization has in fact been used as expected. There are far fewer
> buffers hit with this version of your test4, which also indicates that
> the optimization has been effective.

Actually, with an index-only scan it is 281 buffer hits (including
some small number of VM buffer hits) with the patch, versus 2736
buffer hits on master. So a big change to the number of index page
accesses only.

If you use a plain index scan for this, then the cost of random heap
accesses totally dominates, so skip scan cannot possibly give much
benefit. Even a similar bitmap scan requires 4425 distinct heap page accesses,
which is significantly more than the total number of index pages in
the index. 4425 heap pages is almost the entire table; the table
consists of 4480 mainfork blocks.

This is a very nonselective query. It's not at all surprising that
this query (and others like it) hardly benefit at all, except when we
can use an index-only scan (so that the cost of heap accesses doesn't
totally dominate).

--
Peter Geoghegan


From: <Masahiro(dot)Ikeda(at)nttdata(dot)com>
To: <pg(at)bowt(dot)ie>, <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: <Masao(dot)Fujii(at)nttdata(dot)com>
Subject: RE: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-12 05:18:56
Message-ID: TYWPR01MB10982A413E0EC4088E78C0E11B1A62@TYWPR01MB10982.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Hi,

Since I'd like to understand skip scan in order to improve the EXPLAIN
output for multicolumn B-Tree indexes [1], I tried skip scan with some
queries and looked into the source code.

I have some feedback and comments.

(1)

At first, I was surprised by your benchmark result, because skip scan
can improve performance so much. I agree that many users will be happy
with this feature, especially for OLAP use cases. I look forward to
using it in v18.

(2)

I found that the cost is estimated to be much higher if two or more
attributes are skipped. Is this expected behavior?

# Test result. The attached file is the detail of tests.

-- Index Scan
-- The actual time is low since the skip scan works well
-- But the cost is higher than that of the seqscan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
Output: id1, id2, id3, value
Index Cond: (test.id3 = 101)
Buffers: shared hit=4402
Planning:
Buffers: shared hit=7
Planning Time: 0.234 ms
Execution Time: 15.711 ms
(8 rows)

-- Seq Scan
-- The actual time is high, but the cost is lower than that of the above Index Scan.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..12676.73 rows=984 width=20) (actual time=0.856..113.861 rows=991 loops=1)
Output: id1, id2, id3, value
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=6370
-> Parallel Seq Scan on public.test (cost=0.00..11578.33 rows=410 width=20) (actual time=0.061..102.016 rows=330 loops=3)
Output: id1, id2, id3, value
Filter: (test.id3 = 101)
Rows Removed by Filter: 333003
Buffers: shared hit=6370
Worker 0: actual time=0.099..98.014 rows=315 loops=1
Buffers: shared hit=2066
Worker 1: actual time=0.054..97.162 rows=299 loops=1
Buffers: shared hit=1858
Planning:
Buffers: shared hit=19
Planning Time: 0.194 ms
Execution Time: 114.129 ms
(18 rows)

I looked at btcostestimate() to find the reason, and found that the bound
quals and cost.num_sa_scans are different from what I expected.

My assumption is
* bound quals is id3=XXX (and id1 and id2 are skipped attributes)
* cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans
per skipped attribute)

But it's wrong. The above index scan result is
* bound quals is NULL
* cost.num_sa_scans = 1

I know you said the below, but I'd like to know whether the above is expected or not.

> That approach seems far more practicable than preempting the problem
> during planning or during nbtree preprocessing. It seems like it'd be
> very hard to model the costs statistically. We need revisions to
> btcostestimate, of course, but the less we can rely on btcostestimate
> the better. As I said, there are no new index paths generated by the
> optimizer for any of this.

I couldn't understand the reason for the logic below.

btcostestimate()
(...omit...)
    if (indexcol != iclause->indexcol)
    {
        /* no quals at all for indexcol */
        found_skip = true;
        if (index->pages < 100)
            break;
        num_sa_scans += 10 * (indexcol - iclause->indexcol); // why might this add a negative value?
        continue; // why skip adding bound quals?
    }

(3)

Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is any chance to use pg_stats.n_distinct?

[1] Improve EXPLAIN output for multicolumn B-Tree Index
https://www.postgresql.org/message-id/flat/TYWPR01MB1098260B694D27758FE2BA46FB1C92%40TYWPR01MB10982.jpnprd01.prod.outlook.com

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment Content-Type Size
compare_cost.sql application/octet-stream 2.3 KB
compare_cost_with_v2_patch.out application/octet-stream 9.3 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-15 18:34:38
Message-ID: CAH2-Wz=KtVs1zBAnxvVWPq5CyTpO_f+eskOBP88OsF4sOE4M_A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 12, 2024 at 1:19 AM <Masahiro(dot)Ikeda(at)nttdata(dot)com> wrote:
> Since I'd like to understand the skip scan to improve the EXPLAIN output
> for multicolumn B-Tree Index[1], I began to try the skip scan with some
> queries and look into the source code.

Thanks for the review!

Attached is v3, which generalizes skip scan, allowing it to work with
opclasses/types that lack a skip support routine. In other words, v3
makes skip scan work for all types, including continuous types, where
it's impractical or infeasible to add skip support. So now important
types like text and numeric also get the skip scan optimization (it's
not just discrete types like integer and date, as in previous
versions).

I feel very strongly that everything should be implemented as part of
the new skip array abstraction; the patch should only add the concept
of skip arrays, which should work just like SAOP arrays. We should
avoid introducing any special cases. In short, _bt_advance_array_keys
should work in exactly the same way as it does as of Postgres 17
(except for a few representational differences for skip arrays). This
seems essential because _bt_advance_array_keys inherently needs to be
able to trigger moving on to the next skip array value when it reaches
the end of a SAOP array (and vice-versa). And so it just makes sense
to abstract-away the differences, hiding the difference in lower level
code.

I have described the new _bt_first behavior that is now available in
this new v3 of the patch as "adding explicit next key probes". While
v3 does make new changes to _bt_first, they don't really amount to a
special kind of index probe; v3 invents new sentinel values instead.

The use of sentinels avoids inventing true special cases: the values
-inf, +inf, as well as variants of = that use a real datum value, but
match on the next key in the index. These new = variants can be
thought of as "+infinitesimal" values. So when _bt_advance_array_keys
has to "increment" the numeric value 5.0, it sets the scan key to the
value "5.0 +infinitesimal". There can never be any matching tuples in
the index (just like with -inf sentinel values), but that doesn't
matter. So the changes v3 makes to _bt_first don't change the basic
conceptual model. The added complexity is kept to a manageable level,
particularly within _bt_advance_array_keys, which is already very
complicated.

To help with testing and review, I've added another temporary testing
GUC to v3: skipscan_skipsupport_enabled. This can be set to "false" to
avoid using skip support, even where available. The GUC makes it easy
to measure how skip support routines can help performance (with
discrete types like integer and date).
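
For example (a sketch that reuses the test1 table from upthread):

```
SET skipscan_skipsupport_enabled = off;  -- rely on explicit next-key probes
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
SET skipscan_skipsupport_enabled = on;   -- use skip support where available
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM test1 WHERE n > 900_000_000;
```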

> I found the cost is estimated to much higher if the number of skipped attributes
> is more than two. Is it expected behavior?

Yes and no.

Honestly, the current costing is just placeholder code. It is totally
inadequate. I'm not surprised that you found problems with it. I just
didn't put much work into it, because I didn't really know what to do.

> # Test result. The attached file is the detail of tests.
>
> -- Index Scan
> -- The actual time is low since the skip scan works well
> -- But the cost is higher than one of seqscan
> EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using idx_id1_id2_id3 on public.test (cost=0.42..26562.77 rows=984 width=20) (actual time=0.051..15.533 rows=991 loops=1)
> Output: id1, id2, id3, value
> Index Cond: (test.id3 = 101)
> Buffers: shared hit=4402
> Planning:
> Buffers: shared hit=7
> Planning Time: 0.234 ms
> Execution Time: 15.711 ms
> (8 rows)

This is a useful example, because it shows the difficulty with the
costing. I ran this query using my own custom instrumentation of the
scan. I saw that we only ever manage to skip ahead by perhaps 3 leaf
pages at a time, but we still come out ahead. As you pointed out, it's
~7.5x faster than the sequential scan, but not very different to the
equivalent full index scan. At least not very different in terms of
leaf page accesses. Why should we win by this much, for what seems
like a marginal case for skip scan?

Even cases where "skipping" doesn't manage to skip any leaf pages can
still benefit from skipping *index tuples* -- there is more than one
kind of skipping to consider. That is, the patch helps a lot with some
(though not all) cases where I didn't really expect that to happen:
the Postgres 17 SAOP tuple skipping code (the code in
_bt_checkkeys_look_ahead, and the related code in _bt_readpage) helps
quite a bit in "marginal" skip scan cases, even though it wasn't
really designed for that purpose (it was added to avoid regressions in
SAOP array scans for the Postgres 17 work).

I find that some queries using my original example test case are about
twice as fast as an equivalent full index scan, even when only the
fourth and final index column is used in the query predicate. The scan
can't even skip a single leaf page at a time, and yet we still win by
a nice amount. We win, though it is almost by mistake!

This is mostly a good thing. Both for the obvious reason (fast is
better than slow), and because it justifies being so aggressive in
assuming that skip scan might work out during planning (being wrong
without really losing is nice). But there is also a downside: it makes
it even harder to model costs at runtime, from within the optimizer.

If I measure the actual costs other than runtime (e.g., buffer
accesses), I'm not sure that the optimizer is wrong to think
that the parallel sequential scan is faster. It looks approximately
correct. It is only when we look at runtime that the optimizer's
choice looks wrong. Which is...awkward.

In general, I have very little idea about how to improve the costing
within btcostestimate. I am hoping that somebody has better ideas
about it. btcostestimate is definitely the area where the patch is
weakest right now.

> I look at btcostestimate() to find the reason and found the bound quals
> and cost.num_sa_scans are different from my expectation.
>
> My assumption is
> * bound quals is id3=XXX (and id1 and id2 are skipped attributes)
> * cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans
> per skipped attribute)
>
> But it's wrong. The above index scan result is
> * bound quals is NULL
> * cost.num_sa_scans = 1

The logic with cost.num_sa_scans was definitely not what I intended.
That's fixed in v3, at least. But the code in btcostestimate is still
essentially the same as in earlier versions -- it needs to be
completely redesigned (or, uh, designed for the first time).

> As I know you said the below, but I'd like to know the above is expected or not.

> Currently, there is an assumption that "there will be 10 primitive index scans
> per skipped attribute". Is any chance to use pg_stats.n_distinct?

It probably makes sense to use pg_stats.n_distinct here. But how?

If the problem is that we're too pessimistic, then I think that this
will usually (though not always) make us more pessimistic. Isn't that
the wrong direction to go in? (We're probably also too optimistic in
some cases, but being too pessimistic is a bigger problem in
practice.)

For example, your test case involved 11 distinct values in each
column. The current approach of hard-coding 10 (which is just a
temporary hack) should actually make the scan look a bit cheaper than
it would if we used the true ndistinct.

Another underlying problem is that the existing SAOP costing really
isn't very accurate, without skip scan -- that's a big source of the
pessimism with arrays/skipping. Why should we be able to get the true
number of primitive index scans just by multiplying together each
omitted prefix column's ndistinct? That approach is good for getting
the worst case, which is probably relevant -- but it's probably not a
very good assumption for the average case. (Though at least we can cap
the total number of primitive index scans to 1/3 of the total number
of pages in the index in btcostestimate, since we have guarantees
about the worst case as of Postgres 17.)

--
Peter Geoghegan

Attachment Content-Type Size
v3-0001-Add-skip-scan-to-nbtree.patch application/octet-stream 127.5 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-07-24 21:14:54
Message-ID: CAH2-Wzny9xh4Zw+gnf=TNjwXzVxEEQ24=murMM03h0a4+EMT+A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 15, 2024 at 2:34 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v3, which generalizes skip scan, allowing it to work with
> opclasses/types that lack a skip support routine. In other words, v3
> makes skip scan work for all types, including continuous types, where
> it's impractical or infeasible to add skip support.

Attached is v4, which:

* Fixes a previous FIXME item affecting range skip scans/skip arrays
used in cross-type scenarios.

* Refactors and simplifies the handling of range inequalities
associated with skip arrays more generally. We now always use
inequality scan keys during array advancement (and when descending the
tree within _bt_first), rather than trying to use a datum taken from
the range inequality as an array element directly.

This gives us cleaner separation between scan keys/data types in
cross-type scenarios: skip arrays will now only ever contain
"elements" of opclass input type. Sentinel values such as -inf are
expanded to represent "the lowest possible value that comes after the
array's low_compare lower bound, if any". Opclasses that don't offer
skip support took roughly this same approach within v3, but in v4 all
opclasses do it the same way (so opclasses with skip support use the
SK_BT_NEG_INF sentinel marking in their scan keys, though never the
SK_BT_NEXTKEY sentinel marking).

This is really just a refactoring revision. Nothing particularly
exciting here compared to v3.

--
Peter Geoghegan

Attachment Content-Type Size
v4-0001-Add-skip-scan-to-nbtree.patch application/octet-stream 128.7 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-08-03 19:34:54
Message-ID: 6loebaft2tpdwlvdzojuev5i5mcer7rituumiueocrsrripixa@pbfj47yl3rll
Lists: pgsql-hackers

> On Wed, Jun 26, 2024 at 03:16:07PM GMT, Peter Geoghegan wrote:
>
> Loose index scan is a far more specialized technique than skip scan.
> It only applies within special scans that feed into a DISTINCT group
> aggregate. Whereas my skip scan patch isn't like that at all -- it's
> much more general. With my patch, nbtree has exactly the same contract
> with the executor/core code as before. There are no new index paths
> generated by the optimizer to make skip scan work, even. Skip scan
> isn't particularly aimed at improving group aggregates (though the
> benchmark I'll show happens to involve a group aggregate, simply
> because the technique works best with large and expensive index
> scans).

I see that the patch is not supposed to deal with aggregates in any special
way. But from what I understand after a quick review, skip scan is not getting
applied to them if there are no quals in the query (in that case
_bt_preprocess_keys returns before calling _bt_preprocess_array_keys). Yet such
queries could benefit from skipping; I assume they could still be handled by
the machinery introduced in this patch?
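
For example (a sketch, with a hypothetical table "t" that has a
composite index on (a, b)):

-- no quals at all, so _bt_preprocess_keys returns early and no skip
-- arrays get built, even though skipping duplicate 'a' values could
-- pay off here
select distinct a from t;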

> > Currently, there is an assumption that "there will be 10 primitive index scans
> > per skipped attribute". Is any chance to use pg_stats.n_distinct?
>
> It probably makes sense to use pg_stats.n_distinct here. But how?
>
> If the problem is that we're too pessimistic, then I think that this
> will usually (though not always) make us more pessimistic. Isn't that
> the wrong direction to go in? (We're probably also too optimistic in
> some cases, but being too pessimistic is a bigger problem in
> practice.)
>
> For example, your test case involved 11 distinct values in each
> column. The current approach of hard-coding 10 (which is just a
> temporary hack) should actually make the scan look a bit cheaper than
> it would if we used the true ndistinct.
>
> Another underlying problem is that the existing SAOP costing really
> isn't very accurate, without skip scan -- that's a big source of the
> pessimism with arrays/skipping. Why should we be able to get the true
> number of primitive index scans just by multiplying together each
> omitted prefix column's ndistinct? That approach is good for getting
> the worst case, which is probably relevant -- but it's probably not a
> very good assumption for the average case. (Though at least we can cap
> the total number of primitive index scans to 1/3 of the total number
> of pages in the index in btcostestimate, since we have guarantees
> about the worst case as of Postgres 17.)

Do I understand correctly that the only way multiplying ndistincts could
produce overly pessimistic results is when there is a correlation between
distinct values? Can one benefit from the extended statistics here?

And while we're at it, I think it would be great if the implementation would
allow some level of visibility into the skip scan. From what I see, currently
it's by design impossible for users to tell whether something was skipped or
not. But when it comes to planning and estimates, maybe it's not a bad idea to
let explain analyze show something like "expected number of primitive scans /
actual number of primitive scans".


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-08-03 22:14:18
Message-ID: CAH2-Wz=mS3JBHJK3w0vj8Bc2qTknUoK=VWFHF2MUHtomzDHBXA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 3, 2024 at 3:34 PM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> I see that the patch is not supposed to deal with aggregates in any special
> way.

Right.

> But from what I understand after a quick review, skip scan is not getting
> applied to them if there are no quals in the query (in that case
> _bt_preprocess_keys returns before calling _bt_preprocess_array_keys).

Right.

> Yet such queries could benefit from skipping; I assume they could still be handled by
> the machinery introduced in this patch?

I'm not sure.

There are no real changes required inside _bt_advance_array_keys with
this patch -- skip arrays are dealt with in essentially the same way
as conventional arrays (as of Postgres 17). I suspect that loose index
scan would be best implemented using _bt_advance_array_keys. It could
also "plug in" to the existing _bt_advance_array_keys design, I
suppose.

As I touched on already, your loose index scan patch applies
high-level semantic information in a way that is very different to my
skip scan patch. This means that it makes revisions to the index AM
API (if memory serves it adds a callback called amskip to that API).
It also means that loose index scan can actually avoid heap accesses;
loose scans wholly avoid accessing logical rows (in both the index and
the heap) by reasoning that it just isn't necessary to do so at all.
Skipping happens in both data structures. Right?

Obviously, my new skip scan patch cannot possibly reduce the number of
heap page accesses required by a given index scan. Precisely the same
logical rows must be accessed as before. There is no two-way
conversation between the index AM and the table AM about which
rows/row groupings have at least one visible tuple. We're just
navigating through the index more efficiently, without changing any
contract outside of nbtree itself.

The "skip scan" name collision is regrettable. But the fact is that
Oracle, MySQL, and now SQLite all call this feature skip scan. That
feels like the right precedent to follow.

> Do I understand correctly that the only way multiplying ndistincts could
> produce overly pessimistic results is when there is a correlation between
> distinct values?

Yes, that's one problem with the costing. Not the only one, though.

The true number of primitive index scans depends on the cardinality of
the data. For example, a skip scan might be the cheapest plan by far
if (say) 90% of the index has the same leading column value and the
remaining 10% has totally unique values. We'd still do a bad job of
costing this query with an accurate ndistinct for the leading column.
We really only need to do one or two primitive index scans for "the
first 90% of the index", and one more primitive index scan for "the
remaining 10% of the index". For a query such as this, we "require a
full index scan for the remaining 10% of the index", which is
suboptimal, but doesn't fundamentally change anything (I guess that a
skip scan is always suboptimal, in the sense that you could always do
better by having more indexes).
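
Here is a sketch of that kind of data distribution (all names made up):

create table skewed (a int4, b int4);
-- 90% of rows share a single leading column value...
insert into skewed select 1, g from generate_series(1, 900000) g;
-- ...while the remaining 10% have totally unique leading values
insert into skewed select g, g from generate_series(2, 100001) g;
create index skewed_a_b on skewed (a, b);
vacuum analyze skewed;

ndistinct for "a" is ~100,001 here, which makes the costing look
terrible, even though something like "where b = 42" only requires a
handful of real index descents.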

> Can one benefit from the extended statistics here?

I really don't know. Certainly seems possible in cases with more than
one skipped leading column.

The main problem with the costing right now is that it's just not very
well thought through, in general. The performance at runtime depends
on the layout of values in the index itself, so the underlying way
that you'd model the costs doesn't have any great precedent in
costsize.c. We do have some idea of the number of leaf pages we'll
access in btcostestimate(), but that works in a way that isn't really
fit for purpose. It kind of works with one primitive index scan, but
works much less well with multiple primitive scans.

> And while we're at it, I think it would be great if the implementation would
> allow some level of visibility into the skip scan. From what I see, currently
> it's by design impossible for users to tell whether something was skipped or
> not. But when it comes to planning and estimates, maybe it's not a bad idea to
> let explain analyze show something like "expected number of primitive scans /
> actual number of primitive scans".

I agree. I think that that's pretty much mandatory for this patch. At
least the actual number of primitive scans should be exposed. Not
quite as sure about showing the estimated number, since that might be
embarrassingly wrong quite regularly, without it necessarily mattering
that much (I'd worry that it'd be distracting).

Displaying the number of primitive scans would already be useful for
index scans with SAOPs, even without this patch. The same general
concepts (estimated vs. actual primitive index scans) already exist,
as of Postgres 17. That's really nothing new.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-08-03 22:21:43
Message-ID: CAH2-WznwP8kiUOhvJgw+4ap1-X=LX92WrpyjFwy4zHebA1Sy+Q@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 3, 2024 at 6:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Displaying the number of primitive scans would already be useful for
> index scans with SAOPs, even without this patch. The same general
> concepts (estimated vs. actual primitive index scans) already exist,
> as of Postgres 17. That's really nothing new.

We actually expose this via instrumentation, in a certain sense. This
is documented by a "Note":

https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-ALL-INDEXES-VIEW

That is, we already say "Each internal primitive index scan increments
pg_stat_all_indexes.idx_scan, so it's possible for the count of index
scans to significantly exceed the total number of index scan executor
node executions". So, as I said in the last email, advertising the
difference between # of primitive index scans and # of index scan
executor node executions in EXPLAIN ANALYZE is already a good idea.
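
For instance (a sketch, with a hypothetical index name):

-- idx_scan counts primitive index scans, so a single execution of a
-- query with "WHERE a = ANY ('{1, 2, 3}')" can increment it by up to 3
select relname, indexrelname, idx_scan
from pg_stat_all_indexes
where indexrelname = 'my_test_idx';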

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-08-09 21:13:28
Message-ID: CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 24, 2024 at 5:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v4

Attached is v5, which splits the code from the v4 patch into 2 pieces --
it becomes 0002-* and 0003-*. Certain refactoring work now appears
under its own separate patch/commit -- see 0002-* (nothing new here,
except the commit message/patch structure). The patch that actually
adds skip scan (0003-* in this new version) has been further polished,
though not in a way that I think is interesting enough to go into
here.

The interesting and notable change for v5 is the addition of the code
in 0001-*. The new 0001-* patch is concerned with certain aspects of
how _bt_advance_array_keys decides whether to start another primitive
index scan (or to stick with the ongoing one for one more leaf page
instead). This is a behavioral change, albeit a subtle one. It's also
kinda independent of skip scan (more on why that is at the end).

It's easiest to explain why 0001-* matters by way of an example. My
example will show significantly more internal/root page accesses than
seen on master, though only when 0002-* and 0003-* are applied, and
0001-* is omitted. When all 3 v5 patches are applied together, the
total number of index pages accessed by the test query will match the
master branch. It's important that skip scan never loses by much to
the master branch, of course. Even when the details of the index/scan
are inconvenient to the implementation, in whatever way.

Setup:

create table demo (a int4, b numeric);
create index demo_idx on demo (a, b);
insert into demo select a, random() from generate_series(1, 10000) a,
generate_series(1,5) five_rows_per_a_val;
vacuum demo;

We now have a btree index "demo_idx", which has two levels (a root
page plus a leaf level). The root page contains several hundred pivot
tuples, all of which have their "b" value truncated away (or have the
value -inf, if you prefer), with just one prefix "a" column left in
place. Naturally, every leaf page has a high key with its own
separator key that matches one particular tuple that appears in the
root page (except for the rightmost leaf page). So our leaf level scan
will see lots of truncated leaf page high keys (all matching a
corresponding root page tuple).

Test query:

select a from demo where b > 0.99;

This is a query that really shouldn't be doing any skipping at all. We
nevertheless still see a huge amount of skipping with this query, once
0001-* is omitted. Prior to 0001-*, a new primitive index scan is
started whenever the scan reaches a "boundary" between adjoining leaf
pages. That is, whenever _bt_advance_array_keys stopped on a high key
pstate.finaltup. So without the new 0001-* work, the number of page
accesses almost doubles (because we access the root page once per leaf
page accessed, instead of just accessing it once for the whole scan).

What skip scan should have been doing all along (and will do now) is
to step forward to the next right sibling leaf page whenever it
reaches a boundary between leaf pages. This should happen again and
again, without our ever choosing to start a new primitive index scan
instead (it shouldn't happen even once with this query). In other
words, we ought to behave just like a full index scan would behave
with this query -- which is exactly what we get on master.

The scan will still nominally "use skip scan" even with this fix in
place, but in practice, for this particular query/index, the scan
won't ever actually decide to skip. So it at least "looks like" an
index scan from the point of view of EXPLAIN (ANALYZE, BUFFERS). There
is a separate question of how many CPU cycles we use to do all this,
but for now my focus is on total pages accessed by the patch versus on
master, especially for adversarial cases such as this.

It should be noted that the skip scan patch never had any problems
with this very similar query (same table as before):

select a from demo where b < 0.01;

The fact that we did the wrong thing for the first query, but the
right thing for this second similar query, was solely due to certain
accidental implementation details -- it had nothing to do with the
fundamentals of the problem. You might even say that 0001-* makes the
original "b > 0.99" case behave in the same manner as this similar "b
< 0.01" case, which is justifiable on consistency grounds. Why
wouldn't these two cases behave similarly? It's only logical.

The underlying problem arguably has little to do with skip scan;
whether we use a real SAOP array on "a" or a consed up skip array is
incidental to the problem that my example highlights. As always, the
underlying "array type" (skip vs SOAP) only matters to the lowest
level code. And so technically, this is an existing issue on
HEAD/master. You can see that for yourself by making the problematic
query's qual "where a = any ('every possible a value') and b > 0.99"
-- same problem on Postgres 17, without involving skip scan.
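
Concretely, a sketch against the same demo table:

-- Postgres 17, no skip scan involved: an explicit array stands in for
-- the consed up skip array, and the root page gets accessed once per
-- leaf page in just the same way
select a from demo
where a = any ((select array_agg(a) from generate_series(1, 10000) a))
  and b > 0.99;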

To be sure, the underlying problem does become more practically
relevant with the invention of skip arrays for skip scan, but 0001-*
can still be treated as independent work. It can be committed well
ahead of the other stuff IMV. The same is likely also true of the
refactoring now done in 0002-* -- it does refactoring that makes
sense, even without skip scan. And so I don't expect it to take all
that long for it to be committable.

--
Peter Geoghegan

Attachment Content-Type Size
v5-0001-Normalize-nbtree-truncated-high-key-array-behavio.patch application/octet-stream 13.1 KB
v5-0003-Add-skip-scan-to-nbtree.patch application/octet-stream 115.5 KB
v5-0002-Refactor-handling-of-nbtree-array-redundancies.patch application/octet-stream 18.6 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-04 16:52:57
Message-ID: CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 15, 2024 at 2:34 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Jul 12, 2024 at 1:19 AM <Masahiro(dot)Ikeda(at)nttdata(dot)com> wrote:
> > I found the cost is estimated to much higher if the number of skipped attributes
> > is more than two. Is it expected behavior?
>
> Yes and no.
>
> Honestly, the current costing is just placeholder code. It is totally
> inadequate. I'm not surprised that you found problems with it. I just
> didn't put much work into it, because I didn't really know what to do.

Attached is v6, which finally does something sensible in btcostestimate.

v6 is also the first version that supports parallel index scans that
can skip. This works by extending the approach taken by scans with
regular SAOP arrays to work with skip arrays. We need to serialize and
deserialize the current array keys in shared memory, as datums -- we
cannot just use simple BTArrayKeyInfo.cur_elem offsets with skip
arrays.

v6 also includes the patch that shows "Index Searches" in EXPLAIN
ANALYZE output, just because it's convenient when testing the patch.
This has been independently submitted as
https://commitfest.postgresql.org/49/5183/, so probably doesn't need
review here.

v6 is the first version of the patch that is basically feature
complete. I only have one big open item left: I must still fix certain
regressions seen with queries that are very unfavorable for skip scan,
where the CPU cost (but not I/O cost) of maintaining skip arrays slows
things down. Overall, I'm making fast progress here.

Back to the topic of the btcostestimate/planner changes. The rest of
the email is a discussion of the cost model.

The planner changes probably still have some problems, but all of the
obvious problems have been fixed by v6. I found it useful to focus on
making the cost model not have any obvious problems instead of trying
to make it match a purely theoretical ideal. For example, your
(Ikeda-san's) complaint about the "Index Scan using idx_id1_id2_id3 on
public.test" test case having too high a cost (higher than the cost of
a slower sequential scan) has been fixed. It's now about 3x cheaper
than the sequential scan, since we're actually paying attention to
ndistinct in v6.

Just like when we cost SAOP arrays on HEAD, skip arrays are costed by
pessimistically multiplying together the estimated number of array
elements for all the scan's arrays, without trying to account for
correlation between index columns. Being pessimistic about
correlations like this is often wrong, but that still seems like the
best bias we could have, all things considered. Plus it's nothing new.

Range style skip arrays require a slightly more complicated approach
to estimating the number of array elements: costing takes a
selectivity estimate from the associated index column's inequality
keys, and applies that estimate to ndistinct itself. That way the cost
of a range skip array is lower than the cost of an otherwise-equivalent
simple skip array (we prorate ndistinct for range skip arrays). More
importantly, the cost of a more selective range is lower than the cost
of a less selective one. There is also a bias here: we don't account
for skew in ndistinct. That's probably OK, because at least it's a
bias *against* skip scan.
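
A rough illustration of the arithmetic, with made-up numbers: if
ndistinct for the skipped column is 1,000, and the column's inequality
keys are estimated to retain ~10% of its values, then:

-- the range skip array is costed as if it had about 100 elements
select round(1000 * 0.10) as estimated_array_elements;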

The new cost model does not specifically try to account for how scans
will behave when no skipping should be expected at all -- cases where
a so-called "skip scan" degenerates into a full index scan. In theory,
we should be costing these scans the same as before, since there has
been no change in runtime behavior. Overall, the cost of full index
scans with very many distinct prefix column values goes down by quite
a bit -- the cost is something like 1/3 lower in typical cases.

The problem with preserving the cost model from HEAD for these
unfavorable cases for skip scan is that I don't feel that I understand
the existing behavior. In practice the revised costing seems to be a
somewhat more accurate predictor of the actual runtime of queries.
Another problem is that I can't see a good way to make the behavior
continuous when ndistinct starts small and grows so large that we
should expect a true full index scan. (As I mentioned at the start of
this email, there are unfixed regressions for these unfavorable cases,
so I'm basing this analysis on the "set skipscan_prefix_cols = 0"
behavior rather than the current default patch behavior to correct for
that. This behavior matches HEAD with a full index scan, and should
match the default behavior in a future version of the skip scan
patch.)

--
Peter Geoghegan

Attachment Content-Type Size
v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 45.5 KB
v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch application/octet-stream 18.7 KB
v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch application/octet-stream 12.9 KB
v6-0004-Add-skip-scan-to-nbtree.patch application/octet-stream 154.7 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-07 15:27:05
Message-ID: [email protected]
Lists: pgsql-hackers

Hi,

I started looking at this patch today. The first thing I usually do for
new patches is a stress test, so I did a simple script that generates
a random table and runs a random query with an IN() clause with various
configs (parallel query, index-only scans, ...). And it got stuck on a
parallel query pretty quick.

I've seen a bunch of those cases, so it's not a particularly unlikely
issue. The backtraces look pretty much the same in all cases - the
processes are stuck either waiting on the condition variable in
_bt_parallel_seize, or trying to send data in shm_mq_send_bytes.

Attached is the script I use for stress testing (pretty dumb, just a
bunch of loops generating tables + queries), and backtraces for two
lockups (one is EXPLAIN ANALYZE, but otherwise exactly the same).

I haven't investigated why this is happening, but I wonder if this might
be similar to the parallel hashjoin issues, with one process trying to
send data while the receiver is unable to proceed, effectively waiting
on the sender. But that's just a wild guess.

regards

--
Tomas Vondra

Attachment Content-Type Size
lockup2.log text/x-log 17.4 KB
lockup.log text/x-log 14.4 KB
run.sh application/x-shellscript 8.4 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-09 20:54:57
Message-ID: CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Sep 7, 2024 at 11:27 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I started looking at this patch today.

Thanks for taking a look!

> The first thing I usually do for
> new patches is a stress test, so I did a simple script that generates
> a random table and runs a random query with an IN() clause with various
> configs (parallel query, index-only scans, ...). And it got stuck on a
> parallel query pretty quick.

I can reproduce this locally, without too much difficulty.
Unfortunately, this is a bug on master/Postgres 17. Some kind of issue
in my commit 5bf748b8.

The timing of this is slightly unfortunate. There's only a few weeks
until the release of 17, plus I have to travel for work over the next
week. I won't be back until the 16th, and will have limited
availability between now and then. I think that I'll have ample time
to debug and fix the issue ahead of the release of 17, though.

Looks like the problem is that a parallel index scan with SAOP array keys
can find itself in a state where every parallel worker waits for the
leader to finish off a scheduled primitive index scan, while the
leader itself waits for the scan's tuple queue to return more tuples.
Obviously, the query will effectively go to sleep indefinitely when
that happens (unless and until the DBA cancels the query). This is
only possible with just the right/wrong combination of array keys and
index cardinality.

I cannot recreate the problem with parallel_leader_participation=off,
which strongly suggests that leader participation is a factor. I'll
find time to study this in detail as soon as I can.
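
A sketch of what I'm testing:

set parallel_leader_participation = off;
set max_parallel_workers_per_gather = 2;
-- re-running the previously stuck IN() query here no longer locks up,
-- now that the leader is taken out of the picture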

Further background: I was always aware of the leader's tendency to go
away forever shortly after the scan begins. That was supposed to be
safe, since we account for it by serializing the scan's current array
keys in shared memory, at the point a primitive index scan is
scheduled -- any backend should be able to pick up where any other
backend left off, no matter how primitive scans are scheduled. That
now doesn't seem to be completely robust, likely due to restrictions
on when and how other backends can pick up the scheduled work from
within _bt_first, at the point that it calls _bt_parallel_seize.

In short, one or two details of how backends call _bt_parallel_seize
to pick up BTPARALLEL_NEED_PRIMSCAN work likely need to be rethought.

--
Peter Geoghegan


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-12 14:49:24
Message-ID: CAEze2WgLavbhzUZBp_=-ObikngqK=tad1Et8b8W-kwb8gQJjPg@mail.gmail.com
Lists: pgsql-hackers

On Mon, 9 Sept 2024 at 21:55, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Sat, Sep 7, 2024 at 11:27 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > I started looking at this patch today.
>
> Thanks for taking a look!
>
> > The first thing I usually do for
> > new patches is a stress test, so I did a simple script that generates
> > a random table and runs a random query with an IN() clause with various
> > configs (parallel query, index-only scans, ...). And it got stuck on a
> > parallel query pretty quick.
>
> I can reproduce this locally, without too much difficulty.
> Unfortunately, this is a bug on master/Postgres 17. Some kind of issue
> in my commit 5bf748b8.
[...]
> In short, one or two details of how backends call _bt_parallel_seize
> to pick up BTPARALLEL_NEED_PRIMSCAN work likely need to be rethought.

Thanks to Peter for the description, which helped me debug the issue. I
think I found a fix for the issue: regression tests for 811af978
consistently got stuck on my macbook before the attached patch 0001;
after applying that patch they completed just fine.

The issue to me seems to be the following:

Only _bt_first can start a new primitive scan, so _bt_parallel_seize
only assigns a new primscan if the process is indeed in _bt_first (as
provided with _b_p_s(first=true)). All other backends that hit a
NEED_PRIMSCAN state will currently pause until a backend in _bt_first
does the next primitive scan.

A backend that hasn't requested the next primitive scan will likely
hit _bt_parallel_seize from code other than _bt_first, thus pausing.
If this is the leader process, it'll stop consuming tuples from
follower processes.

If the follower process finds that a new primitive scan is required
after it finishes reading results from a page, it will first request a
new primitive scan, and only then start producing the tuples.

As such, we can have a follower process that just finished reading a
page, has issued a new primitive scan, and now tries to send tuples to
its primary process before getting back to _bt_first, but its primary
process won't acknowledge any tuples because it's waiting for that
process to start the next primitive scan - now we're deadlocked.

---

The fix in 0001 is relatively simple: we stop backends from waiting
for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
instead move our local state machine so that we'll hit _bt_first
ourselves, so that we may be able to start the next primitive scan.
Also attached is 0002, which adds tracking of responsible backends to
parallel btree scans, thus allowing us to assert we're never waiting
for our own process to move the state forward. I found this patch
helpful while working on solving this issue, even if it wouldn't have
found the bug as reported.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment Content-Type Size
v1-0001-Fix-stuck-parallel-btree-scans.patch application/octet-stream 2.6 KB
v1-0002-nbtree-add-tracking-of-processing-responsibilitie.patch application/octet-stream 4.2 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-14 12:23:50
Message-ID: [email protected]
Lists: pgsql-hackers

On 9/12/24 16:49, Matthias van de Meent wrote:
> On Mon, 9 Sept 2024 at 21:55, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>>
> ...
>
> The fix in 0001 is relatively simple: we stop backends from waiting
> for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
> instead move our local state machine so that we'll hit _bt_first
> ourselves, so that we may be able to start the next primitive scan.
> Also attached is 0002, which adds tracking of responsible backends to
> parallel btree scans, thus allowing us to assert we're never waiting
> for our own process to move the state forward. I found this patch
> helpful while working on solving this issue, even if it wouldn't have
> found the bug as reported.
>

No opinion on the analysis / coding, but per my testing the fix indeed
addresses the issue. The script reliably got stuck within a minute;
now it's been running for ~1h just fine. It also checks the results,
and those look fine too.

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-16 19:13:47
Message-ID: CAH2-WzkgPyaWD+DCh70okFPPjDT9OHr9DFAgEwoz_tSg60AoxA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 12, 2024 at 10:49 AM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> Thanks to Peter for the description, which helped me debug the issue. I
> think I found a fix for the issue: regression tests for 811af978
> consistently got stuck on my macbook before the attached patch 0001;
> after applying that patch they completed just fine.

Thanks for taking a look at it.

> The fix in 0001 is relatively simple: we stop backends from waiting
> for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
> instead move our local state machine so that we'll hit _bt_first
> ourselves, so that we may be able to start the next primitive scan.

I agree with your approach, but I'm concerned about it causing
confusion inside _bt_parallel_done. And so I attach a v2 revision of
your bug fix. v2 adds a check that nails that down, too. I'm not 100%
sure that the change to _bt_parallel_done is strictly necessary to
make the basic fix robust, but it's a good idea either way. In fact, it
seemed like a good idea even before this bug came to light: it was
already clear that this would be strictly necessary for the skip scan
patch -- and for reasons that really have nothing to do with the
requirements of skip scan (it's related to how we call
_bt_parallel_done without much care in code paths dating back to the
original parallel index scan commit).

More details on changes in v2 that didn't appear in Matthias' v1:

v2 makes _bt_parallel_done do nothing at all when the backend-local
so->needPrimScan flag is set (regardless of whether it has been set by
_bt_parallel_seize or by _bt_advance_array_keys). This is a bit like
the approach taken before the Postgres 17 work went in:
_bt_parallel_done used to only permit the shared btps_pageStatus state
to become BTPARALLEL_DONE when it found that "so->arrayKeyCount >=
btscan->btps_arrayKeyCount" (else the call was a no-op). With this
extra hardening, _bt_parallel_done will only permit setting BTPARALLEL_DONE when
"!so->needPrimScan". Same idea, more or less.

v2 also changes comments in _bt_parallel_seize. The comment tweaks
suggest that the new "if (!first && status ==
BTPARALLEL_NEED_PRIMSCAN) return false" path is similar to the
existing master branch "if (!first && so->needPrimScan) return false"
precheck logic on master (the precheck that takes place before
examining any state in shared memory). The new path can be thought of
as dealing with cases where the backend-local so->needPrimScan flag
must have been stale back when it was prechecked -- it's essentially the same
logic, though unlike the precheck it works against the authoritative
shared memory state.

My current plan is to commit something like this in the next day or two.

--
Peter Geoghegan

Attachment Content-Type Size
v2-0001-Fix-stuck-parallel-btree-scans.patch application/x-patch 6.1 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-16 22:05:47
Message-ID: [email protected]
Lists: pgsql-hackers

Hi,

I've been looking at this patch over the last couple of days, mostly doing
some stress testing / benchmarking (hence the earlier report) and basic
review. I do have some initial review comments, and the testing produced
some interesting regressions (not sure if those are the cases where
skipscan can't really help, which Peter mentioned he needs to look into).

review
------

First, the review comments - nothing particularly serious, mostly just
cosmetic stuff:

1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

- I find the places that increment "nsearches" a bit random. Each AM
does it in an entirely different place (at least it seems like that to me).
Is there a way to make this a bit more consistent?

- I find this comment rather unhelpful:

uint64 btps_nsearches; /* instrumentation */

Instrumentation what? What's the counter for?

- I see _bt_first moved the pgstat_count_index_scan, but doesn't that
mean we skip it if the earlier code does "goto readcomplete"? Shouldn't
that still count as an index scan?

- show_indexscan_nsearches does this:

if (scanDesc && scanDesc->nsearches > 0)
ExplainPropertyUInteger("Index Searches", NULL,
scanDesc->nsearches, es);

But shouldn't it divide the count by nloops, similar to (for example)
show_instrumentation_count?

2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch

- Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
rather weird, I'd just call it "oppositeDirCheck".

3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch

- nothing

4) v6-0004-Add-skip-scan-to-nbtree.patch

- indices.sgml seems to have a typo "Intevening" -> "Intervening"

- It doesn't seem like a good idea to remove the paragraph about
multicolumn indexes and replace it with just:

Multicolumn indexes should be used judiciously.

I mean, what does judiciously even mean? What should the user consider
to be judicious? Seems rather unclear to me. Admittedly, the old text
was not very helpful, but at least it gave some advice.

But maybe more importantly, doesn't skipscan apply only to a rather
limited subset of data types (that support increment/decrement)? Doesn't
the new wording mostly ignore that, implying skipscan applies to all
btree indexes? I don't think it mentions datatypes anywhere, but there
are many indexes on data types like text, UUID and so on.

- Very subjective nitpicking, but I find it a bit strange when a comment
about a block is nested in the block, like in _bt_first() for the
array->null_elem check.

- assignProcTypes() claims providing skipscan for cross-type scenarios
doesn't make sense. Why is that? I'm not saying the claim is wrong, but
it's not clear to me why that would be the case.

costing
-------

Peter asked me to look at the costing, and I think it looks generally
sensible. We don't really have a lot of information to base the costing
on in the first place - the whole point of skipscan is about multicolumn
indexes, but none of the existing extended statistics seem very useful.
We'd need some cross-column correlation info, or something like that.

It's an interesting question - if we could collect some new statistics
for multicolumn indexes (say, by having a way to collect AM-specific
stats), what would we collect for skipscan?
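
The closest thing that exists today is the ndistinct coefficients on
column groups - a sketch, using one of the tables from my tests:

create statistics s_ndist (ndistinct) on id1, id2
  from t_1000000_1000_1_2;
analyze t_1000000_1000_1_2;

select s.stxname, d.stxdndistinct
from pg_statistic_ext s
join pg_statistic_ext_data d on d.stxoid = s.oid
where s.stxname = 's_ndist';

But AFAICS nothing in the proposed costing would know what to do with
that yet.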

There's one thing that I don't quite understand, and that's how
btcost_correlation() adjusts correlation for multicolumn indexes:

if (index->nkeycolumns > 1)
indexCorrelation = varCorrelation * 0.75;

That seems fine for a two-column index, I guess. But shouldn't it
compound for indexes with more keys? I mean, 0.75 * 0.75 for a third
column, etc.? I don't think btcostestimate() does that; it just remembers
whatever btcost_correlation() returns.
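
The compounding I have in mind looks like this (with a made-up
per-column correlation of 0.9):

select n as nkeycolumns, 0.9 * power(0.75, n - 1) as adjusted
from generate_series(1, 4) n;

i.e. each additional key column damps the correlation a bit more,
instead of one flat 0.75 factor regardless of the number of columns.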

Anyway, the overall costing approach seems sensible, I think. It assumes
things we assume in general (columns/keys are considered independent),
which may be problematic, but this is the best we can do.

The only alternative approach I can think of is not to adjust the
costing for the index scan at all, and only use this to enable (or not
enable) the skipscan internally. That would mean the overall plan
remains the same, and maybe sometimes we would think an index scan would
be too expensive and use something else. Not great, but it doesn't have
the risk of regressions - IIUC we can disable the skipscan at runtime,
if we realize it's not really helpful.

If we're concerned about regressions, I think this would be the way to
deal with them. Or at least it's the best idea I have.

testing
-------

As usual, I wrote a bash script to do a bit of stress testing. It
generates tables with random data, and then runs random queries with
random predicates on them, while mutating a couple parameters (like
number of workers) to trigger different plans. It does that on 16,
master and with the skipscan patch (with the fix for parallel scans).

I've uploaded the script and results from the last run here:

https://github.com/tvondra/pg-skip-scan-tests

There's the "run-mdam.sh" script that generates tables/queries, runs
them, collects all kinds of info about the query, and produces files
with explain plans, CSV with timings, etc.

Not all of the queries end up using index scans - depending on the
predicates, etc. it might have to use seqscan. Or maybe it only uses
index scan because it's forced to by the enable_* options, etc.

Anyway, I ran a couple thousand such queries, and I haven't found any
incorrect results (the script compares that between versions too). So
that's good ;-)

But my main goal was to see how this affects performance. The tables
were pretty small (just 1M rows, maybe ~80MB), but with restarts and
dropping caches, large enough to test this.

And generally the results seem good. You can either inspect the CSV with
raw results (look at the script to understand what the fields are), or
check the attached PDF with a pivot table summarizing them.

As usual, there's a heatmap on the right side, comparing the results for
different versions (first "master/16" and then "skipscan/master"). Green
means "speedup/good" and red meand "regression/bad".

Most of the places are "white" (no change) or not very far from it, or
perhaps "green". But there's also a bunch of red results, which means
regression (FWIW the PDF is filtered only to queries that would actually
use the executed plans without the GUCs).

Some of the red places are for very short queries - just a couple ms,
which means it can easily be random noise, or something like that. But a
couple queries are much longer, and might deserve some investigation.
The easiest way is to look at the "QID" column in the row, which
identifies the query in the "query" CSV. Then look into the results CSV
for IDs of the runs (in the first "SEQ" column), and find the details in
the "analyze" log, which has all the plans etc.

Alternatively, use the .ods in the git repository, which allows drilling
down to the results (from the pivot tables).

For example, one of the slowed down queries is query 702 (top of page 8
in the PDF). The query is pretty simple:

explain (analyze, timing off, buffers off)
select id1,id2 from t_1000000_1000_1_2
where NOT (id1 in (:list)) AND (id2 = :value);

and it was executed on a table with random data in two columns, each
with 1000 distinct values. This is perfectly random data, so a great
match for the assumptions in costing etc.

With uncached data, this runs in ~50 ms on master, but takes almost
200 ms with skipscan (these timings are from my laptop, but similar to
the results).

-- master
Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
t_1000000_1000_1_2 (cost=0.96..20003.96 rows=1719 width=16)
(actual rows=811 loops=1)
Index Cond: (id2 = 997)
Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
Rows Removed by Filter: 163
Heap Fetches: 0
Planning Time: 7.596 ms
Execution Time: 28.851 ms
(7 rows)

-- with skipscan
Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
(actual rows=811 loops=1)
Index Cond: (id2 = 997)
Index Searches: 1007
Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
Rows Removed by Filter: 163
Heap Fetches: 0
Planning Time: 3.730 ms
Execution Time: 238.554 ms
(8 rows)

I haven't looked into why this is happening, but this seems like a
pretty good match for skipscan (on the first column). And for the
costing too - it's perfectly random data, no correlation, etc.

regards

--
Tomas Vondra

Attachment Content-Type Size
optimal.pdf application/pdf 320.1 KB
q702.sql application/sql 1.5 KB
run-mdam.sh application/x-shellscript 23.4 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-17 15:10:58
Message-ID: CAH2-WznMjjWrJqB9XrE-CupWRo4+LnnKHPgzUei5cCy6LnN2GA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 3:13 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I agree with your approach, but I'm concerned about it causing
> confusion inside _bt_parallel_done. And so I attach a v2 revision of
> your bug fix. v2 adds a check that nails that down, too.

Pushed this just now.

Thanks
--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-17 15:49:11
Message-ID: CAH2-WznWiiuswhaUZg6viKwX7mqQNe_2bttzP4wR11WrKo8d9Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 4, 2024 at 12:52 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v6, which finally does something sensible in btcostestimate.

Attached is v7, which is just to fix bitrot, and keep CFBot happy. No
real changes here.

I will work through Tomas' recent feedback in the next few days.

--
Peter Geoghegan

Attachment Content-Type Size
v7-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 45.5 KB
v7-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch application/octet-stream 12.9 KB
v7-0003-Refactor-handling-of-nbtree-array-redundancies.patch application/octet-stream 18.7 KB
v7-0004-Add-skip-scan-to-nbtree.patch application/octet-stream 154.5 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-17 22:14:46
Message-ID: CAH2-Wz=E7XrkvscBN0U6V81NK3Q-dQOmivvbEsjG-zwEfDdFpg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I've been looking at this patch over the last couple of days, mostly doing
> some stress testing / benchmarking (hence the earlier report) and basic
> review.

Thanks for taking a look! Very helpful.

> I do have some initial review comments, and the testing produced
> some interesting regressions (not sure if those are the cases where
> skipscan can't really help, which Peter mentioned he needs to look into).

The one type of query that's clearly regressed in a way that's just
not acceptable is queries where we waste CPU cycles during scans
where it's truly hopeless. For example, I see a big regression on one
of the best cases for the Postgres 17 work, described here:

https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans#a-practical-example-3x-performance-improvement

Notably, these cases access exactly the same buffers/pages as before,
so this really isn't a matter of "doing too much skipping". The number
of buffers hit exactly matches what you'll see on Postgres 17. It's
just that we waste too many CPU cycles in code such as
_bt_advance_array_keys, to uselessly maintain skip arrays.

I'm not suggesting that there won't be any gray area with these
regressions -- nothing like this will ever be that simple. But it
seems to me like I should go fix these obviously-not-okay cases next,
and then see where that leaves everything else, regressions-wise. That
seems likely to be the most efficient way of dealing with the
regressions. So I'll start there.

That said, I *would* be surprised if you found a regression in any
query that simply didn't receive any new scan key transformations in
new preprocessing code in places like _bt_decide_skipatts and
_bt_skip_preproc_shrink. I see that many of the queries that you're
using for your stress-tests "aren't really testing skip scan", in this
sense. But I'm hardly about to tell you that you shouldn't spend time
on such queries -- that approach just discovered a bug affecting
Postgres 17 (that was also surprising, but it still happened!). My
point is that it's worth being aware of which test queries actually
use skip arrays in the first place -- it might help you with your
testing. There are essentially no changes to _bt_advance_array_keys
that'll affect traditional SAOP arrays (with the sole exception of
changes made by
v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch, which
affect every kind of array in the same way).

> 1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
>
> - I find the places that increment "nsearches" a bit random. Each AM
> does it in an entirely different place (at least it seems like that to me).
> Is there a way to make this a bit more consistent?

From a mechanical perspective there is nothing at all random about it:
we do this at precisely the same point that we currently call
pgstat_count_index_scan, which in each index AM maps to one descent of
the index. It is at least consistent. Whenever a B-Tree index scan
shows "Index Scans: N", you'll see precisely the same number by
swapping it with an equivalent contrib/btree_gist-based GiST index and
running the same query again (assuming that index tuples that match
the array keys are spread apart in both the B-Tree and GiST indexes).

(Though I see problems with the precise place that nbtree calls
pgstat_count_index_scan right now, at least in certain edge-cases,
which I discuss below in response to your questions about that.)

> uint64 btps_nsearches; /* instrumentation */
>
> Instrumentation what? What's the counter for?

Will fix.

In case you missed it, there is another thread + CF Entry dedicated to
discussing this instrumentation patch:

https://commitfest.postgresql.org/49/5183/
https://www.postgresql.org/message-id/flat/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q(at)mail(dot)gmail(dot)com

> - I see _bt_first moved the pgstat_count_index_scan, but doesn't that
> mean we skip it if the earlier code does "goto readcomplete"? Shouldn't
> that still count as an index scan?

In my opinion, no, it should not.

We're counting the number of times we'll have descended the tree using
_bt_search (or using _bt_endpoint, perhaps), which is a precisely
defined physical cost. A little like counting the number of buffers
accessed. I actually think that this aspect of how we call
pgstat_count_index_scan is a bug that should be fixed, with the fix
backpatched to Postgres 17. Right now, we see completely different
counts for a parallel index scan, compared to an equivalent serial
index scan -- differences that cannot be explained as minor
differences caused by parallel scan implementation details. I think
that it's just wrong right now, on master, since we're simply not
counting the thing that we're supposed to be counting (not reliably,
not if it's a parallel index scan).

> - show_indexscan_nsearches does this:
>
> if (scanDesc && scanDesc->nsearches > 0)
> ExplainPropertyUInteger("Index Searches", NULL,
> scanDesc->nsearches, es);
>
> But shouldn't it divide the count by nloops, similar to (for example)
> show_instrumentation_count?

I can see arguments for and against doing it that way. It's
ambiguous/subjective, but on balance I favor not dividing by nloops.
You can make a similar argument for doing this with "Buffers: ", and
yet we don't divide by nloops there, either.

Honestly, I just want to find a way to do this that everybody can live
with. Better documentation could help here.

> 2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
>
> - Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
> rather weird, I'd just call it "oppositeDirCheck".

Will fix.

> 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch
>
> - nothing

Great. I think that I should be able to commit this one soon, since
it's independently useful work.

> 4) v6-0004-Add-skip-scan-to-nbtree.patch
>
> - indices.sgml seems to have a typo "Intevening" -> "Intervening"
>
> - It doesn't seem like a good idea to remove the paragraph about
> multicolumn indexes and replace it with just:
>
> Multicolumn indexes should be used judiciously.
>
> I mean, what does judiciously even mean? What should the user consider
> to be judicious? Seems rather unclear to me. Admittedly, the old text
> was not very helpful, but at least it gave some advice.

Yeah, this definitely needs more work.

> But maybe more importantly, doesn't skipscan apply only to a rather
> limited subset of data types (that support increment/decrement)? Doesn't
> the new wording mostly ignore that, implying skipscan applies to all
> btree indexes? I don't think it mentions datatypes anywhere, but there
> are many indexes on data types like text, UUID and so on.

Actually, no, skip scan works in almost the same way with all data
types. Earlier versions of the patch didn't support every data type
(perhaps I should have waited for that before posting my v1), but the
version of the patch you looked at has no restrictions on any data
type.

You must be thinking of whether or not an opclass has skip support.
That's just an extra optimization, which can be used for a small
handful of discrete data types such as integer and date (hard to
imagine how skip support could ever be implemented for types like
numeric and text). There is a temporary testing GUC that will allow
you to get a sense of how much skip support can help: try "set
skipscan_skipsupport_enabled=off" with (say) my original MDAM test
query to get a sense of that. You'll see more buffer hits needed for
"next key probes", though not dramatically more.

It's worth having skip support (the idea comes from the MDAM paper),
but it's not essential. Whether or not an opclass has skip support
isn't accounted for by the cost model, but I doubt that it's worth
addressing (the cost model is already pessimistic).

> - Very subjective nitpicking, but I find it a bit strange when a comment
> about a block is nested in the block, like in _bt_first() for the
> array->null_elem check.

Will fix.

> - assignProcTypes() claims providing skipscan for cross-type scenarios
> doesn't make sense. Why is that? I'm not saying the claim is wrong, but
> it's not clear to me why that would be the case.

It is just talking about the support function that skip scan can
optionally use, where it makes sense (skip support functions). The
relevant "else if (member->number == BTSKIPSUPPORT_PROC)" stanza is
largely copied from the existing nearby "else if (member->number ==
BTEQUALIMAGE_PROC)" stanza that was added for B-Tree deduplication. In
both stanzas we're talking about a capability that maps to a
particular "input opclass", which means the opclass that maps to the
datums that are stored on disk, in index tuples.

There are no restrictions on the use of skip scan with queries that
happen to involve the use of cross-type operators. It doesn't even
matter if we happen to be using an incomplete opfamily, since range
skip arrays never need to *directly* take the current array element
from a lower/upper bound inequality scan key's argument. It all
happens indirectly: code in places like _bt_first and _bt_checkkeys
can use inequalities (which are stored in BTArrayKeyInfo.low_compare
and BTArrayKeyInfo.high_compare) to locate the next matching on-disk
index tuple that satisfies the inequality in question. Obviously, the
located datum must be the same type as the one used by the array and
its scan key (it has to be the input opclass type if it's taken from
an index tuple).

I think that it's a bit silly that nbtree generally bends over
backwards to find a way to execute a scan, given an incomplete
opfamily; in a green field situation it would make sense to just throw
an error instead. Even still, skip scan works in a way that is
maximally forgiving when incomplete opfamilies are used. Admittedly,
it is just about possible to come up with a scenario where we'll now
throw an error for a query that would have worked on Postgres 17. But
that's no different to what would happen if the query had an explicit
"= any( )" non-cross-type array instead of an implicit non-cross-type
skip array. The real problem in these scenarios is the lack of a
suitable cross-type ORDER proc (for a cross-type-operator query)
within _bt_first -- not the lack of cross-type operators. This issue
with missing ORDER procs just doesn't seem worth worrying about,
since, as I said, even slightly different queries (that don't use skip
scan) are bound to throw the same errors either way.

> Peter asked me to look at the costing, and I think it looks generally
> sensible.

I'm glad that you think that I basically have the right idea here.
Hard to know how to approach something like this, which doesn't have
any kind of precedent to draw on.

> We don't really have a lot of information to base the costing
> on in the first place - the whole point of skipscan is about multicolumn
> indexes, but none of the existing extended statistics seems very useful.
> We'd need some cross-column correlation info, or something like that.

Maybe, but that would just mean that we'd sometimes be more optimistic
about skip scan helping than we are with the current approach of
pessimistically assuming that there is no correlation at all. Not
clear that being pessimistic in this sense isn't the right thing to
do, despite the fact that it's clearly less accurate on average.

> There's one thing that I don't quite understand, and that's how
> btcost_correlation() adjusts correlation for multicolumn indexes:
>
> if (index->nkeycolumns > 1)
> indexCorrelation = varCorrelation * 0.75;
>
> That seems fine for a two-column index, I guess. But shouldn't it
> compound for indexes with more keys? I mean, 0.75 * 0.75 for third
> column, etc? I don't think btcostestimate() does that, it just remembers
> whatever btcost_correlation() returns.

I don't know either. In general I'm out of my comfort zone here.

> The only alternative approach I can think of is not to adjust the
> costing for the index scan at all, and only use this to enable (or not
> enable) the skipscan internally. That would mean the overall plan
> remains the same, and maybe sometimes we would think an index scan would
> be too expensive and use something else. Not great, but it doesn't have
> the risk of regressions - IIUC we can disable the skipscan at runtime,
> if we realize it's not really helpful.

In general I would greatly prefer to not have a distinct kind of index
path for scans that use skip scan. I'm quite keen on a design that
allows the scan to adapt to unpredictable conditions at runtime.

Of course, that doesn't preclude passing the index scan a hint about
what's likely to work at runtime, based on information figured out
when costing the scan. Perhaps that will prove necessary to avoid
regressing index scans that are naturally quite cheap already -- scans
where we really need to have the right general idea from the start to
avoid any regressions. I'm not opposed to that, provided the index
scan has the ability to change its mind when (for whatever reason) the
guidance from the optimizer turns out to be wrong.

> As usual, I wrote a bash script to do a bit of stress testing. It
> generates tables with random data, and then runs random queries with
> random predicates on them, while mutating a couple parameters (like
> number of workers) to trigger different plans. It does that on 16,
> master and with the skipscan patch (with the fix for parallel scans).

I wonder if some of the regressions you see can be tied to the use of
an LWLock in place of the existing use of a spin lock. I did that
because I sometimes need to allocate memory to deserialize the array
keys, with the exclusive lock held. It might be the case that a lot of
these regressions are tied to that, or something else that is far from
obvious...have to investigate.

In general, I haven't done much on parallel index scans here (I only
added support for them very recently), whereas your testing places a
lot of emphasis on parallel scans. Nothing wrong with that emphasis
(it caught that 17 bug), but just want to put it in context.

> I've uploaded the script and results from the last run here:
>
> https://github.com/tvondra/pg-skip-scan-tests
>
> There's the "run-mdam.sh" script that generates tables/queries, runs
> them, collects all kinds of info about the query, and produces files
> with explain plans, CSV with timings, etc.

It'll take me a while to investigate all this data.

> Anyway, I ran a couple thousand such queries, and I haven't found any
> incorrect results (the script compares that between versions too). So
> that's good ;-)

That's good!

> But my main goal was to see how this affects performance. The tables
> were pretty small (just 1M rows, maybe ~80MB), but with restarts and
> dropping caches, large enough to test this.

The really compelling cases all tend to involve fairly selective index
scans. Obviously, skip scan can only save work by navigating the index
structure more efficiently (unlike loose index scan). So if the main
cost is inherently bound to be the cost of heap accesses, then we
shouldn't expect a big speed up.

> For example, one of the slowed down queries is query 702 (top of page 8
> in the PDF). The query is pretty simple:
>
> explain (analyze, timing off, buffers off)
> select id1,id2 from t_1000000_1000_1_2
> where NOT (id1 in (:list)) AND (id2 = :value);
>
> and it was executed on a table with random data in two columns, each
> with 1000 distinct values. This is perfectly random data, so a great
> match for the assumptions in costing etc.
>
> But with uncached data, this runs in ~50 ms on master, but takes almost
> 200 ms with skipscan (these timings are from my laptop, but similar to
> the results).

I'll need to investigate this specifically. That does seem odd.

FWIW, it's a pity that the patch doesn't know how to push down the NOT
IN () here. The MDAM paper contemplates such a scheme. We see the use
of filter quals here, when in principle this could work by using a
skip array that doesn't generate elements that appear in the NOT IN()
list (it'd generate every possible indexable value *except* the given
list/array values). The only reason that I haven't implemented this
yet is because I'm not at all sure how to make it work on the
optimizer side. The nbtree side of the implementation will probably be
quite straightforward, since it's really just a slight variant of a
skip array, that excludes certain values.
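
As a sketch (made-up values), the scheme would execute a query such as

select id1, id2 from t
where id1 not in (1, 7, 42) and id2 = 5;

with a skip array on "id1" whose element generation simply never
emits 1, 7, or 42, rather than applying the NOT IN as a filter qual
after the scan has already read the tuples.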

> -- with skipscan
> Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
> t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
> (actual rows=811 loops=1)
> Index Cond: (id2 = 997)
> Index Searches: 1007
> Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
> Rows Removed by Filter: 163
> Heap Fetches: 0
> Planning Time: 3.730 ms
> Execution Time: 238.554 ms
> (8 rows)
>
> I haven't looked into why this is happening, but this seems like a
> pretty good match for skipscan (on the first column). And for the
> costing too - it's perfectly random data, no correlation, etc.

I wonder what "Buffers: N" shows? That's usually the first thing I
look at (that and "Index Searches", which looks like what you said it
should look like here). But, yeah, let me get back to you on this.

Thanks again!
--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-18 11:36:33
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 9/18/24 00:14, Peter Geoghegan wrote:
> On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> I've been looking at this patch over the couple last days, mostly doing
>> some stress testing / benchmarking (hence the earlier report) and basic
>> review.
>
> Thanks for taking a look! Very helpful.
>
>> I do have some initial review comments, and the testing produced
>> some interesting regressions (not sure if those are the cases where
>> skipscan can't really help, that Peter mentioned he needs to look into).
>
> The one type of query that's clearly regressed in a way that's just
> not acceptable are queries where we waste CPU cycles during scans
> where it's truly hopeless. For example, I see a big regression on one
> of the best cases for the Postgres 17 work, described here:
>
> https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans#a-practical-example-3x-performance-improvement
>
> Notably, these cases access exactly the same buffers/pages as before,
> so this really isn't a matter of "doing too much skipping". The number
> of buffers hit exactly matches what you'll see on Postgres 17. It's
> just that we waste too many CPU cycles in code such as
> _bt_advance_array_keys, to uselessly maintain skip arrays.
>
> I'm not suggesting that there won't be any gray area with these
> regressions -- nothing like this will ever be that simple. But it
> seems to me like I should go fix these obviously-not-okay cases next,
> and then see where that leaves everything else, regressions-wise. That
> seems likely to be the most efficient way of dealing with the
> regressions. So I'll start there.
>
> That said, I *would* be surprised if you found a regression in any
> query that simply didn't receive any new scan key transformations in
> new preprocessing code in places like _bt_decide_skipatts and
> _bt_skip_preproc_shrink. I see that many of the queries that you're
> using for your stress-tests "aren't really testing skip scan", in this
> sense. But I'm hardly about to tell you that you shouldn't spend time
> on such queries -- that approach just discovered a bug affecting
> Postgres 17 (that was also surprising, but it still happened!). My
> point is that it's worth being aware of which test queries actually
> use skip arrays in the first place -- it might help you with your
> testing. There are essentially no changes to _bt_advance_array_keys
> that'll affect traditional SAOP arrays (with the sole exception of
> changes made by
> v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch, which
> affect every kind of array in the same way).
>

Makes sense. I started with the testing before even looking at
the code, so it's mostly a "black box" approach. I did read the 1995
paper before that, and the script generates queries with clauses
inspired by that paper, in particular:

- col = $value
- col IN ($values)
- col BETWEEN $value AND $value
- NOT (clause)
- clause [AND|OR] clause
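
For example, the script emits queries shaped like these (values are
randomized at generation time; these are purely illustrative):

select * from t where id1 = 10;
select * from t where id1 in (1, 5, 9);
select * from t where id1 between 10 and 20;
select * from t where not (id1 in (1, 5, 9));
select * from t where id1 = 10 and id2 between 1 and 5;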

There certainly may be gaps and interesting cases the script does not
cover. Something to improve.

>> 1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
>>
>> - I find the places that increment "nsearches" a bit random. Each AM
>> does it in entirely different place (at least it seems like that to me).
>> Is there a way make this a bit more consistent?
>
> From a mechanical perspective there is nothing at all random about it:
> we do this at precisely the same point that we currently call
> pgstat_count_index_scan, which in each index AM maps to one descent of
> the index. It is at least consistent. Whenever a B-Tree index scan
> shows "Index Scans: N", you'll see precisely the same number by
> swapping it with an equivalent contrib/btree_gist-based GiST index and
> running the same query again (assuming that index tuples that match
> the array keys are spread apart in both the B-Tree and GiST indexes).
>
> (Though I see problems with the precise place that nbtree calls
> pgstat_count_index_scan right now, at least in certain edge-cases,
> which I discuss below in response to your questions about that.)
>

OK, understood. FWIW I'm not saying these places are "wrong", just that
it feels each AM does that in a very different place.

>> uint64 btps_nsearches; /* instrumentation */
>>
>> Instrumentation what? What's the counter for?
>
> Will fix.
>
> In case you missed it, there is another thread + CF Entry dedicated to
> discussing this instrumentation patch:
>
> https://commitfest.postgresql.org/49/5183/
> https://www.postgresql.org/message-id/flat/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q(at)mail(dot)gmail(dot)com
>

Thanks, I wasn't aware of that.

>> - I see _bt_first moved the pgstat_count_index_scan, but doesn't that
>> mean we skip it if the earlier code does "goto readcomplete"? Shouldn't
>> that still count as an index scan?
>
> In my opinion, no, it should not.
>
> We're counting the number of times we'll have descended the tree using
> _bt_search (or using _bt_endpoint, perhaps), which is a precisely
> defined physical cost. A little like counting the number of buffers
> accessed. I actually think that this aspect of how we call
> pgstat_count_index_scan is a bug that should be fixed, with the fix
> backpatched to Postgres 17. Right now, we see completely different
> counts for a parallel index scan, compared to an equivalent serial
> index scan -- differences that cannot be explained as minor
> differences caused by parallel scan implementation details. I think
> that it's just wrong right now, on master, since we're simply not
> counting the thing that we're supposed to be counting (not reliably,
> not if it's a parallel index scan).
>

OK, understood. If it's essentially an independent issue (perhaps even
counts as a bug?) what about correcting it on master first? Doesn't
sound like something we'd backpatch, I guess.

>> - show_indexscan_nsearches does this:
>>
>> if (scanDesc && scanDesc->nsearches > 0)
>> ExplainPropertyUInteger("Index Searches", NULL,
>> scanDesc->nsearches, es);
>>
>> But shouldn't it divide the count by nloops, similar to (for example)
>> show_instrumentation_count?
>
> I can see arguments for and against doing it that way. It's
> ambiguous/subjective, but on balance I favor not dividing by nloops.
> You can make a similar argument for doing this with "Buffers: ", and
> yet we don't divide by nloops there, either.
>
> Honestly, I just want to find a way to do this that everybody can live
> with. Better documentation could help here.
>

Seems like a bit of a mess. IMHO we should either divide everything by
nloops (so that everything is "per loop"), or not divide anything. My
vote would be to divide, but that's mostly my "learned assumption" from
the other fields. But having a 50:50 split is confusing for everyone.

>> 2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
>>
>> - Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
>> rather weird, I'd just call it "oppositeDirCheck".
>
> Will fix.
>
>> 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch
>>
>> - nothing
>
> Great. I think that I should be able to commit this one soon, since
> it's independently useful work.
>

+1

>> 4) v6-0004-Add-skip-scan-to-nbtree.patch
>>
>> - indices.sgml seems to have a typo "Intevening" -> "Intervening"
>>
>> - It doesn't seem like a good idea to remove the paragraph about
>> multicolumn indexes and replace it with just:
>>
>> Multicolumn indexes should be used judiciously.
>>
>> I mean, what does judiciously even mean? what should the user consider
>> to be judicious? Seems rather unclear to me. Admittedly, the old text
>> was not much helpful, but at least it gave some advice.
>
> Yeah, this definitely needs more work.
>
>> But maybe more importantly, doesn't skipscan apply only to a rather
>> limited subset of data types (that support increment/decrement)? Doesn't
>> the new wording mostly ignore that, implying skipscan applies to all
>> btree indexes? I don't think it mentions datatypes anywhere, but there
>> are many indexes on data types like text, UUID and so on.
>
> Actually, no, skip scan works in almost the same way with all data
> types. Earlier versions of the patch didn't support every data type
> (perhaps I should have waited for that before posting my v1), but the
> version of the patch you looked at has no restrictions on any data
> type.
>
> You must be thinking of whether or not an opclass has skip support.
> That's just an extra optimization, which can be used for a small
> handful of discrete data types such as integer and date (hard to
> imagine how skip support could ever be implemented for types like
> numeric and text). There is a temporary testing GUC that will allow
> you to get a sense of how much skip support can help: try "set
> skipscan_skipsupport_enabled=off" with (say) my original MDAM test
> query to get a sense of that. You'll see more buffer hits needed for
> "next key probes", though not dramatically more.
>
> It's worth having skip support (the idea comes from the MDAM paper),
> but it's not essential. Whether or not an opclass has skip support
> isn't accounted for by the cost model, but I doubt that it's worth
> addressing (the cost model is already pessimistic).
>

I admit I'm a bit confused. I probably need to reread the paper, but my
impression was that the increment/decrement is required for skipscan to
work. If we can't do that, how would it generate the intermediate values
to search for? I imagine it would be possible to "step through" the
index, but I thought the point of skip scan is to not do that.

Anyway, probably a good idea for extending the stress testing script.
Right now it tests with "bigint" columns only.

>> - Very subjective nitpicking, but I find it a bit strange when a comment
>> about a block is nested in the block, like in _bt_first() for the
>> array->null_elem check.
>
> Will fix.
>
>> - assignProcTypes() claims providing skipscan for cross-type scenarios
>> doesn't make sense. Why is that? I'm not saying the claim is wrong, but
>> it's not clear to me why that would be the case.
>
> It is just talking about the support function that skip scan can
> optionally use, where it makes sense (skip support functions). The
> relevant "else if (member->number == BTSKIPSUPPORT_PROC)" stanza is
> largely copied from the existing nearby "else if (member->number ==
> BTEQUALIMAGE_PROC)" stanza that was added for B-Tree deduplication. In
> both stanzas we're talking about a capability that maps to a
> particular "input opclass", which means the opclass that maps to the
> datums that are stored on disk, in index tuples.
>
> There are no restrictions on the use of skip scan with queries that
> happen to involve the use of cross-type operators. It doesn't even
> matter if we happen to be using an incomplete opfamily, since range
> skip arrays never need to *directly* take the current array element
> from a lower/upper bound inequality scan key's argument. It all
> happens indirectly: code in places like _bt_first and _bt_checkkeys
> can use inequalities (which are stored in BTArrayKeyInfo.low_compare
> and BTArrayKeyInfo.high_compare) to locate the next matching on-disk
> index tuple that satisfies the inequality in question. Obviously, the
> located datum must be the same type as the one used by the array and
> its scan key (it has to be the input opclass type if it's taken from
> an index tuple).
>
> I think that it's a bit silly that nbtree generally bends over
> backwards to find a way to execute a scan, given an incomplete
> opfamily; in a green field situation it would make sense to just throw
> an error instead. Even still, skip scan works in a way that is
> maximally forgiving when incomplete opfamilies are used. Admittedly,
> it is just about possible to come up with a scenario where we'll now
> throw an error for a query that would have worked on Postgres 17. But
> that's no different to what would happen if the query had an explicit
> "= any( )" non-cross-type array instead of an implicit non-cross-type
> skip array. The real problem in these scenarios is the lack of a
> suitable cross-type ORDER proc (for a cross-type-operator query)
> within _bt_first -- not the lack of cross-type operators. This issue
> with missing ORDER procs just doesn't seem worth worrying about,
> since, as I said, even slightly different queries (that don't use skip
> scan) are bound to throw the same errors either way.
>

OK. Thanks for the explanation. I'll think about maybe testing such
queries too (with cross-type clauses).

>> Peter asked me to look at the costing, and I think it looks generally
>> sensible.
>
> I'm glad that you think that I basically have the right idea here.
> Hard to know how to approach something like this, which doesn't have
> any kind of precedent to draw on.
>
>> We don't really have a lot of information to base the costing
>> on in the first place - the whole point of skipscan is about multicolumn
>> indexes, but none of the existing extended statistics seems very useful.
>> We'd need some cross-column correlation info, or something like that.
>
> Maybe, but that would just mean that we'd sometimes be more optimistic
> about skip scan helping than we are with the current approach of
> pessimistically assuming that there is no correlation at all. Not
> clear that being pessimistic in this sense isn't the right thing to
> do, despite the fact that it's clearly less accurate on average.
>

Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming
no correlation means pessimistic skipscan costing) in a comment before
btcostestimate, or somewhere close.

>> There's one thing that I don't quite understand, and that's how
>> btcost_correlation() adjusts correlation for multicolumn indexes:
>>
>> if (index->nkeycolumns > 1)
>> indexCorrelation = varCorrelation * 0.75;
>>
>> That seems fine for a two-column index, I guess. But shouldn't it
>> compound for indexes with more keys? I mean, 0.75 * 0.75 for third
>> column, etc? I don't think btcostestimate() does that, it just remembers
>> whatever btcost_correlation() returns.
>
> I don't know either. In general I'm out of my comfort zone here.
>

Don't we do something similar elsewhere? For example, IIRC we do some
adjustments when estimating grouping in estimate_num_groups(), and
incremental sort had to deal with something similar too. Maybe we could
learn something from those places ... (both from the good and bad
experiences).

>> The only alternative approach I can think of is not to adjust the
>> costing for the index scan at all, and only use this to enable (or not
>> enable) the skipscan internally. That would mean the overall plan
>> remains the same, and maybe sometimes we would think an index scan would
>> be too expensive and use something else. Not great, but it doesn't have
>> the risk of regressions - IIUC we can disable the skipscan at runtime,
>> if we realize it's not really helpful.
>
> In general I would greatly prefer to not have a distinct kind of index
> path for scans that use skip scan. I'm quite keen on a design that
> allows the scan to adapt to unpredictable conditions at runtime.
>

Right. I don't think I've been suggesting having a separate path, I 100%
agree it's better to have this as an option for index scan paths.

> Of course, that doesn't preclude passing the index scan a hint about
> what's likely to work at runtime, based on information figured out
> when costing the scan. Perhaps that will prove necessary to avoid
> regressing index scans that are naturally quite cheap already -- scans
> where we really need to have the right general idea from the start to
> avoid any regressions. I'm not opposed to that, provided the index
> scan has the ability to change its mind when (for whatever reason) the
> guidance from the optimizer turns out to be wrong.
>

+1 (assuming it's feasible, given the amount of available information)

>> As usual, I wrote a bash script to do a bit of stress testing. It
>> generates tables with random data, and then runs random queries with
>> random predicates on them, while mutating a couple parameters (like
>> number of workers) to trigger different plans. It does that on 16,
>> master and with the skipscan patch (with the fix for parallel scans).
>
> I wonder if some of the regressions you see can be tied to the use of
> an LWLock in place of the existing use of a spin lock. I did that
> because I sometimes need to allocate memory to deserialize the array
> keys, with the exclusive lock held. It might be the case that a lot of
> these regressions are tied to that, or something else that is far from
> obvious...have to investigate.
>
> In general, I haven't done much on parallel index scans here (I only
> added support for them very recently), whereas your testing places a
> lot of emphasis on parallel scans. Nothing wrong with that emphasis
> (it caught that 17 bug), but just want to put it in context.
>

Sure. With this kind of testing I don't know what I'm looking for, so I
try to cover a very wide range of cases. Inevitably, some of the cases
will not test the exact subject of the patch. I think it's fine.

>> I've uploaded the script and results from the last run here:
>>
>> https://github.com/tvondra/pg-skip-scan-tests
>>
>> There's the "run-mdam.sh" script that generates tables/queries, runs
>> them, collects all kinds of info about the query, and produces files
>> with explain plans, CSV with timings, etc.
>
> It'll take me a while to investigate all this data.
>

I think it'd help if I go through the results and try to prepare some
reproducers, to make it easier for you. After all, it's my script and
you'd have to reverse engineer some of it.

>> Anyway, I ran a couple thousand such queries, and I haven't found any
>> incorrect results (the script compares that between versions too). So
>> that's good ;-)
>
> That's good!
>
>> But my main goal was to see how this affects performance. The tables
>> were pretty small (just 1M rows, maybe ~80MB), but with restarts and
>> dropping caches, large enough to test this.
>
> The really compelling cases all tend to involve fairly selective index
> scans. Obviously, skip scan can only save work by navigating the index
> structure more efficiently (unlike loose index scan). So if the main
> cost is inherently bound to be the cost of heap accesses, then we
> shouldn't expect a big speed up.
>
>> For example, one of the slowed down queries is query 702 (top of page 8
>> in the PDF). The query is pretty simple:
>>
>> explain (analyze, timing off, buffers off)
>> select id1,id2 from t_1000000_1000_1_2
>> where NOT (id1 in (:list)) AND (id2 = :value);
>>
>> and it was executed on a table with random data in two columns, each
>> with 1000 distinct values. This is perfectly random data, so a great
>> match for the assumptions in costing etc.
>>
>> But with uncached data, this runs in ~50 ms on master, but takes almost
>> 200 ms with skipscan (these timings are from my laptop, but similar to
>> the results).
>
> I'll need to investigate this specifically. That does seem odd.
>
> FWIW, it's a pity that the patch doesn't know how to push down the NOT
> IN () here. The MDAM paper contemplates such a scheme. We see the use
> of filter quals here, when in principle this could work by using a
> skip array that doesn't generate elements that appear in the NOT IN()
> list (it'd generate every possible indexable value *except* the given
> list/array values). The only reason that I haven't implemented this
> yet is because I'm not at all sure how to make it work on the
> optimizer side. The nbtree side of the implementation will probably be
> quite straightforward, since it's really just a slight variant of a
> skip array, that excludes certain values.
>
>> -- with skipscan
>> Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
>> t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
>> (actual rows=811 loops=1)
>> Index Cond: (id2 = 997)
>> Index Searches: 1007
>> Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
>> Rows Removed by Filter: 163
>> Heap Fetches: 0
>> Planning Time: 3.730 ms
>> Execution Time: 238.554 ms
>> (8 rows)
>>
>> I haven't looked into why this is happening, but this seems like a
>> pretty good match for skipscan (on the first column). And for the
>> costing too - it's perfectly random data, no correlation, etc.
>
> I wonder what "Buffers: N" shows? That's usually the first thing I
> look at (that and "Index Searches", which looks like what you said it
> should look like here). But, yeah, let me get back to you on this.
>

Yeah, I forgot to get that from my reproducer. But the logs in the
github repo with results have BUFFERS - for master (SEQ 12621), the plan
looks like this:

Index Only Scan using t_1000000_1000_1_2_id1_id2_idx
on t_1000000_1000_1_2
(cost=0.96..12179.41 rows=785 width=16)
(actual rows=785 loops=1)
Index Cond: (id2 = 997)
Filter: (id1 <> ALL ('{983, ..., 640}'::bigint[]))
Rows Removed by Filter: 181
Heap Fetches: 0
Buffers: shared read=3094
Planning:
Buffers: shared hit=93 read=27
Planning Time: 9.962 ms
Execution Time: 38.007 ms
(10 rows)

and with the patch (SEQ 12623) it's this:

Index Only Scan using t_1000000_1000_1_2_id1_id2_idx
on t_1000000_1000_1_2
(cost=0.96..1745.27 rows=784 width=16)
(actual rows=785 loops=1)
Index Cond: (id2 = 997)
Index Searches: 1002
Filter: (id1 <> ALL ('{983, ..., 640}'::bigint[]))
Rows Removed by Filter: 181
Heap Fetches: 0
Buffers: shared hit=1993 read=1029
Planning:
Buffers: shared hit=93 read=27
Planning Time: 9.506 ms
Execution Time: 179.048 ms
(11 rows)

This is on exactly the same data, after dropping caches and restarting
the instance. So there should be no caching effects. Yet, there's a
pretty clear difference - the total number of buffers is the same, but
the patched version has many more hits. Yet it's slower. Weird, right?

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-18 18:52:35
Message-ID: CAH2-WznKyHq_W7heu87z80EHyZepQeWbGuAZebcxZHvOXWCU-w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Makes sense. I started with the testing before even looking at
> the code, so it's mostly a "black box" approach. I did read the 1995
> paper before that, and the script generates queries with clauses
> inspired by that paper, in particular:

I think that this approach with black box testing is helpful, but also
something to refine over time. Gray box testing might work best.

> OK, understood. If it's essentially an independent issue (perhaps even
> counts as a bug?) what about correcting it on master first? Doesn't
> sound like something we'd backpatch, I guess.

What about backpatching it to 17?

As things stand, you can get quite contradictory counts of the number
of index scans due to irrelevant implementation details from parallel
index scan. It just looks wrong, particularly on 17, where it is
reasonable to expect near exact consistency between parallel and
serial scans of the same index.

> Seems like a bit of a mess. IMHO we should either divide everything by
> nloops (so that everything is "per loop"), or not divide anything. My
> vote would be to divide, but that's mostly my "learned assumption" from
> the other fields. But having a 50:50 split is confusing for everyone.

My idea was that it made most sense to follow the example of
"Buffers:", since both describe physical costs.

Honestly, I'm more than ready to take whatever the path of least
resistance is. If dividing by nloops is what people want, I have no
objections.

> > It's worth having skip support (the idea comes from the MDAM paper),
> > but it's not essential. Whether or not an opclass has skip support
> > isn't accounted for by the cost model, but I doubt that it's worth
> > addressing (the cost model is already pessimistic).
> >
>
> I admit I'm a bit confused. I probably need to reread the paper, but my
> impression was that the increment/decrement is required for skipscan to
> work. If we can't do that, how would it generate the intermediate values
> to search for? I imagine it would be possible to "step through" the
> index, but I thought the point of skip scan is to not do that.

I think that you're probably still a bit confused because the
terminology in this area is a little confusing. There are two ways of
explaining the situation with types like text and numeric (types that
lack skip support). The two explanations might seem to be
contradictory, but they're really not, if you think about it.

The first way of explaining it, which focuses on how the scan moves
through the index:

For a text index column "a", and an int index column "b", skip scan
will work like this for a query with a qual "WHERE b = 55":

1. Find the first/lowest sorting "a" value in the index. Let's say
that it's "Aardvark".

2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

3. Find the next value after "Aardvark" in the index using a probe
like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say
that it turns out to be "Abacus".

4. Look for matches "WHERE a = 'Abacus' and b = 55"...

... (repeat these steps until we've exhaustively processed every
existing "a" value in the index)...

The second way of explaining it, which focuses on how the skip arrays
advance. Same query (and really the same behavior) as in the first
explanation:

1. Skip array's initial value is the sentinel -inf, which cannot
possibly match any real index tuple, but can still guide the search.
So we search for tuples "WHERE a = -inf AND b = 55" (actually we don't
include the "b = 55" part, since it is unnecessary, but conceptually
it's a part of what we search for within _bt_first).

2. Find that the index has no "a" values matching -inf (it inevitably
cannot have any matches for -inf), but we do locate the next highest
match. The closest matching value is "Aardvark". The skip array on "a"
therefore advances from -inf to "Aardvark".

3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b =
55", which will cause us to advance the array on "a" incrementally
inside _bt_advance_array_keys (just like it would if there was a
standard SAOP array on "a" instead). The skip array on "a" therefore
advances from "Aardvark" to "Aardvark" +infinitesimal (we need to use
sentinel values for this text column, which lacks skip support).

5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55",
which cannot possibly find matches, but, again, can reposition the
scan as needed. We can't find an exact match, of course, but we do
locate the next closest match -- which is "Abacus", again. So the skip
array now advances from "Aardvark" +infinitesimal to "Abacus". The
sentinel values are made up values, but that doesn't change anything.
(And, again, we don't include the "b = 55" part here, for the same
reason as before.)

6. Look for matches "WHERE a = 'Abacus' and b = 55"...

...(repeat these steps as many times as required)...

In summary:

Even index columns that lack skip support get to "increment" (or
"decrement") their arrays by using sentinel values that represent -inf
(or +inf for backwards scans), as well as sentinels that represent
concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal
for backwards scans, say). This scheme sounds contradictory, because
in one sense it allows every skip array to be incremented, but in
another sense it makes it okay that we don't have a type-specific way
to increment values for many individual types/opclasses.

Inventing these sentinel values allows _bt_advance_array_keys to reason about
arrays without really having to care about which kinds of arrays are
involved, their order relative to each other, etc. In a certain sense,
we don't really need explicit "next key" probes of the kind that the
MDAM paper contemplates, though we do still require the same index
accesses as a design with explicit accesses.

Does that make sense?

Obviously, if we did add skip support for text, it would be very
unlikely to help performance. Sure, one can imagine incrementing from
"Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but
that isn't very helpful. You're almost certain to end up accessing the
same pages with such a scheme, anyway. What are the chances of an
index with a leading text column actually containing tuples matching
(say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically
zero. Whereas if the column "a" happens to use a discrete type such as
integer or date, then skip support is likely to help: there's a decent
chance that a value generated by incrementing the last value
(and I mean incrementing it for real) will find a real match when
combined with the user-supplied "b" predicate.

It might be possible to add skip support for text, but there wouldn't
be much point.
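
To make that concrete (a sketch with made-up values): once the scan
exhausts the matches for "WHERE a = 41 AND b = 55" on an integer
column whose opclass has skip support, the next array element is
derived directly, so the next probe behaves like

select * from t where a = 42 and b = 55;  -- 42 = 41 + 1, real increment

with a decent chance of finding real matches. Without skip support,
the probe would use the sentinel "a = 41 +infinitesimal" instead, and
the next real "a" value would have to come from the index itself.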

> Anyway, probably a good idea for extending the stress testing script.
> Right now it tests with "bigint" columns only.

Good idea.

> Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming
> no correlation means pessimistic skipscan costing) in a comment before
> btcostestimate, or somewhere close.

Will do.

> Don't we do something similar elsewhere? For example, IIRC we do some
> adjustments when estimating grouping in estimate_num_groups(), and
> incremental sort had to deal with something similar too. Maybe we could
> learn something from those places ... (both from the good and bad
> experiences).

I'll make a note of that. Gonna focus on regressions for now.

> Right. I don't think I've been suggesting having a separate path, I 100%
> agree it's better to have this as an option for index scan paths.

Cool.

> Sure. With this kind of testing I don't know what I'm looking for, so I
> try to cover a very wide range of cases. Inevitably, some of the cases
> will not test the exact subject of the patch. I think it's fine.

I agree. Just wanted to make sure that we were on the same page.

> I think it'd help if I go through the results and try to prepare some
> reproducers, to make it easier for you. After all, it's my script and
> you'd have to reverse engineer some of it.

Yes, that would be helpful.

I'll probably memorialize the problem by writing my own minimal test
case for it. I'm using the same TDD approach for this project as was
used for the related Postgres 17 project.

> This is on exactly the same data, after dropping caches and restarting
> the instance. So there should be no caching effects. Yet, there's a
> pretty clear difference - the total number of buffers is the same, but
> the patched version has many more hits. Yet it's slower. Weird, right?

Yes, it's weird. It seems likely that you've found an unambiguous bug,
not just a "regular" performance regression. The regressions that I
already know about aren't nearly this bad. So it seems like you have
the right general idea about what to expect, and it seems like your
approach to testing the patch is effective.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-19 19:22:46
Message-ID: CAH2-Wz=iuqL1KbEwsUKmB1ucByTjRboLxqtiG+OtDa9RHud4ZQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> For example, one of the slowed down queries is query 702 (top of page 8
> in the PDF). The query is pretty simple:
>
> explain (analyze, timing off, buffers off)
> select id1,id2 from t_1000000_1000_1_2
> where NOT (id1 in (:list)) AND (id2 = :value);
>
> and it was executed on a table with random data in two columns, each
> with 1000 distinct values.

I cannot recreate this problem using the q702.sql repro you provided.
Feels like I'm missing a step, because I find that skip scan wins
nicely here.

> This is perfectly random data, so a great
> match for the assumptions in costing etc.

FWIW, I wouldn't say that this is a particularly sympathetic case for
skip scan. It's definitely still a win, but less than other cases I
can imagine. This is due to the relatively large number of rows
returned by the scan. Plus 1000 distinct leading values for a skip
array isn't all that low, so we end up scanning over 1/3 of all of the
leaf pages in the index.

BTW, be careful to distinguish between leaf pages and internal pages
when interpreting "Buffers:" output with the patch. Generally
speaking, the patch repeats many internal page accesses, which needs
to be taken into account when comparing "Buffers:" counts against
master. It's not uncommon for 3/4 or even 4/5 of all index page hits
to be for internal pages with the patch. Whereas on master the number
of internal page hits is usually tiny. This is one reason why the
additional context provided by "Index Searches:" can be helpful.
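
If you want to see the leaf/internal breakdown of an index directly,
contrib/pgstattuple can show it (illustrative):

create extension if not exists pgstattuple;
select internal_pages, leaf_pages
from pgstatindex('t_1000000_1000_1_2_id1_id2_idx');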

> But with uncached data, this runs in ~50 ms on master, but takes almost
> 200 ms with skipscan (these timings are from my laptop, but similar to
> the results).

Even 50ms seems really slow for your test case -- with or without my
patch applied.

Are you sure that this wasn't an assert-enabled build? There's lots of
extra assertions for the code paths used by skip scan for this, which
could explain the apparent regression.

I find that this same query takes only ~2.056 ms with the patch. When
I disabled skip scan locally via "set skipscan_prefix_cols = 0" (which
should give me behavior that's pretty well representative of master),
it takes ~12.039 ms. That's exactly what I'd expect for this query: a
solid improvement, though not the really enormous ones that you'll see
when skip scan is able to avoid reading many of the index pages that
master reads.
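
In other words, my comparison amounted to this (skipscan_prefix_cols
is a temporary testing GUC; the NOT IN list is abbreviated):

set skipscan_prefix_cols = 0;  -- approximates master's behavior
explain (analyze, buffers)
select id1, id2 from t_1000000_1000_1_2
where not (id1 in (983, 640)) and (id2 = 997);
reset skipscan_prefix_cols;    -- skip scan enabled again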

--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-20 13:45:25
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 9/19/24 21:22, Peter Geoghegan wrote:
> On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> For example, one of the slowed down queries is query 702 (top of page 8
>> in the PDF). The query is pretty simple:
>>
>> explain (analyze, timing off, buffers off)
>> select id1,id2 from t_1000000_1000_1_2
>> where NOT (id1 in (:list)) AND (id2 = :value);
>>
>> and it was executed on a table with random data in two columns, each
>> with 1000 distinct values.
>
> I cannot recreate this problem using the q702.sql repro you provided.
> Feels like I'm missing a step, because I find that skip scan wins
> nicely here.
>

I don't know, I can reproduce it just fine. I just tried with v7.

What I do is this:

1) build master and patched versions:

./configure --enable-depend --prefix=/mnt/data/builds/${build}/
make -s clean
make -s -j4 install

2) create a new cluster (default config), create DB, generate the data

3) restart cluster, drop caches

4) run the query from the SQL script

I suspect you don't do (3). I didn't mention this explicitly, my message
only said "with uncached data", so maybe that's the problem?

>> This is perfectly random data, so a great
>> match for the assumptions in costing etc.
>
> FWIW, I wouldn't say that this is a particularly sympathetic case for
> skip scan. It's definitely still a win, but less than other cases I
> can imagine. This is due to the relatively large number of rows
> returned by the scan. Plus 1000 distinct leading values for a skip
> array isn't all that low, so we end up scanning over 1/3 of all of the
> leaf pages in the index.
>

I wasn't suggesting it's a sympathetic case for skipscan. My point is
that it perfectly matches the costing assumptions, i.e. columns are
independent etc. But if it's not sympathetic, maybe the cost shouldn't
be 1/5 of the cost from master?

> BTW, be careful to distinguish between leaf pages and internal pages
> when interpreting "Buffers:" output with the patch. Generally
> speaking, the patch repeats many internal page accesses, which needs
> to be taken into account when comparing "Buffers:" counts against
> master. It's not uncommon for 3/4 or even 4/5 of all index page hits
> to be for internal pages with the patch. Whereas on master the number
> of internal page hits is usually tiny. This is one reason why the
> additional context provided by "Index Searches:" can be helpful.
>

Yeah, I recall there's an issue with that.

>> But with uncached data, this runs in ~50 ms on master, but takes almost
>> 200 ms with skipscan (these timings are from my laptop, but similar to
>> the results).
>
> Even 50ms seems really slow for your test case -- with or without my
> patch applied.
>
> Are you sure that this wasn't an assert-enabled build? There's lots of
> extra assertions for the code paths used by skip scan for this, which
> could explain the apparent regression.
>
> I find that this same query takes only ~2.056 ms with the patch. When
> I disabled skip scan locally via "set skipscan_prefix_cols = 0" (which
> should give me behavior that's pretty well representative of master),
> it takes ~12.039 ms. That's exactly what I'd expect for this query: a
> solid improvement, though not the really enormous ones that you'll see
> when skip scan is able to avoid reading many of the index pages that
> master reads.
>

I'm pretty sure you're doing this on cached data, because 2ms is exactly
the timing I see in that case.

regards

--
Tomas Vondra


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-20 14:07:41
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 9/18/24 20:52, Peter Geoghegan wrote:
> On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> Makes sense. I started with the testing before even looking at
>> the code, so it's mostly a "black box" approach. I did read the 1995
>> paper before that, and the script generates queries with clauses
>> inspired by that paper, in particular:
>
> I think that this approach with black box testing is helpful, but also
> something to refine over time. Gray box testing might work best.
>
>> OK, understood. If it's essentially an independent issue (perhaps even
>> counts as a bug?) what about correcting it on master first? Doesn't
>> sound like something we'd backpatch, I guess.
>
> What about backpatching it to 17?
>
> As things stand, you can get quite contradictory counts of the number
> of index scans due to irrelevant implementation details from parallel
> index scan. It just looks wrong, particularly on 17, where it is
> reasonable to expect near exact consistency between parallel and
> serial scans of the same index.
>

Yes, I think backpatching to 17 would be fine. I'd be worried about
maybe disrupting some monitoring in production systems, but for 17 that
shouldn't be a problem yet. So fine with me.

FWIW I wonder how likely it is that someone has some sort of alerting
tied to this counter. I'd bet few people do. It's probably more about a
couple people looking at explain plans, but they'll be confused even if
we change that only starting with 17.

>> Seems like a bit of a mess. IMHO we should either divide everything by
>> nloops (so that everything is "per loop"), or not divide anything. My
>> vote would be to divide, but that's mostly my "learned assumption" from
>> the other fields. But having a 50:50 split is confusing for everyone.
>
> My idea was that it made most sense to follow the example of
> "Buffers:", since both describe physical costs.
>
> Honestly, I'm more than ready to take whatever the path of least
> resistance is. If dividing by nloops is what people want, I have no
> objections.
>

I don't have a strong opinion on this. I just know I'd be confused by
half the counters being total and half /loop, but chances are other
people would disagree.

>>> It's worth having skip support (the idea comes from the MDAM paper),
>>> but it's not essential. Whether or not an opclass has skip support
>>> isn't accounted for by the cost model, but I doubt that it's worth
>>> addressing (the cost model is already pessimistic).
>>>
>>
>> I admit I'm a bit confused. I probably need to reread the paper, but my
>> impression was that the increment/decrement is required for skipscan to
>> work. If we can't do that, how would it generate the intermediate values
>> to search for? I imagine it would be possible to "step through" the
>> index, but I thought the point of skip scan is to not do that.
>
> I think that you're probably still a bit confused because the
> terminology in this area is a little confusing. There are two ways of
> explaining the situation with types like text and numeric (types that
> lack skip support). The two explanations might seem to be
> contradictory, but they're really not, if you think about it.
>
> The first way of explaining it, which focuses on how the scan moves
> through the index:
>
> For a text index column "a", and an int index column "b", skip scan
> will work like this for a query with a qual "WHERE b = 55":
>
> 1. Find the first/lowest sorting "a" value in the index. Let's say
> that it's "Aardvark".
>
> 2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
> returning some matches.
>
> 3. Find the next value after "Aardvark" in the index using a probe
> like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say
> that it turns out to be "Abacus".
>
> 4. Look for matches "WHERE a = 'Abacus' and b = 55"...
>
> ... (repeat these steps until we've exhaustively processed every
> existing "a" value in the index)...

Ah, OK. So we do probe the index like this. I was under the impression
we don't do that. But yeah, this makes sense.

>
> The second way of explaining it, which focuses on how the skip arrays
> advance. Same query (and really the same behavior) as in the first
> explanation:
>
> 1. Skip array's initial value is the sentinel -inf, which cannot
> possibly match any real index tuple, but can still guide the search.
> So we search for tuples "WHERE a = -inf AND b = 55" (actually we don't
> include the "b = 55" part, since it is unnecessary, but conceptually
> it's a part of what we search for within _bt_first).
>
> 2. Find that the index has no "a" values matching -inf (it inevitably
> cannot have any matches for -inf), but we do locate the next highest
> match. The closest matching value is "Aardvark". The skip array on "a"
> therefore advances from -inf to "Aardvark".
>
> 3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
> returning some matches.
>
> 4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b =
> 55", which will cause us to advance the array on "a" incrementally
> inside _bt_advance_array_keys (just like it would if there was a
> standard SAOP array on "a" instead). The skip array on "a" therefore
> advances from "Aardvark" to "Aardvark" +infinitesimal (we need to use
> sentinel values for this text column, which lacks skip support).
>
> 5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55",
> which cannot possibly find matches, but, again, can reposition the
> scan as needed. We can't find an exact match, of course, but we do
> locate the next closest match -- which is "Abacus", again. So the skip
> array now advances from "Aardvark" +infinitesimal to "Abacus". The
> sentinel values are made up values, but that doesn't change anything.
> (And, again, we don't include the "b = 55" part here, for the same
> reason as before.)
>
> 6. Look for matches "WHERE a = 'Abacus' and b = 55"...
>
> ...(repeat these steps as many times as required)...
>

Yeah, this makes more sense. Thanks.

> In summary:
>
> Even index columns that lack skip support get to "increment" (or
> "decrement") their arrays by using sentinel values that represent -inf
> (or +inf for backwards scans), as well as sentinels that represent
> concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal
> for backwards scans, say). This scheme sounds contradictory, because
> in one sense it allows every skip array to be incremented, but in
> another sense it makes it okay that we don't have a type-specific way
> to increment values for many individual types/opclasses.
>
> Inventing these sentinel values allows _bt_advance_array_keys to reason about
> arrays without really having to care about which kinds of arrays are
> involved, their order relative to each other, etc. In a certain sense,
> we don't really need explicit "next key" probes of the kind that the
> MDAM paper contemplates, though we do still require the same index
> accesses as a design with explicit accesses.
>
> Does that make sense?
>

Yes, it does. Most of my confusion was caused by my belief that we can't
probe the index for the next value without "incrementing" the current
value, but that was a silly idea.

> Obviously, if we did add skip support for text, it would be very
> unlikely to help performance. Sure, one can imagine incrementing from
> "Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but
> that isn't very helpful. You're almost certain to end up accessing the
> same pages with such a scheme, anyway. What are the chances of an
> index with a leading text column actually containing tuples matching
> (say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically
> zero. Whereas if the column "a" happens to use a discrete type such as
> integer or date, then skip support is likely to help: there's a decent
> chance that a value generated by incrementing the last value
> (and I mean incrementing it for real) will find a real match when
> combined with the user-supplied "b" predicate.
>
> It might be possible to add skip support for text, but there wouldn't
> be much point.
>

Stupid question - so why does it make sense for types like int? There
can also be a lot of values between the current and the next value, so
why would that be very different from "incrementing" a text value?

>
>> I think it'd help if I go through the results and try to prepare some
>> reproducers, to make it easier for you. After all, it's my script and
>> you'd have to reverse engineer some of it.
>
> Yes, that would be helpful.
>
> I'll probably memorialize the problem by writing my own minimal test
> case for it. I'm using the same TDD approach for this project as was
> used for the related Postgres 17 project.
>

Sure. Still, giving you a reproducer should make it easier.

>> This is on exactly the same data, after dropping caches and restarting
>> the instance. So there should be no caching effects. Yet, there's a
>> pretty clear difference - the total number of buffers is the same, but
>> the patched version has many more hits. Yet it's slower. Weird, right?
>
> Yes, it's weird. It seems likely that you've found an unambiguous bug,
> not just a "regular" performance regression. The regressions that I
> already know about aren't nearly this bad. So it seems like you have
> the right general idea about what to expect, and it seems like your
> approach to testing the patch is effective.
>

Yeah, it's funny. It's not the first time I start stress testing a patch
only to stumble over some pre-existing issues ... ;-)

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-20 14:21:29
Message-ID: CAH2-WznjrqL3yFBwWWUER513AMdO3PSLWPfun_uK7m+xH4Cr3g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 20, 2024 at 9:45 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> 3) restart cluster, drop caches
>
> 4) run the query from the SQL script
>
> I suspect you don't do (3). I didn't mention this explicitly, my message
> only said "with uncached data", so maybe that's the problem?

You're right that I didn't do step 3 here. I'm generally in the habit
of using fully cached data when testing this kind of work.

The only explanation I can think of is that (at least on your
hardware) OS readahead helps the master branch more than skipping
helps the patch. That's surprising, but I guess it's possible here
because skip scan only needs to access about every third page. And
because this particular index was generated by CREATE INDEX, and so
happens to have a strong correlation between key space order and
physical block order. And probably because this is an index-only scan.

> I wasn't suggesting it's a sympathetic case for skipscan. My point is
> that it perfectly matches the costing assumptions, i.e. columns are
> independent etc. But if it's not sympathetic, maybe the cost shouldn't
> be 1/5 of cost from master?

The costing is pretty accurate if we assume cached data, though --
which is what the planner will actually assume. In any case, is that
really the only problem you see here? That the costing might be
inaccurate because it fails to account for some underlying effect,
such as the influence of OS readahead?

Let's assume for a moment that the regression is indeed due to
readahead effects, and that we deem it to be unacceptable. What can be
done about it? I have a really hard time thinking of a fix, since by
most conventional measures skip scan is indeed much faster here.

--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-20 14:42:37
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 9/20/24 16:21, Peter Geoghegan wrote:
> On Fri, Sep 20, 2024 at 9:45 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> 3) restart cluster, drop caches
>>
>> 4) run the query from the SQL script
>>
>> I suspect you don't do (3). I didn't mention this explicitly, my message
>> only said "with uncached data", so maybe that's the problem?
>
> You're right that I didn't do step 3 here. I'm generally in the habit
> of using fully cached data when testing this kind of work.
>
> The only explanation I can think of is that (at least on your
> hardware) OS readahead helps the master branch more than skipping
> helps the patch. That's surprising, but I guess it's possible here
> because skip scan only needs to access about every third page. And
> because this particular index was generated by CREATE INDEX, and so
> happens to have a strong correlation between key space order and
> physical block order. And probably because this is an index-only scan.
>

Good idea. Yes, it does seem to be due to readahead - if I disable that,
the query takes ~320ms on master and ~280ms with the patch.

>> I wasn't suggesting it's a sympathetic case for skipscan. My point is
>> that it perfectly matches the costing assumptions, i.e. columns are
>> independent etc. But if it's not sympathetic, maybe the cost shouldn't
>> be 1/5 of cost from master?
>
> The costing is pretty accurate if we assume cached data, though --
> which is what the planner will actually assume. In any case, is that
> really the only problem you see here? That the costing might be
> inaccurate because it fails to account for some underlying effect,
> such as the influence of OS readahead?
>
> Let's assume for a moment that the regression is indeed due to
> readahead effects, and that we deem it to be unacceptable. What can be
> done about it? I have a really hard time thinking of a fix, since by
> most conventional measures skip scan is indeed much faster here.
>

It does seem to be due to readahead, and the costing not accounting for
these effects. And I don't think it's unacceptable - I don't think we
consider readahead elsewhere, and it certainly is not something I'd
expect this patch to fix. So I think it's fine.

Ultimately, I think this should be "fixed" by explicitly prefetching
pages. My index prefetching patch won't really help, because AFAIK this
is about index pages. And I don't know how feasible prefetching those is.

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-20 14:59:41
Message-ID: CAH2-Wzm3FQDMGa61-8DRjM13O=TKsWbFDM2cOdBSK=1E=DcmhA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 20, 2024 at 10:07 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Yes, I think backpatching to 17 would be fine. I'd be worried about
> maybe disrupting some monitoring in production systems, but for 17 that
> shouldn't be a problem yet. So fine with me.

I'll commit minimal changes to _bt_first that at least make the
counters consistent, then. I'll do so soon.

> FWIW I wonder how likely is it that someone has some sort of alerting
> tied to this counter. I'd bet few people do. It's probably more about a
> couple people looking at explain plans, but they'll be confused even if
> we change that only starting with 17.

On 17 the behavior in this area is totally different, either way.

> Ah, OK. So we do probe the index like this. I was under the impression
> we don't do that. But yeah, this makes sense.

Well, we don't have *explicit* next-key probes. If you think of values
like "Aardvark" + infinitesimal as just another array value (albeit
one that requires a little special handling in _bt_first), then there
are no explicit probes. There are no true special cases required.

Maybe this sounds like a very academic point. I don't think that it
is, though. Bear in mind that even when _bt_first searches for index
tuples matching a value like "Aardvark" + infinitesimal, there's some
chance that _bt_search will return a leaf page with tuples that the
index scan ultimately returns. And so there really is no "separate
explicit probe" of the kind the MDAM paper contemplates.

When this happens, we won't get any exact matches for the sentinel
search value, but there could still be matches for (say) "WHERE a =
'Abacus' AND b = 55" on that same leaf page. In general, repositioning
the scan to later "within" the 'Abacus' index tuples might not be
required -- our initial position (based on the sentinel search key)
could be "close enough". This outcome is more likely to happen if the
query happened to be written "WHERE b = 1", rather than "WHERE b =
55".

> Yes, it does. Most of my confusion was caused by my belief that we can't
> probe the index for the next value without "incrementing" the current
> value, but that was a silly idea.

It's not a silly idea, I think. Technically that understanding is
fairly accurate -- we often *do* have to "increment" to get to the
next value (though reading the next value from an index tuple and then
repositioning using it with later/less significant scan keys is the
other possibility).

Incrementing is always possible, even without skip support, because we
can always fall back on +infinitesimal style sentinel values (AKA
SK_BT_NEXTPRIOR values). That's the definitional sleight-of-hand that
allows _bt_advance_array_keys to not have to think about skip arrays
as a special case, regardless of whether or not they happen to have
skip support.

> > It might be possible to add skip support for text, but there wouldn't
> > be much point.
> >
>
> Stupid question - so why does it make sense for types like int? There
> can also be a lot of values between the current and the next value, so
> why would that be very different from "incrementing" a text value?

Not a stupid question at all. You're right; it'd be the same.

Obviously, there are at least some workloads (probably most) where any
int columns will contain values that are more or less fully
contiguous. I also expect there to be some workloads where int columns
appear in B-Tree indexes that contain values with large gaps between
neighboring values (e.g., because the integers are hash values). We'll
always use skip support for any omitted prefix int column (same with
any opclass that offers skip support), but we can only expect to see a
benefit in the former "dense" cases -- never in the latter "sparse"
cases.

The MDAM paper talks about an adaptive strategy for dense columns and
sparse columns. I don't see any point in that, and assume that it's
down to some kind of implementation deficiencies in NonStop SQL back
in the 1990s. I can just always use skip support in the hope that
integer column data will turn out to be "dense", because there's no
downside to being optimistic about it. The access patterns are exactly
the same as they'd be with skip support disabled.
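
To illustrate the dense/sparse distinction (the tables and data here
are invented for illustration):

-- "dense": the leading column has 100 contiguous values, 0..99
CREATE TABLE dense_t AS
SELECT i % 100 AS a, i AS b FROM generate_series(1, 1000000) i;
CREATE INDEX ON dense_t (a, b);

-- "sparse": still 100 distinct values, but with gaps of 100000
CREATE TABLE sparse_t AS
SELECT (i % 100) * 100000 AS a, i AS b FROM generate_series(1, 1000000) i;
CREATE INDEX ON sparse_t (a, b);

-- both queries can use a skip scan; incrementing a = 5 to a = 6 finds
-- real tuples in dense_t, while in sparse_t the incremented value never
-- matches, and the scan repositions itself just as it would have anyway
SELECT * FROM dense_t WHERE b = 12345;
SELECT * FROM sparse_t WHERE b = 12345;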

My "academic point" about not having *explicit* next-key probes might
make more sense now. This is the thing that makes it okay to always be
optimistic about types with skip support containing "dense" data.

FWIW I actually have skip support for the UUID opclass. I implemented
it to have test coverage for pass-by-reference types in certain code
paths, but otherwise I don't expect it to be useful -- in practice all
UUID columns contain "sparse" data. There's still no real downside to
it, though. (I wouldn't try to do it with text because it'd be much
harder to implement skip support correctly, especially with collated
text.)

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-20 18:06:41
Message-ID: CAH2-WzkO1o37W9=YZzx2x3SQ0=xVSJ_HW3vuo0Zzoiz+DV6wUw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 20, 2024 at 10:59 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Sep 20, 2024 at 10:07 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > Yes, I think backpatching to 17 would be fine. I'd be worried about
> > maybe disrupting some monitoring in production systems, but for 17 that
> > shouldn't be a problem yet. So fine with me.
>
> I'll commit minimal changes to _bt_first that at least make the
> counters consistent, then. I'll do so soon.

Pushed, thanks

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-21 17:44:53
Message-ID: CAH2-Wz=L=zHTxJLAowj37r0GWcagEGoqV2DZ_q=BirA-oq7iiQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >> 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch
> >>
> >> - nothing
> >
> > Great. I think that I should be able to commit this one soon, since
> > it's independently useful work.
> >
>
> +1

I pushed this just now. There was one small change: I decided that it
made more sense to repalloc() in the case where skip scan must enlarge
the so.keyData[] space, rather than doing the so.keyData[] allocation
lazily in all cases. I was concerned that my original approach might
regress nested loop joins with very fast inner index scans.

Attached is v8. I still haven't worked through any of your feedback,
Tomas. Again, this revision is just to keep CFBot happy by fixing the
bit rot on the master branch created by my recent commits.

--
Peter Geoghegan

Attachment Content-Type Size
v8-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 45.1 KB
v8-0003-Add-skip-scan-to-nbtree.patch application/octet-stream 154.7 KB
v8-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch application/octet-stream 12.9 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-25 19:08:21
Message-ID: CAH2-Wznvjd6EE6JRiMraY1r=UbnkYi96Cy6T7Uy8AEhLy6VnXg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 21, 2024 at 1:44 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v8. I still haven't worked through any of your feedback,
> Tomas. Again, this revision is just to keep CFBot happy by fixing the
> bit rot on the master branch created by my recent commits.

Attached is v9.

I think that v9-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
is close to committable. It's basically independent work, which would
be nice to get out of the way soon.

Highlights for v9:

* Fixed a bug affecting scans that use scrollable cursors: v9 splits
the previous SK_BT_NEXTPRIOR sentinel scan key flag into separate
SK_BT_NEXT and SK_BT_PRIOR flags, so it's no longer possible to
confuse "foo"+infinitesimal with "foo"-infinitesimal when the scan's
direction changes at exactly the wrong time.

* Worked through all of Tomas' feedback.

In more detail:

- v9-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch has been
taught to divide the total number of index searches by nloop as
required (e.g., for nested loop joins), per Tomas. This doesn't make
much difference, either way, so if that's what people want I'm happy
to oblige.

- Separately, the same EXPLAIN ANALYZE patch now shows "Index
Searches: 0" in cases where the scan node is never executed. (This was
already possible in cases where the scan node was executed, only for
_bt_preprocess_keys to determine that the scan's qual was
contradictory.)

- Various small updates to comments/symbol names, again based on
feedback from Tomas.

- Various small updates to the user sgml docs, again based on feedback
from Tomas.

- I've polished the commit messages for all 3 patches, particularly
the big one (v9-0003-Add-skip-scan-to-nbtree.patch).

- I haven't done anything about fixing any of the known regressions in
v9. I'm not aware that Tomas expects me to fix any regressions
highlighted by his recent testing (only regressions that I've been
aware of before Tomas became involved). Tomas should correct me if I
have that wrong, though.

Obviously, the #1 open item right now remains fixing the known
regressions in cases where skip scan should be attempted, but cannot
possibly help.

Thanks
--
Peter Geoghegan

Attachment Content-Type Size
v9-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 51.9 KB
v9-0003-Add-skip-scan-to-nbtree.patch application/octet-stream 158.7 KB
v9-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch application/octet-stream 13.2 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-10-16 17:14:29
Message-ID: CAH2-Wzn5xN_7XYFhdphxsZF7B1viKnam=dcPuW-ixA_HEJEe_w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 25, 2024 at 3:08 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v9.
>
> I think that v9-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
> is close to committable. It's basically independent work, which would
> be nice to get out of the way soon.

Committed. Thanks for the review, Tomas.

Attached is v10, which is another revision that's intended only to fix
bit rot against the master branch. There are no notable changes to
report.

--
Peter Geoghegan

Attachment Content-Type Size
v10-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 51.9 KB
v10-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 162.6 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-10-18 16:17:08
Message-ID: CAH2-Wzn5a3p7ddGkmGqKTU7ocwEmMFQb_iB662j7ycRA2OhfMw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 16, 2024 at 1:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v10, which is another revision that's intended only to fix
> bit rot against the master branch. There are no notable changes to
> report.

Attached is v11, which is yet another revision whose sole purpose is
to fix bit rot/make the patch apply cleanly against the master
branch's tip.

This time the bit rot is caused by independent nbtree work, on
backwards scans. Nothing interesting here compared to v10 (or to v9).

--
Peter Geoghegan

Attachment Content-Type Size
v11-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/x-patch 52.0 KB
v11-0002-Add-skip-scan-to-nbtree.patch application/x-patch 162.8 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-10-28 16:38:24
Message-ID: CAH2-Wzke8ptWUsFYksiisa8sT44cZ_U5ZpuA-39abNDREm5Z0g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 18, 2024 at 12:17 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v11, which is yet another revision whose sole purpose is
> to fix bit rot/make the patch apply cleanly against the master
> branch's tip.

Attached is v12, which is yet another revision required only so that
the patch's latest version applies cleanly on top of master. (This
time it was the IWYU bulk header removals that caused the patch to bit
rot.)

--
Peter Geoghegan

Attachment Content-Type Size
v12-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/x-patch 52.0 KB
v12-0002-Add-skip-scan-to-nbtree.patch application/x-patch 162.3 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-04 16:41:33
Message-ID: CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 28, 2024 at 12:38 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v12, which is yet another revision required only so that
> the patch's latest version applies cleanly on top of master.

Attached is v13, which is the first revision posted that has new
functional changes in quite some time (this isn't just for fixing
bitrot).

The notable change for v13 is the addition of preprocessing that
adjusts skip array attributes > and < operators. These are used for
initial position purposes by _bt_first -- the purpose of this new
preprocessing/transformation is to minimize the number of descents of
the index in certain corner cases involving > and < operators, by
converting them into similar >= and <= operators during preprocessing
where possible. This transformation relies on the opclass skip support
routine, and is strictly optional.

Imagine a qual like "WHERE a > 5 AND b = 3". This will use a range
style skip array on "a", and a regular required scan key on "b". The
new transformation makes the final scan keys output by preprocessing
match what you'd get if the qual had been written "WHERE a >= 6 AND b
= 3". This leads to a modest reduction in the number of descents
required in some cases. It's possible that a naive "a > 5" initial
descent within _bt_first would have wastefully landed on an earlier
leaf page. Whereas now, with v13, our more discriminating "a >= 6 AND
b = 3" initial _bt_first descent will manage to land directly on the
true first page (the inclusion of "b = 3" in the first _bt_first call
makes this possible).
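
In SQL terms (tbl here is a hypothetical table with a composite index
on integer columns "a" and "b"), the two quals below now get the same
initial positioning behavior within _bt_first:

-- as written: a naive initial descent could land on an earlier leaf page
SELECT * FROM tbl WHERE a > 5 AND b = 3;
-- v13 preprocessing makes the initial descent behave as if written:
SELECT * FROM tbl WHERE a >= 6 AND b = 3;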

Obviously, the new transformation won't help most individual calls to
_bt_first when a skip array (with a low_compare inequality) is used,
since most individual calls will already have a real skip array
element value (not just the sentinel value -inf or +inf) to work off
of. This is not an essential optimization (it only optimizes the first
_bt_first call for the whole skip scan), but it seems worth having.
Very early versions of the patch had this same optimization, though
that was implemented in a way that wasn't compatible with cross-type
operators. The approach taken in v13 has no such problems, and does
things in a way that's very well targeted -- there's no impact on any
of the other preprocessing steps.

--
Peter Geoghegan

Attachment Content-Type Size
v13-0002-Add-skip-scan-to-nbtree.patch application/x-patch 170.3 KB
v13-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/x-patch 52.0 KB

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-04 21:57:58
Message-ID: CAEze2Wj05buX8KFT1DyS24N2BdP6o2s989LdWHF66H6dAW8rOg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 18 Oct 2024 at 18:17, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Wed, Oct 16, 2024 at 1:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Attached is v10, which is another revision that's intended only to fix
> > bit rot against the master branch. There are no notable changes to
> > report.
>
> Attached is v11, which is yet another revision whose sole purpose is
> to fix bit rot/make the patch apply cleanly against the master
> branch's tip.

This is a review on v11, not the latest v13. I suspect most comments
still apply, but I haven't verified this.

Re: design

I'm a bit concerned about the additional operations that are being
added to the scan. Before this patch, the amount of work in the
"horizontal" portion of the scan was limited to user-supplied
scankeys, so O(1) even when the index condition is only (f < 7). But,
with this patch, we're adding work for (a=, b=, c=, etc.) for every
tuple in the scan.
As these new "skip array" keys are primarily useful for inter-page
coordination (by determining if we want to start a primitive scan to
skip to a different page and which value range that primitive scan
would search for, or continue on to the next sibling), can't we only
apply the "skip array" portion of the code at the final tuple we
access on this page?

> +++ b/doc/src/sgml/indices.sgml
[...]
> + leading prefix column, but this is usually much less efficient than a scan
> + of an index without the extra prefix column. See <xref

I think this last part is a bit clearer about what should go without
the prefix column when formulated as follows:

[...], but this is usually much less efficient than scanning an index
without the extra prefix column.

> - would be almost useless for queries involving only <literal>y</literal>, so it
> - should not be the only index. A combination of the multicolumn index
> - and a separate index on <literal>y</literal> would serve reasonably well. For
> + would be less useful for queries involving only <literal>y</literal>. Just
> + how useful might depend on how effective the B-Tree index skip scan
> + optimization is; if <literal>x</literal> has no more than several hundred

While this section already defines some things about index scans which
seem btree-specific, I don't think we should add more references to
btree scan internals in a section about bitmaps and bitmap index
scans. While presumably btree is the most commonly used index type,
I'm not sure if we should just assume that's the only one that does
efficient non-prefix searches. GIN, for example, is quite efficient
for searches on non-primary columns, and BRIN's performance also
generally doesn't care about which column of the index is searched.

> +++ b/src/backend/access/nbtree/nbtree.c
[...]
> - slock_t btps_mutex; /* protects above variables, btps_arrElems */
> + LWLock btps_lock; /* protects above variables, btps_arrElems */

Why is this changed to LWLock, when it's only ever acquired exclusively?

> +btestimateparallelscan(Relation rel, int nkeys, int norderbys)

I notice you're using DatumSerialize. Are there reasons why we
wouldn't want to use heap_fill_tuple, which generally produces much
more compact output?

Also, I think you can limit the space usage to BLCKSZ in total,
because a full index tuple can't be larger than 1/3rd of a block; and
for skip scans we'll only have known equality bounds for a prefix of
attributes available in the index tuples, and a single (?)
index-produced dynamic attribute we want to skip ahead of. So, IIUC,
at most we'll have 2 index tuples' worth of data, or 2/3 BLCKSZ.
Right?
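(With the default 8kB BLCKSZ, that works out to at most 2 x ~2730
bytes, i.e. roughly 5.3kB.)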

> +++ b/src/backend/access/nbtree/nbtsearch.c
[...]
> + * Skip scan works by having _bt_preprocess_keys cons up = boundary keys

I needed to look up what this 'cons up' thing is, as it wasn't
something that I'd seen before. It also seems used exclusively in
btree code, and only after the array keys patch, so I think it'd be
better in general to use 'construct' here.

> +++ b/src/backend/access/nbtree/nbtcompare.c

The changes here are essentially 6x the same code, but for different
types. What do you think about the attached
0001-Deduplicate[...].patch.txt, which has the same effect but with
only 1 copy of the code checked in?

> +++b/src/backend/access/nbtree/nbtutils.c
[...]
> +_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)

Why does this stop processing keys after hitting a row compare?
Doesn't skip scan still apply to any subsequent normal keys? E.g.
"c=1" creates a scan "a=skip, b=skip, c=1", so "(a, b)>(1, 2), c=1"
should IMO still allow a skip scan for a=skip, b=1 to be constructed -
it shouldn't be that we get a much less specific (and potentially,
performant) scan just by adding a rowcompare scankey on early
attributes.

> _bt_preprocess_array_keys
> - output_ikey = 0;
> + numArrayKeyData,
> + numSkipArrayKeys;

I don't think numArrayKeyData/arrayKeyData are good names here, as it
confused me many times reviewing this function's changes. On a scankey
of a=1,b=2 we won't have any array keys, yet this variable is set to
2. Something like numOutputKeys is probably more accurate.

> + /* Create a skip array and scan key where indicated by skipatts */
> + while (numSkipArrayKeys &&
> + attno_skip <= scan->keyData[input_ikey].sk_attno)
> + {
> + Oid opcintype = rel->rd_opcintype[attno_skip - 1];
> + Oid collation = rel->rd_indcollation[attno_skip - 1];
> + Oid eq_op = skipatts[attno_skip - 1].eq_op;
> + RegProcedure cmp_proc;
> +
> + if (!OidIsValid(eq_op))
> + {
> + /* won't skip using this attribute */
> + attno_skip++;

Isn't this branch impossible, given that numSkipArrayKeys is output
from _bt_decide_skipatts, whose output won't contain skipped
attributes which have eq_op=InvalidOid? I'd replace this with
Assert(OidIsValid(eq_op)).

> _bt_rewind_nonrequired_arrays

What types of scan keys can still generate non-required array keys? It
seems to me those are now mostly impossible, as this patch generates
required skip arrays for all attributes that don't yet have an
equality key and which are ahead of any (in)equality keys, except the
case with row compare keys which I already commented on above.

> utils/skipsupport.[ch]
I'm not sure why this is included in utils - isn't this exclusively
used in access/nbtree/*?

> +++ b/src/include/access/nbtree.h
BTArrayKeyInfo explodes in size, from 24B to 88B. I think some of that
is necessary, but should it really be that large?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment Content-Type Size
0001-Deduplicate-nbtcompare-s-scan-key-support-functions.patch.txt text/plain 10.2 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-06 19:07:46
Message-ID: CAH2-Wzny5oJ7RGcTO9vT9wyzXqx-N0qivtPXm+G0tuv68rdF9Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 4, 2024 at 4:58 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> This is a review on v11, not the latest v13. I suspect most comments
> still apply, but I haven't verified this.

v11 is indeed quite similar to v13, so this shouldn't really matter.

> I'm a bit concerned about the additional operations that are being
> added to the scan. Before this patch, the amount of work in the
> "horizontal" portion of the scan was limited to user-supplied
> scankeys, so O(1) even when the index condition is only (f < 7). But,
> with this patch, we're adding work for (a=, b=, c=, etc.) for every
> tuple in the scan.

There's no question that there are still some cases where this cannot
possibly pay for itself. And that just isn't acceptable -- no
arguments here.

> As these new "skip array" keys are primarily useful for inter-page
> coordination (by determining if we want to start a primitive scan to
> skip to a different page and which value range that primitive scan
> would search for, or continue on to the next sibling), can't we only
> apply the "skip array" portion of the code at the final tuple we
> access on this page?

I plan on doing something like that. I'll need to.

AFAICT I only need to avoid wasting CPU cycles here -- there are no
notable regressions from performing excessive amounts of index
descents, as far as I can tell. And so I plan on doing this without
fundamentally changing anything about the current design.

In particular, I want the performance to remain robust in cases where
the best strategy varies significantly as the scan progresses -- even
when we need to strongly favor skipping at first, and then strongly
favor staying/not skipping later on (all during the same individual
index scan). I really like that the current design gets that part
right.

> While this section already defines some things about index scans which
> seem btree-specific, I don't think we should add more references to
> btree scan internals in a section about bitmaps and bitmap index
> scans.

This section of the docs discusses the trade-off between multiple
single column indexes, and fewer multi-column indexes. How could skip
scan not be relevant to such a discussion? One of the main benefits of
skip scan is that it'll allow users to get by with fewer indexes.

> > +++ b/src/backend/access/nbtree/nbtree.c
> [...]
> > - slock_t btps_mutex; /* protects above variables, btps_arrElems */
> > + LWLock btps_lock; /* protects above variables, btps_arrElems */
>
> Why is this changed to LWLock, when it's only ever acquired exclusively?

In general one should never do more than an extremely small, tightly
controlled amount of work with a spinlock held. It's now possible that
we'll allocate memory with the lock held -- doing that with a spinlock
held is an obvious no-no. We really need to use an LWLock for this
stuff now, on general principle.

> > +btestimateparallelscan(Relation rel, int nkeys, int norderbys)
>
> I notice you're using DatumSerialize. Are there reasons why we
> wouldn't want to use heap_fill_tuple, which generally produces much
> more compact output?

heap_fill_tuple doesn't support the notion of -inf and +inf scan key
sentinel values. Plus I'm inclined to use DatumSerialize because it's
more or less designed for this kind of problem.

> Also, I think you can limit the space usage to BLCKSZ in total,
> because a full index tuple can't be larger than 1/3rd of a block; and
> for skip scans we'll only have known equality bounds for a prefix of
> attributes available in the index tuples, and a single (?)
> index-produced dynamic attribute we want to skip ahead of. So, IIUC,
> at most we'll have 2 index tuples' worth of data, or 2/3 BLCKSZ.
> Right?

Possibly, but who wants to take a chance? The scheme you're describing
only saves memory when there are 3 skip arrays, which is fairly unlikely
in general.

I think that the approach taken to serializing the array keys should
be as conservative as possible. It's not particularly likely that
we'll want to do a parallel skip scan. It's rather hard to test those
code paths.

> I needed to look up what this 'cons up' thing is, as it wasn't
> something that I'd seen before. It also seems used exclusively in
> btree code, and only after the array keys patch, so I think it'd be
> better in general to use 'construct' here.

FWIW I wasn't the first person to use the term in the nbtree code.

I think you're right, though. It is a needlessly obscure term that is
only known to Lisp hackers. I'll fix it.

> > +++ b/src/backend/access/nbtree/nbtcompare.c
>
> The changes here are essentially 6x the same code, but for different
> types. What do you think about the attached
> 0001-Deduplicate[...].patch.txt, which has the same effect but with
> only 1 copy of the code checked in?

Reminds me of the approach taken by this extension:

https://github.com/petere/pguint

I do find the need to write so much boilerplate code for B-Tree
opclasses annoying. I also find it annoying that the nbtree code
insists on being as forgiving as possible with incomplete opfamilies.
But those problems seem out of scope here -- not like I'm really
making it any worse.

> > +++b/src/backend/access/nbtree/nbtutils.c
> [...]
> > +_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
>
> Why does this stop processing keys after hitting a row compare?

Why not? It's not ideal, but there are a number of things about
RowCompare scan keys that are already less than ideal. We don't even
try to do any kind of preprocessing that involves RowCompares --
they're already a slightly awkward special case to the nbtree code.

> Doesn't skip scan still apply to any subsequent normal keys? E.g.
> "c=1" creates a scan "a=skip, b=skip, c=1", so "(a, b)>(1, 2), c=1"
> should IMO still allow a skip scan for a=skip, b=1 to be constructed -
> it shouldn't be that we get a much less specific (and potentially,
> performant) scan just by adding a rowcompare scankey on early
> attributes.

It's just awkward to get it to work as expected, while still
preserving all of the useful properties of the design.

Your "(a, b)>(1,2)" scan key returns rows matching:

WHERE (a = 1 AND b > 2) OR (a > 1)

And so your complete "(a, b)>(1,2) AND c = 1" qual returns rows matching:

WHERE ((a = 1 AND b > 2) OR (a > 1)) AND c = 1

It is difficult to imagine how the existing design for skip arrays can
be extended to support this kind of qual. I guess we'd still need skip
arrays on both "a" and "b" here, though. Right?

The "b" skip array would be restricted to a range of values between 3
and +inf inclusive, if and only if we were still on the "a" skip
array's first array element (i.e. iff "a = 1" the "b" has a valid
low_compare). Otherwise (i.e. when "a > 1"), the "b" skip array
wouldn't be constrained by any low_compare inequality. So
low_compare/high_compare only apply conditionally, in a world where we
support these kinds of RowCompare quals. Right now I can avoid the
problem by refusing to allow "c = 1" to ever be marked required (by
never creating any skip array on an index attribute >= an attribute
with a RowCompare key on input).

Obviously, the current design of skip arrays involves arrays that are
always constrained by the same range/low_compare and high_compare
inequalities, independent of any other factor/any wider context. It's
not impossible to make something like your RowCompare case work, but
it'd be very significantly more complicated than the existing design.
Though not all that much more useful.

Doing something like this might make sense in the context of a project
that adds support for the MDAM paper's "General OR Optimization"
transformations -- RowCompare support would only be a bonus. I don't
see any opportunities to target RowCompare as independent work --
seems as if RowCompare quals aren't significantly simpler than what is
required to support very general MDAM OR optimizations.

> > _bt_preprocess_array_keys
> > - output_ikey = 0;
> > + numArrayKeyData,
> > + numSkipArrayKeys;
>
> I don't think numArrayKeyData/arrayKeyData are good names here, as it
> confused me many times reviewing this function's changes.

The code in this area actually did change recently, though I didn't announce it.

> On a scankey
> of a=1,b=2 we won't have any array keys, yet this variable is set to
> 2. Something like numOutputKeys is probably more accurate.

That's not accurate. _bt_preprocess_array_keys() sets
*new_numberOfKeys on output. When there are no array keys (of either
type) to be output, _bt_preprocess_array_keys will just return NULL --
it won't have changed the *new_numberOfKeys passed by its
_bt_preprocess_keys caller when it returns NULL.

In other words, when _bt_preprocess_array_keys determines that there
are no array keys to process (neither SAOP arrays nor skip arrays), it
won't modify anything, and won't return an alternative
input-to-_bt_preprocess_keys scan key array. _bt_preprocess_keys will
work directly off of the scan->keyData[] input scan keys passed by the
executor proper.

> > + /* Create a skip array and scan key where indicated by skipatts */
> > + while (numSkipArrayKeys &&
> > + attno_skip <= scan->keyData[input_ikey].sk_attno)
> > + {
> > + Oid opcintype = rel->rd_opcintype[attno_skip - 1];
> > + Oid collation = rel->rd_indcollation[attno_skip - 1];
> > + Oid eq_op = skipatts[attno_skip - 1].eq_op;
> > + RegProcedure cmp_proc;
> > +
> > + if (!OidIsValid(eq_op))
> > + {
> > + /* won't skip using this attribute */
> > + attno_skip++;
>
> Isn't this branch impossible, given that numSkipArrayKeys is output
> from _bt_decide_skipatts, whose output won't contain skipped
> attributes which have eq_op=InvalidOid? I'd replace this with
> Assert(OidIsValid(eq_op)).

It's very possible to hit this code path -- we'll hit this code path
every time an explicit "=" input scan key appears before an attribute
that requires a skip array (which happens whenever there's a third,
later column that we'll be able to mark required thanks to the skip
array).

Note that BTSkipPreproc.eq_op is the "= op to be used to add array, if
any". And so when we see !OidIsValid(eq_op) in this loop, it means
"this is for an index attribute that we don't want to add a skip array
to" (though, as I said, a later attribute is expected to get a skip
array when this happens).
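
For example (hypothetical table and qual), given an index on (a, b, c):

-- "a" has an explicit "=" key, so its skipatts slot has an invalid
-- eq_op and the loop passes over it; only "b" gets a skip array,
-- which is what allows "c = 1" to be marked required
SELECT * FROM tbl WHERE a = 1 AND c = 1;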

> > _bt_rewind_nonrequired_arrays
>
> What types of scan keys can still generate non-required array keys?

Right now it's:

1. As you said, certain cases involving RowCompare scan keys.
2. Cases involving B-Tree operator classes that don't even have a "="
operator (yes, technically those are supported!).

Also seems possible that I'll end up relying on our support for
non-required arrays when I go fix those regressions. Seems possible
that I'll get rid of the explicit SK_BT_REQFWD/SK_BT_REQBKWD markings,
and go back to treating scan keys as required based on context (a
little like how things were prior to commit 7ccaf13a06).

> It seems to me those are now mostly impossible, as this patch generates
> required skip arrays for all attributes that don't yet have an
> equality key and which are ahead of any (in)equality keys, except the
> case with row compare keys which I already commented on above.

I agree that non-required arrays become an obscure edge case with the
patch, having been reasonably common before now (well, they were
common in Postgres 17). I don't think that that provides us with any
opportunities to get rid of unneeded code.

> > utils/skipsupport.[ch]
> I'm not sure why this is included in utils - isn't this exclusively
> used in access/nbtree/*?

The location of skip support is based on (though slightly different
from) the location of sort support. In general many B-Tree opclasses
are implemented in or around src/backend/utils/adt/*. The exception is
all of the stuff in nbtcompare.c, though I always found that weird (I
wouldn't mind getting rid of nbtcompare.c by relocating its code to
places like int.c and int8.c).

> > +++ b/src/include/access/nbtree.h
> BTArrayKeyInfo explodes in size, from 24B to 88B. I think some of that
> is necessary, but should it really be that large?

I'm disinclined to do anything about it right now.

I'll make a note of it, and review when the most important performance
problems are fixed.

Thanks for the review
--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-08 21:59:55
Message-ID: CAH2-Wz=M_8UCPpD5GQuJXmQZ6xePwhVSss38TmkBAwic12VvCg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 4, 2024 at 11:41 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v13, which is the first revision posted that has new
> functional changes in quite some time (this isn't just for fixing
> bitrot).

Attached is v14, which is just to fix bitrot caused by today's bugfix
commit b5ee4e52. Nothing new to report here.

--
Peter Geoghegan

Attachment Content-Type Size
v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.0 KB
v14-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 170.5 KB

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-19 08:30:56
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Apologies for the delayed response. I've confirmed that the costing is
significantly improved for multicolumn indexes in the case I provided.
Thanks!
https://www.postgresql.org/message-id/TYWPR01MB10982A413E0EC4088E78C0E11B1A62%40TYWPR01MB10982.jpnprd01.prod.outlook.com

I have some comments and questions regarding the v14 patch.

(1)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

As a user, I agree with adding a counter for skip scans (though it
might be better to reply in the other thread).

I believe this will help users determine whether the skip scan
optimization is effective. If the ratio of (actual rows)/(index
searches) becomes small, it suggests that many traversals are
occurring, indicating that the skip scan optimization is not working
efficiently. In such cases, users might benefit from replacing the
multi-column index with an optimized leading column index.
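
For example (the index names here are just for illustration):

-- many index searches per returned row on a multi-column index ...
CREATE INDEX t_id1_id2_idx ON t (id1, id2);
-- ... would suggest adding a dedicated leading column index instead:
CREATE INDEX t_id2_idx ON t (id2);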

IIUC, why not add it to the documentation? It would clearly help users
understand how to tune their queries using the counter, and it would
also show that the counter is not just for developers.

(2)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

From the perspective of consistency, wouldn't it be better to align the
naming between the EXPLAIN output and pg_stat_all_indexes.idx_scan,
even though the documentation states they refer to the same concept?

I personally prefer something like "search" instead of "scan", as
"scan" is commonly associated with node names like Index Scan and
similar terms. To maintain consistency, how about renaming
pg_stat_all_indexes.idx_scan to pg_stat_all_indexes.idx_search?

(3)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

The counter should be added in blgetbitmap().

(4)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
> doc/src/sgml/bloom.sgml

The output below is missing "Index Searches: 1".

   ->  Bitmap Index Scan on btreeidx2 (cost=0.00..12.04 rows=500 width=0) (never executed)
         Index Cond: (i2 = 898732)
 Planning Time: 0.491 ms
 Execution Time: 0.055 ms
(10 rows)

Although we may not need to fix it, due to the support for skip scan,
the B-tree index is now selected over the Bloom index in my environment.

(5)

> v14-0002-Add-skip-scan-to-nbtree.patch

Although I tested with various data types such as int, uuid, oid, and
others on my local PC, I could only identify the regression case that
you already mentioned.

The case is that Index(Only)Scan is selected instead of SeqScan.
Although the performance of the full index scan is almost the same as
that of SeqScan, the skip scan degrades by a factor of 4, despite the
number of accessed pages being almost the same. The flamegraph
indicates that the ratio of _bt_advance_array_keys() and
_bt_tuple_before_array_skeys() becomes high.

Although it's not an optimal solution and would only reduce the degree
of performance degradation, how about introducing a threshold per page
to switch from skip scan to full index scan? For example, switch to
full index scan after _bt_tuple_before_array_skeys() in _bt_checkkeys()
fails 30 times per page. While 30 is just an example, I referenced the
following case, which seems close to the worst-case scenario. In that
case, _bt_advance_array_keys() and _bt_tuple_before_array_skeys() are
called almost 333 times per page (1,000,000 records / 2,636 pages), and
30 is about 1/10 of that number.

# Example case (see test.sql)
# master: 51.612ms
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Gather (cost=1000.00..10633.43 rows=1 width=8) (actual time=0.160..51.612 rows=1 loops=1)
Output: id1, id2
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=4425
   ->  Parallel Seq Scan on public.t (cost=0.00..9633.33 rows=1 width=8) (actual time=31.728..48.378 rows=0 loops=3)
Output: id1, id2
Filter: (t.id2 = 100)
Rows Removed by Filter: 333333
Buffers: shared hit=4425
Worker 0: actual time=47.583..47.583 rows=0 loops=1
Buffers: shared hit=1415
Worker 1: actual time=47.568..47.569 rows=0 loops=1
Buffers: shared hit=1436
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.078 ms
Execution Time: 51.630 ms
(17 rows)

# master with v14patch: 199.328ms. 4x slower
SET skipscan_prefix_cols=32;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t_idx on public.t (cost=0.42..3535.75 rows=1 width=8) (actual time=0.090..199.328 rows=1 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 100)
Index Searches: 1
Heap Fetches: 0
Buffers: shared hit=2736
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.069 ms
Execution Time: 199.348 ms
(9 rows)

# master with v14patch: 54.665ms.
SET skipscan_prefix_cols=0; -- disable skip scan
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t_idx on public.t (cost=0.42..3535.75 rows=1 width=8) (actual time=0.031..54.665 rows=1 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 100)
Index Searches: 1
Heap Fetches: 0
Buffers: shared hit=2736
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.067 ms
Execution Time: 54.685 ms
(9 rows)

(6)

> v14-0002-Add-skip-scan-to-nbtree.patch
> src/backend/access/nbtree/nbtutils.c

+static int
+_bt_preprocess_num_array_keys(IndexScanDesc scan, BTSkipPreproc *skipatts,
+                              int *numSkipArrayKeys)
(snip)
+ int prev_numSkipArrayKeys = *numSkipArrayKeys;
+
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inkey
+ */
+ while (attno_skip < attno_inkey)

Is it better to move prev_numSkipArrayKeys = *numSkipArrayKeys after
the while loop? For example, the index below should return
*numSkipArrayKeys = 0 instead of 1 if the id3 type does not support
eq_op.

* index: CREATE INDEX test_idx on TEST (id1 int, id2 int, id3 no_eq_op, id4 int);
* query: SELECT * FROM test WHERE id4 = 10;

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment Content-Type Size
test.sql text/plain 660 bytes

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-19 19:06:35
Message-ID: CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi Masahiro,

On Tue, Nov 19, 2024 at 3:30 AM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> Apologies for the delayed response. I've confirmed that the costing is
> significantly
> improved for multicolumn indexes in the case I provided. Thanks!
> https://www.postgresql.org/message-id/TYWPR01MB10982A413E0EC4088E78C0E11B1A62%40TYWPR01MB10982.jpnprd01.prod.outlook.com

Great! I made it one of my private/internal test cases for the
costing. Your test case was quite helpful.

Attached is v15. It works through your feedback.

Importantly, v15 has a new patch which has a fix for your test.sql
case -- which is the most important outstanding problem with the patch
(and has been for a long time now). I've broken those changes out into
a separate patch because they're still experimental, and have some
known minor bugs. But it works well enough for you to assess how close
I am to satisfactorily fixing the known regressions, so it seems worth
posting quickly.

> IIUC, why not add it to the documentation? It would clearly help users
> understand how to tune their queries using the counter, and it would
> also show that the counter is not just for developers.

The documentation definitely needs more work. I have a personal TODO
item about that.

Changes to the documentation can be surprisingly contentious, so I
often work on it last, when we have the clearest picture of how to
talk about the feature. For example, Matthias said something that's
approximately the opposite of what you said about it (though I agree
with you about it).

> From the perspective of consistency, wouldn't it be better to align the
> naming
> between the EXPLAIN output and pg_stat_all_indexes.idx_scan, even though
> the
> documentation states they refer to the same concept?
>
> I personally prefer something like "search" instead of "scan", as "scan"
> is
> commonly associated with node names like Index Scan and similar terms.
> To maintain
> consistency, how about renaming pg_stat_all_indexes.idx_scan to
> pg_stat_all_indexes.idx_search?

I suspect that other hackers will reject that proposal on
compatibility grounds, even though it would make sense in a "green
field" situation.

Honestly, discussions about UI/UX details such as EXPLAIN ANALYZE
always tend to result in unproductive bikeshedding. What I really want
is something that will be acceptable to all parties. I don't have any
strong opinions of my own about it -- I just think that it's important
to show *something* like "Index Searches: N" to make skip scan user
friendly.

> (3)
>
> > v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
>
> The counter should be added in blgetbitmap().

Fixed.

> (4)
>
> > v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
> > doc/src/sgml/bloom.sgml
>
> The below forgot "Index Searches: 1".
>
>    ->  Bitmap Index Scan on btreeidx2 (cost=0.00..12.04 rows=500 width=0) (never executed)
> Index Cond: (i2 = 898732)
> Planning Time: 0.491 ms
> Execution Time: 0.055 ms
> (10 rows)

Fixed (though I made it show "Index Searches: 0" instead, since this
particular index scan node is "never executed").

> Although we may not need to fix it, due to the support for skip scan,
> the B-tree
> index is now selected over the Bloom index in my environment.

I am not inclined to change it.

> Although I tested with various data types such as int, uuid, oid, and
> others on my
> local PC, I could only identify the regression case that you already
> mentioned.

That's good news!

> Although it's not an optimal solution and would only reduce the degree
> of performance
> degradation, how about introducing a threshold per page to switch from
> skip scan to full
> index scan?

The approach to fixing these regressions from the new experimental
patch doesn't need to use any such threshold. It is effective both
with simple "WHERE id2 = 100" cases (like the queries from your
test.sql test case), as well as more complicated "WHERE id2 BETWEEN 99
AND 101" inequality cases.

What do you think? The regressions are easily under 5% with the new
patch applied, which is in the noise.

At the same time, we're just as capable of skipping whenever the scan
encounters a large group of skipped-prefix-column duplicates. For
example, if I take your test.sql test case and add another insert that
adds such a group (e.g., "INSERT INTO t SELECT 55, i FROM
generate_series(-1000000, 1000000) i;" ), and then re-run the query,
the scan is exactly as fast as before -- it just skips to get over the
newly inserted "55" group of tuples. Obviously, this also makes the
master branch far, far slower.
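
Spelled out (the query is the "WHERE id2 = 100" one from your test.sql):

-- add a large group of duplicates in the skipped prefix column
INSERT INTO t SELECT 55, i FROM generate_series(-1000000, 1000000) i;
-- re-running the query, the scan just skips over the new "55" group
SELECT * FROM t WHERE id2 = 100;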

As I've said many times already, the need to be flexible and offer
robust performance in cases where skipping is either very effective or
very ineffective *during the same index scan* seems very important to
me. This "55" variant of your test.sql test case is a great example of
the kinds of cases I was thinking about.

> Is it better to move prev_numSkipArrayKeys =*numSkipArrayKeys after the
> while loop?
> For example, the index below should return *numSkipArrayKeys = 0 instead
> of 1
> if the id3 type does not support eq_op.
>
> * index: CREATE INDEX test_idx on TEST (id1 int, id2 int, id3 no_eq_op,
> id4 int);
> * query: SELECT * FROM test WHERE id4 = 10;

Nice catch! You're right. Fixed this in v15, too.

Thanks for the review
--
Peter Geoghegan

Attachment Content-Type Size
v15-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.5 KB
v15-0003-POC-fix-for-regressions-in-unsympathetic-cases.patch application/octet-stream 11.3 KB
v15-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 174.2 KB

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-20 09:04:19
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi Peter,

On 2024-11-20 04:06, Peter Geoghegan wrote:
> Hi Masahiro,
>
> On Tue, Nov 19, 2024 at 3:30 AM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>> Apologies for the delayed response. I've confirmed that the costing is
>> significantly
>> improved for multicolumn indexes in the case I provided. Thanks!
>> https://www.postgresql.org/message-id/TYWPR01MB10982A413E0EC4088E78C0E11B1A62%40TYWPR01MB10982.jpnprd01.prod.outlook.com
>
> Great! I made it one of my private/internal test cases for the
> costing. Your test case was quite helpful.
>
> Attached is v15. It works through your feedback.
>
> Importantly, v15 has a new patch which has a fix for your test.sql
> case -- which is the most important outstanding problem with the patch
> (and has been for a long time now). I've broken those changes out into
> a separate patch because they're still experimental, and have some
> known minor bugs. But it works well enough for you to assess how close
> I am to satisfactorily fixing the known regressions, so it seems worth
> posting quickly.

Thanks for your quick response!

>> IIUC, why not add it to the documentation? It would clearly help users
>> understand how to tune their queries using the counter, and it would
>> also show that the counter is not just for developers.
>
> The documentation definitely needs more work. I have a personal TODO
> item about that.
>
> Changes to the documentation can be surprisingly contentious, so I
> often work on it last, when we have the clearest picture of how to
> talk about the feature. For example, Matthias said something that's
> approximately the opposite of what you said about it (though I agree
> with you about it).

OK, I understood.

>> From the perspective of consistency, wouldn't it be better to align
>> the naming between the EXPLAIN output and pg_stat_all_indexes.idx_scan,
>> even though the documentation states they refer to the same concept?
>>
>> I personally prefer something like "search" instead of "scan", as
>> "scan" is commonly associated with node names like Index Scan and
>> similar terms. To maintain consistency, how about renaming
>> pg_stat_all_indexes.idx_scan to pg_stat_all_indexes.idx_search?
>
> I suspect that other hackers will reject that proposal on
> compatibility grounds, even though it would make sense in a "green
> field" situation.
>
> Honestly, discussions about UI/UX details such as EXPLAIN ANALYZE
> always tend to result in unproductive bikeshedding. What I really want
> is something that will be acceptable to all parties. I don't have any
> strong opinions of my own about it -- I just think that it's important
> to show *something* like "Index Searches: N" to make skip scan user
> friendly.

OK, I agree.

>> Although it's not an optimal solution and would only reduce the degree
>> of performance degradation, how about introducing a threshold per page
>> to switch from skip scan to full index scan?
>
> The approach to fixing these regressions from the new experimental
> patch doesn't need to use any such threshold. It is effective both
> with simple "WHERE id2 = 100" cases (like the queries from your
> test.sql test case), as well as more complicated "WHERE id2 BETWEEN 99
> AND 101" inequality cases.
>
> What do you think? The regressions are easily under 5% with the new
> patch applied, which is in the noise.

That idea hadn't occurred to me. At first glance, your idea seems good
for all cases.

Actually, test.sql shows a performance improvement, and the performance
is almost the same as the master's seqscan. To be precise, the master's
performance is 10-20% better than the v15 patch because the seqscan is
executed in parallel. However, the v15 patch is twice as fast as when
seqscan is not executed in parallel.
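
For reference, the non-parallel seqscan numbers come from disabling
parallelism with the standard GUC (a sketch; the query is the one from
test.sql):

SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE id2 = 100;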

However, while reading your patch, I found that there is still a
problematic case. IIUC, beyondskip becomes true only if the tuple's id2
is greater than the scan key value. Therefore, the following query (see
test_for_v15.sql) still degrades.

-- build without '--enable-debug' '--enable-cassert' 'CFLAGS=-O0 '
-- SET skipscan_prefix_cols=32;
Index Only Scan using t_idx on public.t (cost=0.42..3535.75 rows=1
width=8) (actual time=65.767..65.770 rows=1 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 1000000)
Index Searches: 1
Heap Fetches: 0
Buffers: shared hit=2736
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.058 ms
Execution Time: 65.782 ms
(9 rows)

-- SET skipscan_prefix_cols=0;
Index Only Scan using t_idx on public.t (cost=0.42..3535.75 rows=1
width=8) (actual time=17.276..17.278 rows=1 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 1000000)
Index Searches: 1
Heap Fetches: 0
Buffers: shared hit=2736
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.044 ms
Execution Time: 17.290 ms
(9 rows)

I’m reporting the above result, though you might already be aware of the
issue.
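
For readability, here is a minimal sketch of the shape of the test
(assumed; the exact script is in the attached test_for_v15.sql):

CREATE TABLE t (id1 int, id2 int);
INSERT INTO t SELECT i, i FROM generate_series(1, 1000000) i;
CREATE INDEX t_idx ON t (id1, id2);
VACUUM FREEZE ANALYZE;
-- 1000000 is the largest id2 value, so no tuple's id2 is ever greater
-- than the scan key value, and beyondskip never becomes true
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE id2 = 1000000;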

> At the same time, we're just as capable of skipping whenever the scan
> encounters a large group of skipped-prefix-column duplicates. For
> example, if I take your test.sql test case and add another insert that
> adds such a group (e.g., "INSERT INTO t SELECT 55, i FROM
> generate_series(-1000000, 1000000) i;" ), and then re-run the query,
> the scan is exactly as fast as before -- it just skips to get over the
> newly inserted "55" group of tuples. Obviously, this also makes the
> master branch far, far slower.
>
> As I've said many times already, the need to be flexible and offer
> robust performance in cases where skipping is either very effective or
> very ineffective *during the same index scan* seems very important to
> me. This "55" variant of your test.sql test case is a great example of
> the kinds of cases I was thinking about.

Yes, I agree. Therefore, even if I can't think of a way to prevent
regressions, or if I can only think of improvements that would
significantly sacrifice the benefits of skip scan, I would still like
to report any regression cases if they occur.

There may be a better way, such as the new idea you suggested, and I
think there is room for discussion regarding how far we should go in
handling regressions, regardless of whether we choose to accept
regressions or sacrifice the benefits of skip scan to address them.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment Content-Type Size
test_for_v15.sql text/plain 945 bytes

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-20 19:40:05
Message-ID: CAH2-Wzk49UhvCTnhTz9N=A-+kYjr12aWVyqjDCoFbA2YV67gKA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 20, 2024 at 4:04 AM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> Thanks for your quick response!

Attached is v16. This is similar to v15, but the new
v16-0003-Fix-regressions* patch to fix the regressions is much less
buggy, and easier to understand.

Unlike v15, the experimental patch in v16 doesn't change anything
about which index pages are read by the scan -- not even in corner
cases. It is 100% limited to fixing the CPU overhead of maintaining
skip arrays uselessly *within* a leaf page. My extensive test suite
passes; it no longer shows any changes in "Buffers: N" for any of the
EXPLAIN (ANALYZE, BUFFERS) ... output that the tests look at. This is
what I'd expect.

I think that it will make sense to commit this patch as a separate
commit, immediately after skip scan itself is committed. It makes it
clear that, at least in theory, the new v16-0003-Fix-regressions*
patch doesn't change any behavior that's visible to code outside of
_bt_readpage/_bt_checkkeys/_bt_advance_array_keys.

> That idea hadn't occurred to me. At first glance, your idea seems good
> for all cases.

My approach of conditioning the new "beyondskip" behavior on
"has_skip_array && beyond_end_advance" is at least a good start.

The idea behind conditioning this behavior on having at least one
beyond_end_advance array advancement is pretty simple: in practice
that almost never happens during skip scans that actually end up
skipping (either via another _bt_first that redescends the index, or
via skipping "within the page" using the
_bt_checkkeys_look_ahead/pstate->skip mechanism). So that definitely
seems like a good general heuristic. It just isn't sufficient on its
own, as you have shown.

> Actually, test.sql shows a performance improvement, and the performance
> is almost the same as the master's seqscan. To be precise, the master's
> performance is 10-20% better than the v15 patch because the seqscan is
> executed in parallel. However, the v15 patch is twice as fast as when
> seqscan is not executed in parallel.

I think that that's a good result, overall.

Bear in mind that a case such as this might receive a big performance
benefit if it can skip only once or twice. It's almost impossible to
model those kinds of effects within the optimizer's cost model, but
they're still important effects.

FWIW, I notice that your "t" test table is 35 MB, whereas its t_idx
index is 21 MB. That's not very realistic (the index size is usually a
smaller fraction of the table size than we see here), which probably
partly explains why the planner likes parallel sequential scan for
this.
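
For anyone following along, those sizes can be checked with the
standard size functions:

SELECT pg_size_pretty(pg_relation_size('t')) AS table_size,
       pg_size_pretty(pg_relation_size('t_idx')) AS index_size;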

> However, while reading your patch, I found that there is still a
> problematic case. IIUC, beyondskip becomes true only if the tuple's id2
> is greater than the scan key value. Therefore, the following query (see
> test_for_v15.sql) still degrades.

As usual, you are correct. :-)

> I’m reporting the above result, though you might already be aware of the
> issue.

Thanks!

I have an experimental fix in mind for this case. One not-very-good
way to fix this new problem seems to work:

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index b70b58e0c..ddae5f2a1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3640,7 +3640,7 @@ _bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
 	 * for skip scan, and stop maintaining the scan's skip arrays until we
 	 * reach the page's finaltup, if any.
 	 */
-	if (has_skip_array && beyond_end_advance &&
+	if (has_skip_array && !all_required_satisfied &&
 		!has_required_opposite_direction_skip && pstate->finaltup)
 		pstate->beyondskip = true;

However, a small number of my test cases now fail. And (I assume) this
approach has certain downsides on leaf pages where we're now too quick
to stop maintaining skip arrays.

What I really need to do next is to provide a rigorous argument for
why the new pstate->beyondskip behavior is correct. I'm already
imposing restrictions on range skip arrays in v16 of the patch --
that's what the "!has_required_opposite_direction_skip" portion of the
test is about. But it still feels too ad-hoc.

I'm a little worried that these restrictions on range skip arrays will
themselves be the problem for some other kind of query. Imagine a
query like this:

SELECT * FROM t WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1

This is probably going to be regressed due to the aforementioned
"!has_required_opposite_direction_skip" restriction. Right now I don't
fully understand what restrictions are truly necessary, though. More
research is needed.

I think for v17 I'll properly fix all of the regressions that you've
complained about so far, including the most recent "SELECT * FROM t
WHERE id2 = 1_000_000" regression. Hopefully the best fix for this
other "WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1" regression will
become clearer once I get that far. What do you think?
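
If you want to keep an eye on that case, here is a sketch of the
comparison I have in mind (using the same skipscan_prefix_cols testing
GUC as your earlier tests):

SET skipscan_prefix_cols = 32;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1;
SET skipscan_prefix_cols = 0;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1;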

> Yes, I agree. Therefore, even if I can't think of a way to prevent
> regressions, or if I can only think of improvements that would
> significantly sacrifice the benefits of skip scan, I would still like
> to report any regression cases if they occur.

You're right, of course. It might make sense to accept some very small
regressions. But not if we can basically avoid all regressions. Which
may well be an attainable goal.

> There may be a better way, such as the new idea you suggested, and I
> think there is room for discussion regarding how far we should go in
> handling regressions, regardless of whether we choose to accept
> regressions or sacrifice the benefits of skip scan to address them.

There are definitely lots more options to address these regressions.
For example, we could have the planner hint that it thinks that skip
scan won't be a good idea, without that actually changing the basic
choices that nbtree makes about which pages it needs to scan (only how
to scan each individual leaf page). Or, we could remember that the
previous page used "pstate->beyondskip" each time _bt_readpage reads
another page. I could probably think of 2 or 3 more ideas like that,
if I had to.

However, the problem is not a lack of ideas IMV. The important
trade-off is likely to be between how effectively we can avoid these
regressions and how much complexity each approach imposes. My guess is
that complexity is more likely to impose limits on us than overall
feasibility.

--
Peter Geoghegan

Attachment Content-Type Size
v16-0003-Fix-regressions-in-unsympathetic-skip-scan-cases.patch application/octet-stream 12.8 KB
v16-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.5 KB
v16-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 174.1 KB

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-21 08:47:51
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 2024-11-21 04:40, Peter Geoghegan wrote:
> On Wed, Nov 20, 2024 at 4:04 AM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>> Thanks for your quick response!
>
> Attached is v16. This is similar to v15, but the new
> v16-0003-Fix-regressions* patch to fix the regressions is much less
> buggy, and easier to understand.
>
> Unlike v15, the experimental patch in v16 doesn't change anything
> about which index pages are read by the scan -- not even in corner
> cases. It is 100% limited to fixing the CPU overhead of maintaining
> skip arrays uselessly *within* a leaf page. My extensive test suite
> passes; it no longer shows any changes in "Buffers: N" for any of the
> EXPLAIN (ANALYZE, BUFFERS) ... output that the tests look at. This is
> what I'd expect.
>
> I think that it will make sense to commit this patch as a separate
> commit, immediately after skip scan itself is committed. It makes it
> clear that, at least in theory, the new v16-0003-Fix-regressions*
> patch doesn't change any behavior that's visible to code outside of
> _bt_readpage/_bt_checkkeys/_bt_advance_array_keys.

Thanks for the update! I'll look into the details and understand the
approach to the commit.

>> That idea hadn't occurred to me. At first glance, your idea seems good
>> for all cases.
>
> My approach of conditioning the new "beyondskip" behavior on
> "has_skip_array && beyond_end_advance" is at least a good start.
>
> The idea behind conditioning this behavior on having at least one
> beyond_end_advance array advancement is pretty simple: in practice
> that almost never happens during skip scans that actually end up
> skipping (either via another _bt_first that redescends the index, or
> via skipping "within the page" using the
> _bt_checkkeys_look_ahead/pstate->skip mechanism). So that definitely
> seems like a good general heuristic. It just isn't sufficient on its
> own, as you have shown.

Yes, I think so. This idea can make the worst-case execution time of a
skip scan almost the same as that of a full index scan.

>> Actually, test.sql shows a performance improvement, and the
>> performance is almost the same as the master's seqscan. To be precise,
>> the master's performance is 10-20% better than the v15 patch because
>> the seqscan is executed in parallel. However, the v15 patch is twice
>> as fast as when seqscan is not executed in parallel.
>
> I think that that's a good result, overall.
>
> Bear in mind that a case such as this might receive a big performance
> benefit if it can skip only once or twice. It's almost impossible to
> model those kinds of effects within the optimizer's cost model, but
> they're still important effects.
>
> FWIW, I notice that your "t" test table is 35 MB, whereas its t_idx
> index is 21 MB. That's not very realistic (the index size is usually a
> smaller fraction of the table size than we see here), which probably
> partly explains why the planner likes parallel sequential scan for
> this.

Yes, I agree that the above case is not realistic. If anything, the
purpose of this case might be to simply find regression scenarios.

One possible use case I can think of is enforcing a unique constraint
on all columns. However, such cases are probably very rare.
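
To illustrate what I mean, a hypothetical schema where the index
naturally covers every column (so the index is almost as large as the
table):

CREATE TABLE t (id1 int, id2 int);
ALTER TABLE t ADD CONSTRAINT t_uniq UNIQUE (id1, id2);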

> I have an experimental fix in mind for this case. One not-very-good
> way to fix this new problem seems to work:
>
> diff --git a/src/backend/access/nbtree/nbtutils.c
> b/src/backend/access/nbtree/nbtutils.c
> index b70b58e0c..ddae5f2a1 100644
> --- a/src/backend/access/nbtree/nbtutils.c
> +++ b/src/backend/access/nbtree/nbtutils.c
> @@ -3640,7 +3640,7 @@ _bt_advance_array_keys(IndexScanDesc scan,
> BTReadPageState *pstate,
> * for skip scan, and stop maintaining the scan's skip arrays
> until we
> * reach the page's finaltup, if any.
> */
> - if (has_skip_array && beyond_end_advance &&
> + if (has_skip_array && !all_required_satisfied &&
> !has_required_opposite_direction_skip && pstate->finaltup)
> pstate->beyondskip = true;
>
> However, a small number of my test cases now fail. And (I assume) this
> approach has certain downsides on leaf pages where we're now too quick
> to stop maintaining skip arrays.

I built with the above change and ran make check, and found an
assertion failure, which may not be related to what you pointed out.

* the reproducible simple query (you can see the original query in
opr_sanity.sql).
select * from pg_proc
where proname in (
'lo_lseek64',
'lo_truncate',
'lo_truncate64')
and pronamespace = 11;

* the assertion error
TRAP: failed Assert("sktrig_required && required"), File:
"nbtutils.c", Line: 3375, PID: 362411

While investigating the error, I came to think that we might need to
handle the case where key->sk_flags does not have SK_BT_SKIP. The
assertion fails because requiredSameDir never becomes true, since the
proname scan key does not have SK_BT_SKIP.

+		if (beyondskip)
+		{
+			/*
+			 * "Beyond end advancement" skip scan optimization.
+			 *
+			 * Just skip over any skip array scan keys. Treat all other scan
+			 * keys as not required for the scan to continue.
+			 */
+			Assert(!prechecked);
+
+			if (key->sk_flags & SK_BT_SKIP)
+				continue;
+		}
+		else if (((key->sk_flags & SK_BT_REQFWD) && ScanDirectionIsForward(dir)) ||
+				 ((key->sk_flags & SK_BT_REQBKWD) && ScanDirectionIsBackward(dir)))
 			requiredSameDir = true;

> What I really need to do next is to provide a rigorous argument for
> why the new pstate->beyondskip behavior is correct. I'm already
> imposing restrictions on range skip arrays in v16 of the patch --
> that's what the "!has_required_opposite_direction_skip" portion of the
> test is about. But it still feels too ad-hoc.
>
> I'm a little worried that these restrictions on range skip arrays will
> themselves be the problem for some other kind of query. Imagine a
> query like this:
>
> SELECT * FROM t WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1
>
> This is probably going to be regressed due to the aforementioned
> "!has_required_opposite_direction_skip" restriction. Right now I don't
> fully understand what restrictions are truly necessary, though. More
> research is needed.
>
> I think for v17 I'll properly fix all of the regressions that you've
> complained about so far, including the most recent "SELECT * FROM t
> WHERE id2 = 1_000_000" regression. Hopefully the best fix for this
> other "WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1" regression will
> become clearer once I get that far. What do you think?

To be honest, I don't fully understand
has_required_opposite_direction_skip and its context at the moment.
Please give me some time to understand it, and I'd like to provide
feedback afterward.

FWIW, the optimization is especially important for types that don't
support 'skipsupport', like 'real'. Although the example case I
provided uses integer types, columns of types like 'real' tend to have
many distinct values and high cardinality, which means skip scan is
less likely to work efficiently for them.

>> There may be a better way, such as the new idea you suggested, and I
>> think there is room for discussion regarding how far we should go in
>> handling regressions, regardless of whether we choose to accept
>> regressions or sacrifice the benefits of skip scan to address them.
>
> There are definitely lots more options to address these regressions.
> For example, we could have the planner hint that it thinks that skip
> scan won't be a good idea, without that actually changing the basic
> choices that nbtree makes about which pages it needs to scan (only how
> to scan each individual leaf page). Or, we could remember that the
> previous page used "pstate->beyondskip" each time _bt_readpage reads
> another page. I could probably think of 2 or 3 more ideas like that,
> if I had to.
>
> However, the problem is not a lack of ideas IMV. The important
> trade-off is likely to be between how effectively we can avoid these
> regressions and how much complexity each approach imposes. My guess is
> that complexity is more likely to impose limits on us than overall
> feasibility.

OK, I think you're right.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION


From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-22 09:17:23
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

> On 2024-11-21 17:47, Masahiro Ikeda wrote:
>> On 2024-11-21 04:40, Peter Geoghegan wrote:
>> diff --git a/src/backend/access/nbtree/nbtutils.c
>> b/src/backend/access/nbtree/nbtutils.c
>> index b70b58e0c..ddae5f2a1 100644
>> --- a/src/backend/access/nbtree/nbtutils.c
>> +++ b/src/backend/access/nbtree/nbtutils.c
>> @@ -3640,7 +3640,7 @@ _bt_advance_array_keys(IndexScanDesc scan,
>> BTReadPageState *pstate,
>> * for skip scan, and stop maintaining the scan's skip arrays
>> until we
>> * reach the page's finaltup, if any.
>> */
>> - if (has_skip_array && beyond_end_advance &&
>> + if (has_skip_array && !all_required_satisfied &&
>> !has_required_opposite_direction_skip && pstate->finaltup)
>> pstate->beyondskip = true;
>>
>> However, a small number of my test cases now fail. And (I assume) this
>> approach has certain downsides on leaf pages where we're now too quick
>> to stop maintaining skip arrays.
>
> I built with the above change and ran make check, and found an
> assertion failure, which may not be related to what you pointed out.
>
> * the reproducible simple query (you can see the original query in
> opr_sanity.sql).
> select * from pg_proc
> where proname in (
> 'lo_lseek64',
> 'lo_truncate',
> 'lo_truncate64')
> and pronamespace = 11;
>
> * the assertion error
> TRAP: failed Assert("sktrig_required && required"), File:
> "nbtutils.c", Line: 3375, PID: 362411
>
> While investigating the error, I came to think that we might need to
> handle the case where key->sk_flags does not have SK_BT_SKIP. The
> assertion fails because requiredSameDir never becomes true, since the
> proname scan key does not have SK_BT_SKIP.
>
> + if (beyondskip)
> + {
> + /*
> + * "Beyond end advancement" skip scan optimization.
> + *
> + * Just skip over any skip array scan keys. Treat all other scan
> + * keys as not required for the scan to continue.
> + */
> + Assert(!prechecked);
> +
> + if (key->sk_flags & SK_BT_SKIP)
> + continue;
> + }
> + else if (((key->sk_flags & SK_BT_REQFWD) &&
> ScanDirectionIsForward(dir)) ||
> + ((key->sk_flags & SK_BT_REQBKWD) &&
> ScanDirectionIsBackward(dir)))
> requiredSameDir = true;
>

My previous investigation was incorrect, sorry. IIUC, when
beyondskip=true, _bt_check_compare() should return false as soon as
possible with continuescan=true after the tuple fails the key check,
rather than setting requiredSameDir to true, because the scan has
already been triggered to perform a full index scan of the page.

Though the change fixes the assertion error in 'make check', there are
still cases where the number of rows returned is incorrect. I would
like to continue investigating.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-22 22:34:14
Message-ID: CAH2-Wz=SXWMOMvaTZjzv96+QaZFsHZ_H=f-qP-xSrFCWxKwC3g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 22, 2024 at 4:17 AM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> Though the change fixes the assertion error in 'make check', there are
> still cases where the number of rows returned is incorrect. I would
> like to continue investigating.

Thanks for taking a look at that! I've come up with a better approach,
though (sorry about how quickly this keeps changing!). See the
attached new revision, v17.

I'm now calling the new optimization from the third patch the
"skipskip" optimization. I believe that v17 will fix those bugs that
you saw -- let me know if those are gone now. It also greatly improves
the situation with regressions (more on that later).

New approach
------------

Before now, I was trying to preserve the usual invariants that we have
for the scan's array keys: the array keys must "track the progress" of
the scan through the index's key space -- that's what those
_bt_tuple_before_array_skeys precondition and postcondition asserts
inside _bt_advance_array_keys verify for us (these assertions have
proven very valuable, both now and during the earlier Postgres 17
nbtree SAOP project). My original approach (to managing the overhead
of maintaining skip arrays in adversarial/unsympathetic cases) was
overly complicated, inflexible, and brittle.

It's simpler (much simpler) to allow the scan to temporarily forget
about the invariants: once the "skipskip" optimization is activated,
we don't care about the rules that require that "the array keys track
progress through the key space" -- not until _bt_readpage actually
reaches the page's finaltup. Then _bt_readpage can handle the problem
using a trick that we already use elsewhere (e.g., in btrestrpos):
_bt_readpage just calls _bt_start_array_keys to get the array keys to
their initial positions (in the current scan direction), before
calling _bt_checkkeys for finaltup in the usual way.

Under this scheme, the _bt_checkkeys call for finaltup will definitely
call _bt_advance_array_keys to advance the array keys to the correct
place (the correct place when scanning forward is ">= finaltup in the
key space"). The truly essential thing is that we never accidentally
allow the array keys to "prematurely get ahead of the scan's tuple in
the keyspace" -- that leads to wrong answers to queries once we reach
the next page. But the arrays can be allowed to temporarily remain
*before* the scan's tuples without consequence (it's safe when the
skipskip optimization is in effect, at least, since the _bt_checkkeys
calls treat everything as a non-required key, and so
_bt_checkkeys/_bt_advance_array_keys will never expect any non-skipped
SAOP arrays to tell them anything about the scan's progress through
the index's key space -- there'll be no unsafe "cur_elem_trig" binary
searches, for example).

This approach also allowed me to restore all of the assertions in
nbtutils.c to their original form. That was important to me -- those
assertions have saved me quite a few times.

Regressions, performance improvements
-------------------------------------

As I touched on briefly already, the new approach is also
significantly *faster* than the master branch in certain "adversarial"
cases (this is explained in the next paragraph). Overall, all of the
cases that were unacceptably regressed before now, that I know of
(e.g., all of the cases that you brought to my attention so far,
Masahiro) are either significantly faster, or only very slightly
slower. The regressions that we still have in v17 are probably
acceptable -- though clearly I still have a lot of performance
validation work to do before reaching a final conclusion about
regressions.

I also attach a variant of your test_for_v15.sql test case, Masahiro.
I used this to do some basic performance validation of this latest
version of the patch -- it'll give you a general sense of where we are
with regressions, and will also show those "adversarial" cases that
end up significantly faster than the master branch with v17.
Significant improvements are sometimes seen because the "skipskip"
optimization replaces the requiredOppositeDirOnly optimization (see
commit e0b1ee17dc for context). We can do better than the existing
requiredOppositeDirOnly optimizations because we can skip more
individual scan key comparisons. For example, with this query:

SELECT * FROM t WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1_000_000

This goes from taking ~25ms with a warm cache on master, to only
taking ~17ms on v17 of the patch series. I wasn't really trying to
make anything faster, here -- it's a nice bonus.

There are 3 scan keys involved here, when the query is run on the
master branch: "id1 >= 0", "id1 <= 1_000_000", and "id2 = 1_000_000".
The existing requiredOppositeDirOnly optimization doesn't work because
only one page will ever have its pstate->first set to 'true' within
_bt_readpage -- that's due to "id2 = 1_000_000" (a non-required scan
key) only rarely evaluating to true. Whereas with skip scan (and its
"skipskip" optimization), there are only 2 scan keys (well, sort of):
the range skip array scan key on "id1", plus the "id2 = 1_000_000"
scan key. We'll be able to skip over the range skip array scan key
entirely with the "skipskip" optimization, so the range skip array's
lower_bound and upper_bound "subsidiary" scan keys won't need to be
evaluated more than once per affected leaf page. In other words, we'll
still need to evaluate the "id2 = 1_000_000" scan key against every tuple on
every leaf page -- but we don't need to use the >= or <=
subsidiary-to-skip-array scan keys (except once per page, to establish
that the optimization is safe).

--
Peter Geoghegan

Attachment Content-Type Size
v17-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.5 KB
v17-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 174.1 KB
v17-0003-Add-skipskip-nbtree-skip-scan-optimization.patch application/octet-stream 18.6 KB
test_for_v17.sql application/octet-stream 6.5 KB

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-25 09:07:39
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 2024-11-23 07:34, Peter Geoghegan wrote:
> On Fri, Nov 22, 2024 at 4:17 AM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>> Though the change fixes the assertion error in 'make check', there are
>> still cases where the number of rows returned is incorrect. I would
>> like to continue investigating.
>
> Thanks for taking a look at that! I've come up with a better approach,
> though (sorry about how quickly this keeps changing!). See the
> attached new revision, v17.
>
> I'm now calling the new optimization from the third patch the
> "skipskip" optimization. I believe that v17 will fix those bugs that
> you saw -- let me know if those are gone now. It also greatly improves
> the situation with regressions (more on that later).
>
> New approach
> ------------
>
> Before now, I was trying to preserve the usual invariants that we have
> for the scan's array keys: the array keys must "track the progress" of
> the scan through the index's key space -- that's what those
> _bt_tuple_before_array_skeys precondition and postcondition asserts
> inside _bt_advance_array_keys verify for us (these assertions have
> proven very valuable, both now and during the earlier Postgres 17
> nbtree SAOP project). My original approach (to managing the overhead
> of maintaining skip arrays in adversarial/unsympathetic cases) was
> overly complicated, inflexible, and brittle.
>
> It's simpler (much simpler) to allow the scan to temporarily forget
> about the invariants: once the "skipskip" optimization is activated,
> we don't care about the rules that require that "the array keys track
> progress through the key space" -- not until _bt_readpage actually
> reaches the page's finaltup. Then _bt_readpage can handle the problem
> using a trick that we already use elsewhere (e.g., in btrestrpos):
> _bt_readpage just calls _bt_start_array_keys to get the array keys to
> their initial positions (in the current scan direction), before
> calling _bt_checkkeys for finaltup in the usual way.
>
> Under this scheme, the _bt_checkkeys call for finaltup will definitely
> call _bt_advance_array_keys to advance the array keys to the correct
> place (the correct place when scanning forward is ">= finaltup in the
> key space"). The truly essential thing is that we never accidentally
> allow the array keys to "prematurely get ahead of the scan's tuple in
> the keyspace" -- that leads to wrong answers to queries once we reach
> the next page. But the arrays can be allowed to temporarily remain
> *before* the scan's tuples without consequence (it's safe when the
> skipskip optimization is in effect, at least, since the _bt_checkkeys
> calls treat everything as a non-required key, and so
> _bt_checkkeys/_bt_advance_array_keys will never expect any non-skipped
> SAOP arrays to tell them anything about the scan's progress through
> the index's key space -- there'll be no unsafe "cur_elem_trig" binary
> searches, for example).
>
> This approach also allowed me to restore all of the assertions in
> nbtutils.c to their original form. That was important to me -- those
> assertions have saved me quite a few times.
>
> Regressions, performance improvements
> -------------------------------------
>
> As I touched on briefly already, the new approach is also
> significantly *faster* than the master branch in certain "adversarial"
> cases (this is explained in the next paragraph). Overall, all of the
> cases that were unacceptably regressed before now, that I know of
> (e.g., all of the cases that you brought to my attention so far,
> Masahiro) are either significantly faster, or only very slightly
> slower. The regressions that we still have in v17 are probably
> acceptable -- though clearly I still have a lot of performance
> validation work to do before reaching a final conclusion about
> regressions.
>
> I also attach a variant of your test_for_v15.sql test case, Masahiro.
> I used this to do some basic performance validation of this latest
> version of the patch -- it'll give you a general sense of where we are
> with regressions, and will also show those "adversarial" cases that
> end up significantly faster than the master branch with v17.
> Significant improvements are sometimes seen because the "skipskip"
> optimization replaces the requiredOppositeDirOnly optimization (see
> commit e0b1ee17dc for context). We can do better than the existing
> requiredOppositeDirOnly optimizations because we can skip more
> individual scan key comparisons. For example, with this query:
>
> SELECT * FROM t WHERE id1 BETWEEN 0 AND 1_000_000 AND id2 = 1_000_000
>
> This goes from taking ~25ms with a warm cache on master, to only
> taking ~17ms on v17 of the patch series. I wasn't really trying to
> make anything faster, here -- it's a nice bonus.
>
> There are 3 scan keys involved here, when the query is run on the
> master branch: "id1 >= 0", "id1 <= 1_000_000", and "id2 = 1_000_000".
> The existing requiredOppositeDirOnly optimization doesn't work because
> only one page will ever have its pstate->first set to 'true' within
> _bt_readpage -- that's due to "id2 = 1_000_000" (a non-required scan
> key) only rarely evaluating to true. Whereas with skip scan (and its
> "skipskip" optimization), there are only 2 scan keys (well, sort of):
> the range skip array scan key on "id1", plus the "id2 = 1_000_000"
> scan key. We'll be able to skip over the range skip array scan key
> entirely with the "skipskip" optimization, so the range skip array's
> lower_bound and upper_bound "subsidiary" scan keys won't need to be
> evaluated more than once per affected leaf page. In other words, we'll
> still need to evaluate the "id2 = 1_000_000" scan key against every tuple on
> every leaf page -- but we don't need to use the >= or <=
> subsidiary-to-skip-array scan keys (except once per page, to establish
> that the optimization is safe).

Thanks for updating the patch!

The approach looks good to me. In fact, the significant regressions I
reported have disappeared in my environment. And test_for_v17.sql
shows that the performance of master and of master with your patches
is almost the same.

One thing I am concerned about is that it reduces the cases where the
optimization of _bt_checkkeys_look_ahead() works effectively, as the
approach skips the skip immediately on the first occurrence per page.
But I'd like to take your approach because I prioritize stability.

FWIW, I conducted tests to understand the downside of the 0003 patch.
IIUC, the worst-case scenario occurs when the first few tuples per page
have different values for the first attribute, while the rest share the
same value. The result shows that the 0003 patch causes a 2x
degradation in performance, although the performance is still faster
than that of the master branch.

* master with 0001, 0002 and 0003 patch.
-- SET skipscan_prefix_cols=32
Index Only Scan using t_idx on public.t (cost=0.42..3576.58 rows=2712
width=8) (actual time=0.019..15.107 rows=2717 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 360)
Index Searches: 2696
Heap Fetches: 0
Buffers: shared hit=8126
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.049 ms
Execution Time: 15.203 ms
(9 rows)

* master with 0001 and 0002 patch. (without 0003 patch)
-- SET skipscan_prefix_cols=32
Index Only Scan using t_idx on public.t (cost=0.42..3576.55 rows=2709
width=8) (actual time=0.011..6.886 rows=2717 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 360)
Index Searches: 2696
Heap Fetches: 0
Buffers: shared hit=8126
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.062 ms
Execution Time: 6.981 ms
(9 rows)

* how to reproduce the above results:
-- create table
DROP TABLE IF EXISTS t;
CREATE unlogged TABLE t (id1 int, id2 int);
-- A special case where the 0003 patch performs poorly.
-- It inserts 367 index tuples per page, making only the first two id1
-- values different and keeping the rest the same.
-- psql=# SELECT * FROM bt_page_stats('t_idx', 1);
-- -[ RECORD 1 ]-+-----
-- blkno | 1
-- type | l
-- live_items | 367
-- dead_items | 0
-- avg_item_size | 16
-- page_size | 8192
-- free_size | 808
-- btpo_prev | 0
-- btpo_next | 2
-- btpo_level | 0
-- btpo_flags | 1
INSERT INTO t (
SELECT CASE WHEN i%368<3 THEN i%368+i/368*100 ELSE i/368*1000+10 END,
i%368
FROM generate_series(1, 1_000_000) s(i)
);
CREATE INDEX t_idx on t (id1, id2);
VACUUM FREEZE ANALYZE;

-- example data
SELECT * FROM t LIMIT 10;
SELECT * FROM t WHERE id2 > 365 LIMIT 10;

-- performance
SET skipscan_prefix_cols=32;
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
FROM t WHERE id2 = 360; -- cache
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
FROM t WHERE id2 = 360;
SET skipscan_prefix_cols=0;
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
FROM t WHERE id2 = 360;

Again, the above results are provided for reference, as I believe that
many users prioritize stability and I'd like to take your new approach.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-25 22:32:34
Message-ID: CAH2-Wzmd4TT-F6RSEjyXGNMoUEYUMdhit50onCEw02ud=Tb_Gw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 25, 2024 at 4:07 AM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> The approach looks good to me. In fact, the significant regressions I
> reported have disappeared in my environment. And the test_for_v17.sql
> shows that the performance of the master and the master with your
> patches
> is almost same.

That's great.

I'm also a bit worried about regressing queries that don't even
involve skip arrays -- particularly very simple queries. I'm thinking
of things like "pgbench -S" style SELECT queries. Adding almost any
new code to a hot code path such as _bt_check_compare comes with a
risk of such regressions. (Up until now it has been convenient to
pretend that "skipscan_prefix_cols=0" is 100% representative of
master, but obviously it is less than 100% representative -- since
"skipscan_prefix_cols=0" still has enlarged object code size in places
like _bt_check_compare.)

Attached is v18, which makes _bt_check_compare copy all relevant scan
key fields into local variables. This gives the compiler more freedom
due to not having to worry about aliasing.

I believe that this change to _bt_check_compare fixes some regressions
with simple queries. This includes a "pgbench -S" variant that I
previously used during the Postgres 17 SAOP array work -- a script
that showed a favorable case for that 17 work, involving "SELECT aid
FROM pgbench_accounts WHERE aid IN (....)", with hundreds of
contiguous array elements. More importantly, it avoids regressions
with standard "pgbench -S" SELECT queries.

It takes a long time to validate changes such as these -- the
regressions are within the range of noise, so it is difficult to tell
the difference between signal and noise. I probably went further with
some of the new changes to _bt_check_compare than really made sense.
For example, I probably added likely() or unlikely() annotations that
don't really help. This is something that I will continue to refine.

> One thing I am concerned about is that it reduces the cases where the
> optimization of _bt_checkkeys_look_ahead() works effectively, as the
>> approach skips the skip immediately on the first occurrence per page.

I noticed that with the recent v17 revision of the patch, my original
MDAM paper "sales_mdam_paper" test case (the complicated query in the
introductory email of this thread) was about 2x slower. That's just
not okay, obviously. But the issue was relatively easy to fix: it was
fixed by making _bt_readpage not apply the "skipskip" optimization
when on the first page for the current primitive index scan -- we
already do this with the "precheck" optimization, so it's natural to
do it with the "skipskip" optimization as well.

The "sales_mdam_paper" test case involves thousands of primitive index
scans that each access only one leaf page. But each leaf page returns
2 non-adjoining tuples, with quite a few non-matching tuples "in
between" the matching tuples. There is one matching tuple for "store =
200", and another for "store = 250" -- and there's non-matching stores
201 - 249 between these two, which we want _bt_checkkeys_look_ahead to
skip over. This is exactly the kind of case where the
_bt_checkkeys_look_ahead() optimization is expected to help.

> -- performance
> SET skipscan_prefix_cols=32;
> EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
> FROM t WHERE id2 = 360; -- cache
> EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
> FROM t WHERE id2 = 360;
> SET skipscan_prefix_cols=0;
> EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT, SETTINGS, WAL, VERBOSE) SELECT *
> FROM t WHERE id2 = 360;
>
> Again, the above results are provided for reference, as I believe that
> many users prioritize stability and I'd like to take your new approach.

Adversarial cases specifically designed to "make the patch look bad"
are definitely useful review feedback. Ideally, the patch will be 100%
free of regressions -- no matter how unlikely (or even silly) they may
seem. I always prefer to not have to rely on anybody's opinion of what
is likely or unlikely. :-)

A quick test seems to show that this particular regression is more or
less fixed by v18. As you said, the _bt_checkkeys_look_ahead stuff is
the issue here (same with the MDAM sales query). You should confirm
that the issue has actually been fixed, though.

Thanks
--
Peter Geoghegan

Attachment Content-Type Size
v18-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.6 KB
v18-0003-Add-skipskip-nbtree-skip-scan-optimization.patch application/octet-stream 24.3 KB
v18-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 174.1 KB

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-29 02:33:47
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 2024-11-26 07:32, Peter Geoghegan wrote:
> On Mon, Nov 25, 2024 at 4:07 AM Masahiro Ikeda
> <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
>> One thing I am concerned about is that it reduces the cases where the
>> optimization of _bt_checkkeys_look_ahead() works effectively, as the
>> approach skips the skip immediately on the first occurrence per page.
>
> I noticed that with the recent v17 revision of the patch, my original
> MDAM paper "sales_mdam_paper" test case (the complicated query in the
> introductory email of this thread) was about 2x slower. That's just
> not okay, obviously. But the issue was relatively easy to fix: it was
> fixed by making _bt_readpage not apply the "skipskip" optimization
> when on the first page for the current primitive index scan -- we
> already do this with the "precheck" optimization, so it's natural to
> do it with the "skipskip" optimization as well.
>
> The "sales_mdam_paper" test case involves thousands of primitive index
> scans that each access only one leaf page. But each leaf page returns
> 2 non-adjoining tuples, with quite a few non-matching tuples "in
> between" the matching tuples. There is one matching tuple for "store =
> 200", and another for "store = 250" -- and there's non-matching stores
> 201 - 249 between these two, which we want _bt_checkkeys_look_ahead to
> skip over. This is exactly the kind of case where the
> _bt_checkkeys_look_ahead() optimization is expected to help.

Great! Your new approach strikes a good balance between the trade-offs
of the "skipskip" and "look ahead" optimizations. Although the
regression case I provided seems to be a corner case, your regression
case is realistic and should be addressed.

>> Again, the above results are provided for reference, as I believe that
>> many users prioritize stability and I'd like to take your new
>> approach.
>
> Adversarial cases specifically designed to "make the patch look bad"
> are definitely useful review feedback. Ideally, the patch will be 100%
> free of regressions -- no matter how unlikely (or even silly) they may
> seem. I always prefer to not have to rely on anybody's opinion of what
> is likely or unlikely. :-)
>
> A quick test seems to show that this particular regression is more or
> less fixed by v18. As you said, the _bt_checkkeys_look_ahead stuff is
> the issue here (same with the MDAM sales query). You should confirm
> that the issue has actually been fixed, though.

Thanks to your new patch, I have confirmed that the issue is fixed.

I have no comments on the new patches. If I find any new regression
cases, I'll report them.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-12-11 19:13:23
Message-ID: CAH2-WznrDHoLK+6afOWr=pH6FMn7nLhC8wMzpQaes-mjRmQN1g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 28, 2024 at 9:33 PM Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com> wrote:
> Thanks to your new patch, I have confirmed that the issue is fixed.
>
> I have no comments on the new patches. If I find any new regression
> cases, I'll report them.

Attached is v19.

This is just to fix bitrot. The patch series stopped applying against
HEAD cleanly due to recent work that made EXPLAIN ANALYZE show buffers
output by default.

--
Peter Geoghegan

Attachment Content-Type Size
v19-0003-Add-skipskip-nbtree-skip-scan-optimization.patch application/x-patch 22.3 KB
v19-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/x-patch 52.8 KB
v19-0002-Add-skip-scan-to-nbtree.patch application/x-patch 172.3 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-01-03 19:43:45
Message-ID: CAH2-WzmoVuvE_6h-HUuExu1KK1gKX5Hpbw0mUbWJcaqmK8VDVg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 11, 2024 at 2:13 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v19.

I now attach v20. This revision simplifies the "skipskip"
optimization, from the v20-0003-* patch. We now apply it on every page
that isn't the primitive index scan's first leaf page read (during
skip scans) -- we'll no longer activate it midway through scanning a
leaf page within _bt_readpage.

The newly revised "skipskip" optimization seems to get the regressions
down to only a 5% - 10% increase in runtime across a wide variety of
unsympathetic cases -- I'm now validating performance against a test
suite based on the adversarial cases presented by Masahiro Ikeda on
this thread. Although I think that I'll end up tuning the "skipskip"
mechanism some more (I may have been too conservative in marginal
cases that actually do benefit from skipping), I deem these
regressions to be acceptable. They're only seen in the most
unsympathetic cases, where an omitted leading column has groupings of
no more than about 50 index tuples, making skipping pretty hopeless.
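
To give a concrete sense of "unsympathetic", here is a sketch of the
kind of dataset I mean (names assumed; this isn't taken verbatim from
the test suite):

CREATE TABLE t (id1 int, id2 int);
-- each omitted-prefix id1 value gets a group of only 50 tuples
INSERT INTO t SELECT i / 50, i FROM generate_series(1, 1000000) i;
CREATE INDEX t_idx ON t (id1, id2);
VACUUM ANALYZE t;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE id2 = 500;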

I knew from the outset that the hardest part of this project would be
avoiding regressions in highly unsympathetic cases. The regressions
that are still there seem very difficult to minimize any further; the
overhead that remains comes from the simple need to maintain the
scan's skip arrays once per page, before leaving the page. Once a scan
decides to apply the "skipskip" optimization, it tends to stick with
it for all future leaf pages -- leaving only the overhead of checking
the high key while advancing the scan's arrays. I've cut just about
all that that I can reasonably cut from the hot code paths that are at
issue with the regressed cases.

It's important to have a sense of the context that these regressions
are seen in. We can reasonably hope that the optimizer wouldn't pick a
plan like this in the first place, and/or hope that the user would
create an appropriate index to avoid an inherently inefficient full
index scan (a scan like the one that I've regressed). Plus the
overhead only gets this high for index-only scans, where index
traversal costs will naturally dominate. If a user's query really is
made slower to the same degree (5% - 10%), then the user probably
doesn't consider the query very performance critical. They're unlikely
to notice the 5% - 10% regression -- creating the right index for the
job will make the query multiple times faster, at a minimum.
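
For instance, in the regressed "WHERE id2 = N" cases from earlier in
this thread, the right index for the job is simply a dedicated one (a
sketch, reusing the table name from those test cases):

CREATE INDEX t_id2_idx ON t (id2);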

The break-even point where we should prefer to skip is pretty close to
a policy of simply always skipping -- especially with the "skipskip"
optimization/patch in place. That makes it seem unlikely that we could
do much better by giving the optimizer a greater role in things. I
just don't think that the optimizer has sufficiently accurate
information about the characteristics of the index to get anything
close to the level of precision that is required to avoid regressions.
For example, I see many queries that are ~5x faster than an equivalent
full index scan, despite only ever skipping over every second leaf
page -- there are still big savings in CPU costs for such cases. We
see big speedups in these "marginal" cases -- speedups that are really
hard to model using the available statistics. If a reliable cost
function could be built, then it would be very sensitive to its
parameters, and would exhibit very nonlinear behavior. In general,
something that behaves like that seems unlikely to ever be truly
reliable.

--
Peter Geoghegan

Attachment Content-Type Size
v20-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.8 KB
v20-0003-Lower-the-overhead-of-nbtree-runtime-skip-checks.patch application/octet-stream 21.2 KB
v20-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 174.3 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-01-13 20:22:46
Message-ID: CAH2-WzmOFgH-i4d64P6NOgiPhaTB98qLfe89iGCjnM_fXOfCqg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 3, 2025 at 2:43 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I now attach v20.

Attached is v21. This revision is just to fix bitrot against HEAD that
was caused by recent commits of mine -- all of which were related to
nbtree preprocessing.

Now that nbtree has a separate file for preprocessing related code
(nbtpreprocesskeys.c), it's easier to see how the code added by the
skip scan patch fits together with everything else. The patch
actually adds slightly more code to nbtpreprocesskeys.c (to
preprocess scan keys in new ways) than it adds to nbtutils.c (to
evaluate those same scan keys in new ways). This new structure
definitely improves readability.

--
Peter Geoghegan

Attachment Content-Type Size
v21-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.8 KB
v21-0003-Lower-the-overhead-of-nbtree-runtime-skip-checks.patch application/octet-stream 22.8 KB
v21-0002-Add-skip-scan-to-nbtree.patch application/octet-stream 174.3 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-01-23 22:34:17
Message-ID: CAH2-Wz=-i0QuNaNGJRjuLx7HDSgRWXcPo22ChX+2tTCMidxM_Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 13, 2025 at 3:22 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v21. This revision is just to fix bitrot against HEAD that
> was caused by recent commits of mine -- all of which were related to
> nbtree preprocessing.

Attached is v22.

This revision changes the optimizer's cost model, making its costing
exactly match the costing on the master branch in marginal cases --
cases where some skipping may be possible, but not enough to justify
*expecting* any saving during query planning.

Perhaps surprisingly, skipping only every second leaf page can be
5x-7x faster, even though the number of "buffers hit" could easily be
~3x higher due to all of the extra internal page reads. But it's
really hard to predict exactly how much we'll benefit from skipping
during planning, within btcostestimate. The costing is of course
driven by statistics, and estimating the cardinality of multiple
columns together with those statistics is bound to be quite inaccurate
much of the time. We should err in the direction of assuming a
relatively expensive full index scan (we always did, but now we do so
even more).
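
The principle, roughly sketched (illustrative names only -- this is
not the actual btcostestimate change):

#include <stdbool.h>

/*
 * Hedged sketch of the costing principle described above.  In marginal
 * cases the estimate must degrade to exactly the master branch's full
 * index scan cost; we only credit the scan with savings when the
 * statistics clearly justify expecting them.
 */
static double
skip_scan_cost(double full_scan_cost, double optimistic_skip_cost,
               bool stats_justify_skipping)
{
    if (!stats_justify_skipping)
        return full_scan_cost;      /* match master's costing exactly */

    /* even then, never assume that skipping beats a full index scan */
    if (optimistic_skip_cost > full_scan_cost)
        return full_scan_cost;

    return optimistic_skip_cost;
}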

As a result of these improvements to the costing, v22 is the first
version without any changes to EXPLAIN output/query plans in expected
regression test output. That wasn't what I set out to do (I actually
set out to fix clearly nonsensical costing in certain edge cases), but
not having any plan changes in the regression tests does seem like a
good thing to me.

v22 also simplifies a number of things on the nbtree side:

* We no longer keep around a possibly-cross-type ORDER proc in skip
arrays (just the original scan key). We give up on a marginal
optimization that was used only during skip scans involving a skipped
column with a range containing either >= or <= inequalities for a
type/opclass that lacks skip support.

In v22, nbtree scans can no longer determine that a tuple "(a, b) =
('foo', 342)" with a qual "WHERE a <= 'foo' AND b = 342" doesn't have
to continue once it reaches the first tuple > '(foo, 342)' when "a" is
of a type that doesn't offer skip support, such as text (if "a" is of
a type like integer then we still get this behavior, without any of
the complexity). The downside of ripping this optimization out is that
there might now be an extra primitive index scan that finds that the
next "a" value is > 'foo' before we can actually terminate the scan --
we'll now fail to notice that the existing skip array element is
already 'foo', so the next 'a' value in the index must be greater than
'foo' and cannot possibly satisfy the qual. The upside is that it makes
things simpler, and avoids extra comparisons during scans that might
not pay for themselves.
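
Roughly, the check we're giving up on looked like this (a hedged
sketch with made-up names, not the actual v21 code, which went through
a separately-stored ORDER proc):

#include <stdbool.h>

/*
 * With a qual like "WHERE a <= 'foo' AND b = 342", once the skip
 * array's current element already sits at the range's upper bound and
 * the tuple's "a" value is past that element, no later "a" value can
 * satisfy the qual -- the scan can end without another primitive
 * index scan.
 */
static bool
range_skip_array_exhausted(bool cur_elem_is_upper_bound, int cmp_result)
{
    /* cmp_result > 0 means tuple's "a" is ahead of the current element */
    return cur_elem_is_upper_bound && cmp_result > 0;
}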

This optimization wasn't adding very much, and didn't seem to justify
the complexity that it imposed during preprocessing. Keeping around
extra ORDER procs had problems in cases that happened to involve
cross-type operators. I'm pretty sure that they were broken in the
presence of a redundant/duplicative inequality that couldn't be proven
to be safe to eliminate by preprocessing. I probably could have fixed
the problem instead, but it seemed better to just cut scope.

* The optimization that has nbtree preprocessing convert "WHERE a > 5
AND b = 9000" into "WHERE a >= 6 AND b = 9000" where possible (i.e. in
cases involving a skip array that offers skip support) has been broken
out into its own commit/patch -- that's now in 0004-*.

It's possible that I'll ultimately conclude that this optimization
isn't worth the complexity, either -- and then rip it out as well. But
it isn't all that complicated, and only imposes overhead during
preprocessing (never during the scan proper), so I still lean towards
committing it. But it's certainly not essential.
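
For illustration, the increment at the heart of this transformation (a
minimal sketch with made-up names -- the real patch goes through the
opclass's skip support routine rather than hardcoded integer logic):

#include <limits.h>
#include <stdbool.h>

/*
 * Convert a strict lower bound into an inclusive one by incrementing
 * it: "a > 5" becomes "a >= 6".  Returns false (keeping the original
 * qual) when the bound has no successor value.
 */
static bool
convert_gt_to_ge(int *bound)
{
    if (*bound == INT_MAX)
        return false;       /* no successor; keep "a > 5" as-is */

    (*bound)++;             /* "a > 5"  =>  "a >= 6" */
    return true;
}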

* Reorders and renames the new functions in nbtutils.c and in
nbtpreprocesskeys.c for clarity.

* Polishes and slightly refactors array preprocessing, to make it
easier to understand the rules that determine when and how
preprocessing generates skip arrays.

--
Peter Geoghegan

Attachment Content-Type Size
v22-0005-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 4.4 KB
v22-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.8 KB
v22-0003-Lower-the-overhead-of-nbtree-runtime-skip-checks.patch application/octet-stream 23.1 KB
v22-0002-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 157.2 KB
v22-0004-Convert-nbtree-inequalities-using-skip-support.patch application/octet-stream 8.1 KB

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-01-25 03:07:45
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 03/01/2025 21:43, Peter Geoghegan wrote:
> The newly revised "skipskip" optimization seems to get the regressions
> down to only a 5% - 10% increase in runtime across a wide variety of
> unsympathetic cases -- I'm now validating performance against a test
> suite based on the adversarial cases presented by Masahiro Ikeda on
> this thread. Although I think that I'll end up tuning the "skipskip"
> mechanism some more (I may have been too conservative in marginal
> cases that actually do benefit from skipping), I deem these
> regressions to be acceptable. They're only seen in the most
> unsympathetic cases, where an omitted leading column has groupings of
> no more than about 50 index tuples, making skipping pretty hopeless.

On my laptop, this is the worst case I could come up with:

create table skiptest as select g / 10 as a, g%10 as b from
generate_series(1, 10000000) g;
vacuum freeze skiptest;
create index on skiptest (a, b);

set enable_seqscan=off; set max_parallel_workers_per_gather=0;

\timing on

After repeating a few times, to warm the cache:

postgres=# select count(*) from skiptest where b=1;
count
---------
1000000
(1 row)

Time: 202.501 ms

And after 'set skipscan_prefix_cols=0':

select count(*) from skiptest where b=1;
count
---------
1000000
(1 row)

Time: 164.762 ms

EXPLAIN ANALYZE confirms that it uses an Index Only scan in both cases.

> I knew from the outset that the hardest part of this project would be
> avoiding regressions in highly unsympathetic cases. The regressions
> that are still there seem very difficult to minimize any further; the
> overhead that remains comes from the simple need to maintain the
> scan's skip arrays once per page, before leaving the page. Once a scan
> decides to apply the "skipskip" optimization, it tends to stick with
> it for all future leaf pages -- leaving only the overhead of checking
> the high key while advancing the scan's arrays. I've cut just about
> all that that I can reasonably cut from the hot code paths that are at
> issue with the regressed cases.

Hmm, looking at the code and at a perf profile, a lot of code is
executed per tuple: just function calls, passing arguments, checking
flags etc. I suspect you could shave off some cycles by structuring the
code in a more compiler and branch-prediction friendly way. Or just with
more aggressive inlining. Not sure what exactly to suggest, but the code
of _bt_readpage() and all the subroutines it calls is complicated.

Aside from the performance of those routines, it doesn't feel very
intuitive. _bt_checkkeys() not only checks the current keys, but it can
also read ahead tuples on the same page and update the array keys.

One little thing I noticed by stepping with debugger is that it calls
index_getattr() twice for the same tuple and attribute. First in
_bt_check_compare(), and if that sets *continuescan=false, again in
_bt_tuple_before_array_skeys(). The first index_getattr() call is
certainly pretty expensive because that's where you get the cache miss
on the tuple when scanning. Not sure if the second call matters much,
but it feels like a bad smell.

The comment on _bt_tuple_before_array_skeys() says:

> * readpagetup callers must only call here when _bt_check_compare already set
> * continuescan=false. We help these callers deal with _bt_check_compare's
> * inability to distinguish between the < and > cases (it uses equality
> * operator scan keys, whereas we use 3-way ORDER procs)
That raises the question: why does _bt_check_compare() not call the 3-way
ORDER proc in the first place? That would avoid the overhead of another
call here.

Aside from these micro-optimizations, I wonder about the "look-ahead"
logic in _bt_checkkeys_look_ahead. It feels pretty simplistic. Could you
use Exponential Search
(https://en.wikipedia.org/wiki/Exponential_search) instead of a plain
binary search on the page?
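
(A minimal sketch of what I mean, over a plain sorted int array rather
than real page offsets -- illustrative only:)

/*
 * Exponential ("galloping") search: double the step until we overshoot
 * the target, then binary search the bracketed range.  Cheap when the
 * target is close to the starting point, still O(log n) otherwise.
 * Assumes nitems > 0 and that some item >= target exists; a real
 * caller would need to handle both edge cases.
 */
static int
exponential_search(const int *items, int nitems, int target)
{
    int     bound = 1;
    int     lo,
            hi;

    while (bound < nitems && items[bound] < target)
        bound *= 2;             /* gallop: 1, 2, 4, 8, ... */

    /* first match must lie in (bound / 2, min(bound, nitems - 1)] */
    lo = bound / 2;
    hi = (bound < nitems) ? bound : nitems - 1;

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (items[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }

    return lo;                  /* first index with items[i] >= target */
}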

--
Heikki Linnakangas
Neon (https://neon.tech)


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-01-25 03:38:21
Message-ID: CAH2-Wz==Q1jA7CTC_dDApUB2tyWantWaWdOJ7O2_fChoK9+wBg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 24, 2025 at 10:07 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On my laptop, this is the worst case I could come up with:
>
> create table skiptest as select g / 10 as a, g%10 as b from
> generate_series(1, 10000000) g;
> vacuum freeze skiptest;
> create index on skiptest (a, b);
>
> set enable_seqscan=off; set max_parallel_workers_per_gather=0;
>
> \timing on
>
> After repeating a few times, to warm the cache:
>
> postgres=# select count(*) from skiptest where b=1;
> count
> ---------
> 1000000
> (1 row)

I can reproduce this. However, it should be noted that the regression
completely goes away if I make one small change to your test case: all
I need to do is make sure that the CREATE INDEX happens *before*
inserting any rows into the table. Once I do that, suffix truncation
tends to be quite a bit more effective. This makes all the difference
with your test case, since it encourages the existing heuristics
within _bt_advance_array_keys to do the right thing and stick on the
leaf level. That allows the "skipskip" mechanism to kick in as
expected, which doesn't seem to be happening when the index is built
by CREATE INDEX.

Of course, this doesn't make your adversarial case invalid. But it
does suggest a solution: Maybe nbtsort.c could be taught to be more
careful about "picking a split point", matching the behavior of
nbtsplitloc.c. Alternatively, I could invent some new heuristics with
this issue in mind.

I already had an open item in my personal TODO. That open item
concerns backwards scans, which don't use the high key within
_bt_advance_array_keys. As such they cannot really expect to benefit
in the same way from my suggested changes to nbtsort.c.

Your adversarial case is probably exactly the same issue as the
backwards scan issue I plan on looking into, even though you used a
forward scan + CREATE INDEX. So I probably need a solution that'll
work just as well, regardless of how effective suffix truncation is
(since backwards scans will never have a "low key" to consider what's
likely to be on the next page in any case).

> Aside from the performance of those routines, it doesn't feel very
> intuitive. _bt_checkkeys() not only checks the current keys, but it can
> also read ahead tuples on the same page and update the array keys.

True. But none of that is new. That's all from Postgres 17.

> One little thing I noticed by stepping with debugger is that it calls
> index_getattr() twice for the same tuple and attribute. First in
> _bt_check_compare(), and if that sets *continuescan=false, again in
> _bt_tuple_before_array_skeys(). The first index_getattr() call is
> certainly pretty expensive because that's where you get the cache miss
> on the tuple when scanning. Not sure if the second call matters much,
> but it feels like a bad smell.

Possibly, but the right thing to do here is to just not maintain the
skip arrays at all. What's at issue with your test case is that the
scan doesn't quite manage to notice that that's what it should do. You
might still be right about this, but it is nevertheless true that this
*shouldn't* be relevant (it is relevant right now, but it's not hard
to see that it doesn't have to be relevant).

> The comment on _bt_tuple_before_array_skeys() says:
>
> > * readpagetup callers must only call here when _bt_check_compare already set
> > * continuescan=false. We help these callers deal with _bt_check_compare's
> > * inability to distinguish between the < and > cases (it uses equality
> > * operator scan keys, whereas we use 3-way ORDER procs)
> That raises the question: why does _bt_check_compare() not call the 3-way
> ORDER proc in the first place? That would avoid the overhead of another
> call here.

Again, this is a design decision made by the Postgres 17 SAOP patch.

I think that there is something to be said for matching the behavior
of regular scans, including using operators (not 3-way ORDER procs)
when scanning on the leaf level.

> Aside from these micro-optimizations, I wonder about the "look-ahead"
> logic in _bt_checkkeys_look_ahead. It feels pretty simplistic. Could you
> use Exponential Search
> (https://en.wikipedia.org/wiki/Exponential_search) instead of a plain
> binary search on the page?

Maybe. But, again, I don't think that that's really the issue with
your test case. The issue is that it doesn't quite manage to use the
skipskip optimization, even though that's clearly possible, and
actually fixes the issue. Once it does that then
_bt_checkkeys_look_ahead won't ever be used, so it can't matter (at
least not as far as the query you came up with is concerned).

Let me get back to you on this.

Thanks for the review!

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-01-29 00:30:25
Message-ID: CAH2-WzkEGz7diwA_2Oak2KAAOXcNMfNGec7gUmU6zrpTPdcAog@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 24, 2025 at 10:38 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I can reproduce this. However, it should be noted that the regression
> completely goes away if I make one small change to your test case: all
> I need to do is make sure that the CREATE INDEX happens *before*
> inserting any rows into the table. Once I do that, suffix truncation
> tends to be quite a bit more effective. This makes all the difference
> with your test case, since it encourages the existing heuristics
> within _bt_advance_array_keys to do the right thing and stick on the
> leaf level. That allows the "skipskip" mechanism to kick in as
> expected, which doesn't seem to be happening when the index is built
> by CREATE INDEX.

I think that I could have done better at explaining myself here. I'll
have another go at that:

Your test case showed an excessive number of primitive index scans:
EXPLAIN ANALYZE showed "Index Searches: 21859", even though the ideal
number of index searches is 1, given all these specifics. It would be
understandable if a person saw that and concluded that the added
overhead/regression comes from descending the index many more times
than is truly necessary -- blaming the added overhead that comes from
all those extra _bt_search calls is a good guess. But that's not it.
Not really.

You (Heikki) didn't actually make any claims about _bt_search being
too hot during profiling of this test case. You didn't actually see
_bt_search show up prominently when you ran "perf". What you actually
saw (and actually complained about) was stuff that is called from
_bt_readpage, to deal with array maintenance. Even still, I want to
avoid making this any more confusing than it needs to be. The nature
of the problem needs to be carefully teased apart.

There is an important sense in which the issue of excessive primitive
index scans *is* related to the regression from wasting cycles on
array maintenance, though only indirectly: the heuristics that decide
if the skipskip mechanism should be enabled during a _bt_readpage call
are indirectly influenced by certain other heuristics, in
_bt_advance_array_keys.
These other heuristics are the heuristics that determine
whether or not we'll start another primitive index scan (all of which
are in _bt_advance_array_key, and were added in Postgres 17, and
haven't been touched by this patch series).

More concretely: if we choose to start another primitive index scan,
and make a bad choice (because we land on the very next sibling leaf
page, which we could have reached by simply stepping right, without
calling _bt_first/_bt_search again), then we also won't have an
opportunity to apply the skipskip mechanism on that same sibling leaf
page.
That's because in practice every leaf page read within _bt_readpage
will be the first leaf page of the ongoing primitive index scan with
this test case. Being the first leaf page of a primscan supposedly
makes a leaf page a bad target for the skipskip optimization, and so
we never actually apply the skipskip optimization in practice here.

Again, the real problem is simply that we're not applying the skipskip
optimization at all -- even though it was specifically written with
cases like Heikki's adversarial case in mind, and even though it
actually works as designed once it is forced to activate with Heikki's
test case. It may also be a bit of a problem that there are 21859 calls to
_bt_search instead of just 1, but that's a surprisingly small
contributor to the added overhead. (I'll probably fix the problem by
avoiding useless primitive index scans in the first place, rather than
by changing the heuristics that activate skipskip, which condition the
use of skipskip on firstPage=false. But, again, that doesn't mean that
the problem is excessive primscan overhead from all of the extra
_bt_first/_bt_search calls. The problem is indirect, and so my solution
can be indirect, too.)
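
(To pin down the dependency with a sketch -- made-up names, not the
patch's code:)

#include <stdbool.h>

/*
 * The skipskip heuristics never fire on the first leaf page of a
 * primitive index scan.  A scan that needlessly starts a new primitive
 * scan for every leaf page therefore never applies the optimization,
 * even when it would clearly pay off.
 */
static bool
can_apply_skipskip(bool skipscan, bool firstpage)
{
    return skipscan && !firstpage;
}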

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-02-05 23:43:17
Message-ID: CAH2-Wzkz0wPe6+02kr+hC+JJNKfGtjGTzpG3CFVTQmKwWNrXNw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 24, 2025 at 10:38 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Jan 24, 2025 at 10:07 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > On my laptop, this is the worst case I could come up with:
> >
> > create table skiptest as select g / 10 as a, g%10 as b from
> > generate_series(1, 10000000) g;
> > vacuum freeze skiptest;
> > create index on skiptest (a, b);
> >
> > set enable_seqscan=off; set max_parallel_workers_per_gather=0;
> >
> > \timing on
> >
> > After repeating a few times, to warm the cache:
> >
> > postgres=# select count(*) from skiptest where b=1;
> > count
> > ---------
> > 1000000
> > (1 row)

> Your adversarial case is probably exactly the same issue as the
> backwards scan issue I plan on looking into, even though you used a
> forward scan + CREATE INDEX. So I probably need a solution that'll
> work just as well, regardless of how effective suffix truncation is
> (since backwards scans will never have a "low key" to consider what's
> likely to be on the next page in any case).

Attached is v23, which fixes this issue by making sure that the
"skipskip" optimization is actually applied -- something that was
always *supposed* to happen in cases such as this one. With v22, your
adversarial case was a little over 20% slower. With this v23 I have
the regression down to under 3%, since we'll now apply the skipskip
optimization as expected. This level of slowdown seems acceptable to
me, for a case such as this (after all, this is an index scan that the
optimizer is unlikely to ever actually choose).

The way that the "skipskip" optimization works is unchanged in v23.
And even the way that we decide whether to apply that optimization
didn't really change, either. What's new in v23 is that
v23-0003-*patch adds rules around primitive scan scheduling.
Obviously, I specifically targeted Heikki's regression when I came up
with this, but the new _bt_advance_array_keys rules are nevertheless
orthogonal: even scans that just use conventional SAOP arrays will
also use these new _bt_advance_array_keys heuristics (though it'll
tend to matter much less there).

As I attempted to explain recently (admittedly it's quite confusing),
making better choices around scheduling primitive index scans is
important for its own sake, for fairly obvious reasons, but it's even
more important for far more subtle reasons: better scheduling gives
the skipskip heuristics a clearer picture of what's really going on.
That's why more or less the same set of skipskip heuristics now work
much better in practice (in cases such as Heikki's adversarial case).
It's rather indirect.

Again, the problem that Heikki highlighted was more or less an
unforeseen, far-removed consequence of not having a very clear picture
about how to schedule primitive index scans within
_bt_advance_array_keys. I proved this when I showed that Heikki's
problem went away, even with v22, once retail inserts were used
instead of CREATE INDEX. I demonstrated that retail inserts allow the
nbtsplitloc.c suffix truncation stuff to work, which is one way to
give the _bt_advance_array_keys primitive scan scheduling a clearer
picture of what it should be doing. With v23 I came up with another
way of doing that -- a way that actually works with indexes built with
CREATE INDEX (as well as with backwards scans, which had essentially
the same problem even when suffix truncation was working well).

The structure of the existing code changes somewhat to accommodate the
new requirements: v23 moves the scanBehind recheck process from
_bt_checkkeys into its _bt_readpage caller. This was arguably already
a missed opportunity for my commit 79fa7b3b from late last year. If we
expect _bt_readpage to explicitly check a flag to perform a
so->scanBehind recheck in one case, then we might as well have it do
so in all cases. It makes what's really going on with the scan
quite a bit clearer.

--
Peter Geoghegan

Attachment Content-Type Size
v23-0002-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 166.6 KB
v23-0003-Lower-the-overhead-of-nbtree-runtime-skip-checks.patch application/octet-stream 41.6 KB
v23-0005-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 4.4 KB
v23-0004-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/octet-stream 11.3 KB
v23-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.8 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-02-14 23:06:23
Message-ID: CAH2-Wzn5-CV5GbNRbFKaUjBqGP8UTazgf7sgPdtMtnS=eKCqqA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 5, 2025 at 6:43 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The way that the "skipskip" optimization works is unchanged in v23.
> And even the way that we decide whether to apply that optimization
> didn't really change, either. What's new in v23 is that
> v23-0003-*patch adds rules around primitive scan scheduling.
> Obviously, I specifically targeted Heikki's regression when I came up
> with this, but the new _bt_advance_array_keys rules are nevertheless
> orthogonal: even scans that just use conventional SAOP arrays will
> also use these new _bt_advance_array_keys heuristics (though it'll
> tend to matter much less there).

Attached is v24, which breaks out these recent changes to primscan
scheduling into their own commit/patch (namely
v24-0002-Improve-nbtree-SAOP-primitive-scan-scheduling.patch). The
primscan scheduling improvement stuff hasn't really changed since
v23 (though I did polish it some more). I hope to be able to
commit this new primscan scheduling patch sooner rather than later
(though I don't think that it's quite committable yet). It is
technically an independent improvement to the scheduling of primitive
index scans during SAOP nbtree index scans, so it makes sense to get
it out of the way.

The changes in v24 aren't just structural. The real changes in v24 are
to the optimization previously known as the "skipskip" optimization,
which now appears alone in
v24-0004-Lower-the-overhead-of-nbtree-runtime-skip-checks.patch (since
I broke out the other changes into their own patch). I guess it's
called the "_bt_skip_ikeyprefix" optimization now, since that's the
name of the function that now activates the optimization (the function
that is sometimes called from _bt_readpage during so->skipScan scans).
It seemed better to place the emphasis on starting calls to
_bt_check_compare with a later pstate.ikey (i.e. one greater than 0),
since that's where most of the benefit comes from.

We now treat all of the scan's arrays as nonrequired when the
_bt_skip_ikeyprefix optimization is activated -- even skip arrays can
have nonrequired calls to _bt_advance_array_keys from
_bt_check_compare (that won't just happen for the scan's SAOP arrays
in v24). This approach seems clearer to me. More importantly, it
performs much better than the previous approach in certain complicated
cases involving multiple range skip arrays against several different
index columns. (I can elaborate on what those queries look like and
why they're better with this new approach, if anybody wants me to.)
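
To illustrate the central idea with a minimal sketch (stand-in types
and names only -- not the actual patch code): once we've determined
that every tuple on the page satisfies the first pstate.ikey keys,
per-tuple checking can simply begin at that key:

#include <stdbool.h>

typedef struct
{
    int     values[8];      /* one value per index attribute */
} TupleSketch;

typedef struct
{
    int     arg;            /* an = key's comparison argument */
} KeySketch;

static bool
tuple_satisfies_from(const TupleSketch *tup, const KeySketch *keys,
                     int nkeys, int start_ikey)
{
    int     ikey;

    /* keys 0 .. start_ikey - 1 were proven satisfied for every tuple */
    for (ikey = start_ikey; ikey < nkeys; ikey++)
    {
        if (tup->values[ikey] != keys[ikey].arg)
            return false;
    }

    return true;
}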

--
Peter Geoghegan

Attachment Content-Type Size
v24-0006-DEBUG-Add-skip-scan-disable-GUCs.patch application/x-patch 4.4 KB
v24-0004-Lower-the-overhead-of-nbtree-runtime-skip-checks.patch application/x-patch 30.1 KB
v24-0005-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/x-patch 11.3 KB
v24-0003-Add-nbtree-skip-scan-optimizations.patch application/x-patch 166.4 KB
v24-0002-Improve-nbtree-SAOP-primitive-scan-scheduling.patch application/x-patch 16.7 KB
v24-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/x-patch 52.8 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-02-23 17:19:17
Message-ID: CAH2-Wzmxzs1-SvgoT+y84FzSpz8K1g33Cyby4hpn_GNiOw+CJA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 14, 2025 at 6:06 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v24, which breaks out these recent changes to primscan
> scheduling into their own commit/patch (namely
> v24-0002-Improve-nbtree-SAOP-primitive-scan-scheduling.patch). The
> primscan scheduling improvement stuff hasn't really changed since
> v23 (though I did polish it some more). I hope to be able to
> commit this new primscan scheduling patch sooner rather than later
> (though I don't think that it's quite committable yet).

The patch series recently bitrot due to conflicting changes on HEAD,
so I decided to post a new v25 now.

New in v25:

* Fixed a regression with parallel index scans caused by the improved
scheduling logic added by
0002-Improve-nbtree-SAOP-primitive-scan-scheduling.patch.

v24 did not account for how "firstPage=false" calls to _bt_readpage
might originate from _bt_first (not _bt_next) during parallel scans,
even during the first page for the parallel worker's primitive
scan/_bt_first call -- which made the heuristics added to
_bt_advance_array_keys do the wrong thing by not terminating the
primitive scan, based on faulty information. This was possible via the
parallel scan _bt_readnextpage "seized=true" path, which led to
regressions. In v25 we now pass "firstPage=true" whenever a call to
_bt_readpage happens through _bt_first, no matter the details (for the
first _bt_readpage call's page).

* The additional EXPLAIN ANALYZE logic (that shows "Index Searches:
N") added by 0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch has
been adjusted, and no longer divides by nloops.

I explain why dividing by nloops has some fairly bad consequences on
the thread I started for the quasi-independent enhancement to EXPLAIN
ANALYZE output:

https://postgr.es/m/CAH2-WzmebSkeKPGw7TEaNw9=Qx-X8fAnFw916Fd2V8VVqYqqaQ@mail.gmail.com

* Polished commit messages.

* Added more test coverage. Line coverage for nbtutils.c is now just
over 90%. We now have coverage for almost all of the
new code that advances the scan's skip arrays, including code that
deals with NULL values that is seldom reached.

--
Peter Geoghegan

Attachment Content-Type Size
v25-0004-Lower-the-overhead-of-nbtree-runtime-skip-checks.patch application/x-patch 32.4 KB
v25-0006-DEBUG-Add-skip-scan-disable-GUCs.patch application/x-patch 4.4 KB
v25-0005-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/x-patch 11.3 KB
v25-0002-Improve-nbtree-SAOP-primitive-scan-scheduling.patch application/x-patch 24.6 KB
v25-0003-Add-nbtree-skip-scan-optimizations.patch application/x-patch 177.2 KB
v25-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/x-patch 53.1 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-02-27 18:23:57
Message-ID: CAH2-Wz=1HyyFTuweQLZLw92ZPespgZfD7Ugh1gugT1scHgZK5Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 23, 2025 at 12:19 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The patch series recently bitrot due to conflicting changes on HEAD,
> so I decided to post a new v25 now.

Attached is v26, which has no functional changes. This is just to fix
yet more bitrot.

--
Peter Geoghegan

Attachment Content-Type Size
v26-0005-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/octet-stream 11.3 KB
v26-0004-Lower-nbtree-skip-array-maintenance-overhead.patch application/octet-stream 33.7 KB
v26-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch application/octet-stream 52.8 KB
v26-0006-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 4.4 KB
v26-0003-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 179.6 KB
v26-0002-Improve-nbtree-SAOP-primitive-scan-scheduling.patch application/octet-stream 25.5 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-08 16:43:33
Message-ID: CAH2-Wz=H_RG5weNGeUG_TkK87tRBnH9mGCQj6WpM4V4FNWKv2g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 27, 2025 at 1:23 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v26, which has no functional changes. This is just to fix
> yet more bitrot.

Attached is v27, which fixes bitrot. It also has some notable changes:

* New approach to the "Index Searches: N" instrumentation patch,
following the recent revert of the original approach after
"debug_parallel_query=regress" buildfarm failures.

This is being discussed over on the thread that I started to discuss
the EXPLAIN ANALYZE instrumentation work:

https://www.postgresql.org/message-id/CAH2-Wzk%2BcXBD1tnhQ-oagHuY9Fw5uArJE%2BLxfAP2VjZmDawbeQ%40mail.gmail.com

* New patch that makes BTMaxItemSize not require a "Page" arg, so that
it can be used in contexts where a page image isn't close at hand.
This is preparation for the approach taken to parallel index scans,
where we apply a conservative "1/3 of a page" worst case limit on the
size of a datum, but don't have a page to pass to BTMaxItemSize.

I plan on committing this one soon. It's obviously pretty pointless to
make BTMaxItemSize operate off of a page header, and not requiring
it is more flexible.

--
Peter Geoghegan

Attachment Content-Type Size
v27-0007-DEBUG-Add-skip-scan-disable-GUCs.patch application/x-patch 4.4 KB
v27-0006-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/x-patch 11.7 KB
v27-0004-Add-nbtree-skip-scan-optimizations.patch application/x-patch 174.2 KB
v27-0005-Lower-nbtree-skip-array-maintenance-overhead.patch application/x-patch 33.7 KB
v27-0001-Show-index-search-count-in-EXPLAIN-ANALYZE-take-.patch application/x-patch 98.3 KB
v27-0003-Improve-nbtree-SAOP-primitive-scan-scheduling.patch application/x-patch 24.5 KB
v27-0002-Make-BTMaxItemSize-macro-not-require-a-Page-arg.patch application/x-patch 7.2 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-11 15:52:44
Message-ID: CAH2-Wznt07d47oQP5Zi0N9=cxfbJi=T4iC6gaMAV2UMD1xJ=5w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 8, 2025 at 11:43 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I plan on committing this one soon. It's obviously pretty pointless to
> make BTMaxItemSize operate off of a page header, and not requiring
> it is more flexible.

Committed. And committed a revised version of "Show index search count
in EXPLAIN ANALYZE" that addresses the issues with non-parallel-aware
index scan executor nodes that run from a parallel worker.

Attached is v28. This is just to keep the patch series applying
cleanly -- no real changes here.

--
Peter Geoghegan

Attachment Content-Type Size
v28-0001-Improve-nbtree-SAOP-primitive-scan-scheduling.patch application/octet-stream 24.5 KB
v28-0004-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/octet-stream 11.7 KB
v28-0002-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 176.4 KB
v28-0003-Lower-nbtree-skip-array-maintenance-overhead.patch application/octet-stream 33.7 KB
v28-0005-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 4.4 KB

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-11 22:24:22
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 11.03.2025 18:52, Peter Geoghegan wrote:
> On Sat, Mar 8, 2025 at 11:43 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I plan on committing this one soon. It's obviously pretty pointless to
>> make the BTMaxItemSize operate off of a page header, and not requiring
>> it is more flexible.
> Committed. And committed a revised version of "Show index search count
> in EXPLAIN ANALYZE" that addresses the issues with non-parallel-aware
> index scan executor nodes that run from a parallel worker.
>
Hi, reviewing the code I noticed that you removed the
parallel_aware check for DSM initialization for BitmapIndexScan,
IndexScan, IndexOnlyScan,
but you didn't do the same in the ExecParallelReInitializeDSM function
and I can't figure out why to be honest. I think it might be wrong or
I'm missing something.

As I see it, this might be necessary if the parallel executor needs to
reinitialize the shared memory state before launching a fresh batch of
workers (it is based on
the comment of the ExecParallelReinitialize function), and when it
happens all child nodes reset their state (see the comment next to the
call to the ExecParallelReInitializeDSM
function).

So, what do you think? Should the check be removed in the
ExecParallelReInitializeDSM function too?

--
Regards,
Alena Rybakina
Postgres Professional


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-11 22:59:09
Message-ID: CAH2-WzkMpFsE_hM9-5tecF22jVJSGtKMFMsYqMa-uo73MOxsWw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 11, 2025 at 6:24 PM Alena Rybakina
<a(dot)rybakina(at)postgrespro(dot)ru> wrote:
> Hi, reviewing the code I noticed that you removed the
> parallel_aware check for DSM initialization for BitmapIndexScan,
> IndexScan, IndexOnlyScan,
> but you didn't do the same in the ExecParallelReInitializeDSM function
> and I can't figure out why to be honest. I think it might be wrong or
> I'm missing something.

I didn't exactly remove the check -- not completely. You could say
that I *moved* the check, from the caller (i.e. from functions in
execParallel.c such as ExecParallelInitializeDSM) to the callee (i.e.
into individual executor node functions such as
ExecIndexScanInitializeDSM). I did it that way because it's more
flexible.

We need this flexibility because we need to allocate DSM for
instrumentation state when EXPLAIN ANALYZE runs a parallel query with
an index scan node -- even when the scan node runs inside a parallel
worker, but is non-parallel-aware (parallel oblivious). Obviously, we
might also need to allocate space for a shared index scan descriptor
(including index AM opaque state), in the same way as before
(or we might need to do both).
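
Roughly sketched (made-up names, not the committed code), the
callee-side decision looks like this:

#include <stdbool.h>
#include <stddef.h>

/*
 * Each node now sizes its own DSM space: instrumentation space is
 * needed whenever EXPLAIN ANALYZE is in use -- even for a
 * parallel-oblivious scan node running inside a parallel worker --
 * while the shared scan descriptor is only needed when the node is
 * genuinely parallel aware.
 */
static size_t
index_scan_dsm_size(bool parallel_aware, bool need_instrumentation,
                    size_t descriptor_size, size_t instr_size)
{
    size_t      size = 0;

    if (parallel_aware)
        size += descriptor_size;    /* shared index scan descriptor */
    if (need_instrumentation)
        size += instr_size;         /* "Index Searches: N" counters */

    return size;
}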

> As I see it, this might be necessary if the parallel executor needs to
> reinitialize the shared memory state before launching a fresh batch of
> workers (it is based on
> the comment of the ExecParallelReinitialize function), and when it
> happens all child nodes reset their state (see the comment next to the
> call to the ExecParallelReInitializeDSM
> function).

I did not move/remove the parallel_aware check in
ExecParallelReInitializeDSM because it doesn't have the same
requirements -- we *don't* need that flexibility there, because it
isn't necessary (or correct) to reinitialize anything when the only
thing that's in DSM is instrumentation state. (Though I did add an
assertion about parallel_aware-ness to functions like
ExecIndexScanReInitializeDSM, which I thought might make this a little
clearer to people reading files like nodeIndexScan.c, and wondering
why ExecIndexScanReInitializeDSM doesn't specifically test
parallel_aware.)

Obviously, these node types don't have their state reset (quoting
ExecParallelReInitializeDSM switch statement here):

    case T_BitmapIndexScanState:
    case T_HashState:
    case T_SortState:
    case T_IncrementalSortState:
    case T_MemoizeState:
        /* these nodes have DSM state, but no reinitialization is required */
        break;

I added T_BitmapIndexScanState to the top of this list -- the rest are
from before today's commit. I did this since (like the other nodes
shown) BitmapIndexScan's use of DSM is limited to instrumentation
state -- which we never want to reset (there is no such thing as a
parallel bitmap index scan, though bitmap index scans can run in
parallel workers, and still need the instrumentation to work with
EXPLAIN ANALYZE).

We still need to call functions like ExecIndexOnlyScanReInitializeDSM
from here/ExecParallelReInitializeDSM, of course, but that won't reset
the new instrumentation state (because my patch didn't touch it at
all, except for adding that assertion I already mentioned in passing).

We actually specifically rely on *not* resetting the shared memory
state to get correct behavior in cases like this one:

https://www.postgresql.org/message-id/CAAKRu_YjBPfGp85ehY1t9NN%3DR9pB9k%3D6rztaeVkAm-OeTqUK4g%40mail.gmail.com

See comments about this in places like ExecEndBitmapIndexScan (added
by today's commit), or in ExecEndBitmapHeapScan (added by the similar
bitmap heap instrumentation patch discussed on that other thread,
which became commit 5a1e6df3).

--
Peter Geoghegan


From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-12 20:28:34
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 12.03.2025 01:59, Peter Geoghegan wrote:
> On Tue, Mar 11, 2025 at 6:24 PM Alena Rybakina
> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> Hi, reviewing the code I noticed that you removed the
>> parallel_aware check for DSM initialization for BitmapIndexScan,
>> IndexScan, IndexOnlyScan,
>> but you didn't do the same in the ExecParallelReInitializeDSM function
>> and I can't figure out why to be honest. I think it might be wrong or
>> I'm missing something.
> I didn't exactly remove the check -- not completely. You could say
> that I *moved* the check, from the caller (i.e. from functions in
> execParallel.c such as ExecParallelInitializeDSM) to the callee (i.e.
> into individual executor node functions such as
> ExecIndexScanInitializeDSM). I did it that way because it's more
> flexible.
>
> We need this flexibility because we need to allocate DSM for
> instrumentation state when EXPLAIN ANALYZE runs a parallel query with
> an index scan node -- even when the scan node runs inside a parallel
> worker, but is non-parallel-aware (parallel oblivious). Obviously, we
> might also need to allocate space for a shared index scan descriptor
> (including index AM opaque state), in the same way as before
> (or we might need to do both).
I see, and I agree with these changes now.
>> As I see it, this might be necessary if the parallel executor needs to
>> reinitialize the shared memory state before launching a fresh batch of
>> workers (it is based on
>> the comment of the ExecParallelReinitialize function), and when it
>> happens all child nodes reset their state (see the comment next to the
>> call to the ExecParallelReInitializeDSM
>> function).
> I did not move/remove the parallel_aware check in
> ExecParallelReInitializeDSM because it doesn't have the same
> requirements -- we *don't* need that flexibility there, because it
> isn't necessary (or correct) to reinitialize anything when the only
> thing that's in DSM is instrumentation state. (Though I did add an
> assertion about parallel_aware-ness to functions like
> ExecIndexScanReInitializeDSM, which I thought might make this a little
> clearer to people reading files like nodeIndexScan.c, and wondering
> why ExecIndexScanReInitializeDSM doesn't specifically test
> parallel_aware.)
>
> Obviously, these node types don't have their state reset (quoting
> ExecParallelReInitializeDSM switch statement here):
>
>     case T_BitmapIndexScanState:
>     case T_HashState:
>     case T_SortState:
>     case T_IncrementalSortState:
>     case T_MemoizeState:
>         /* these nodes have DSM state, but no reinitialization is required */
>         break;
>
> I added T_BitmapIndexScanState to the top of this list -- the rest are
> from before today's commit. I did this since (like the other nodes
> shown) BitmapIndexScan's use of DSM is limited to instrumentation
> state -- which we never want to reset (there is no such thing as a
> parallel bitmap index scan, though bitmap index scans can run in
> parallel workers, and still need the instrumentation to work with
> EXPLAIN ANALYZE).
>
> We still need to call functions like ExecIndexOnlyScanReInitializeDSM
> from here/ExecParallelReInitializeDSM, of course, but that won't reset
> the new instrumentation state (because my patch didn't touch it at
> all, except for adding that assertion I already mentioned in passing).
>
> We actually specifically rely on *not* resetting the shared memory
> state to get correct behavior in cases like this one:
>
> https://www.postgresql.org/message-id/CAAKRu_YjBPfGp85ehY1t9NN%3DR9pB9k%3D6rztaeVkAm-OeTqUK4g%40mail.gmail.com
>
> See comments about this in places like ExecEndBitmapIndexScan (added
> by today's commit), or in ExecEndBitmapHeapScan (added by the similar
> bitmap heap instrumentation patch discussed on that other thread,
> which became commit 5a1e6df3).
>
>
Thank you for the explanation!

Now I see why these changes were made.

After your additional explanations, everything really became clear and I
fully agree with the current code regarding this part.
However, I did not see an explanation of this in the commit message, nor
a comment next to the assert and the parallel_aware check explaining
why BitmapIndexScanState was added in ExecParallelReInitializeDSM.
In my opinion there is not enough explanation of this in the form of
comments, although what has already been explained here would be enough
for someone who looks at this code.

--
Regards,
Alena Rybakina
Postgres Professional


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-12 20:50:44
Message-ID: CAH2-WznDh2Vm+ZCp0VJQTUKh+_+MJZPciEdR+n-BxtQZ4SeNaQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 12, 2025 at 4:28 PM Alena Rybakina
<a(dot)rybakina(at)postgrespro(dot)ru> wrote:
> Thank you for the explanation!
>
> Now I see why these changes were made.
>
> After your additional explanations, everything really became clear and I
> fully agree with the current code regarding this part.

Cool.

> However, I did not see an explanation of this in the commit message, nor
> a comment next to the assert and the parallel_aware check explaining
> why BitmapIndexScanState was added in ExecParallelReInitializeDSM.

I added BitmapIndexScanState to the switch statement in
ExecParallelReInitializeDSM because it is in the category of
planstates that never need their shared memory reinitialized -- that's
just how we represent such a plan state there.

I think that this is supposed to serve as a kind of documentation,
since it doesn't really affect how things behave. That is, it wouldn't
actually affect anything if I had forgotten to add
BitmapIndexScanState to the ExecParallelReInitializeDSM switch
statement "case" that represents that it is in this "plan state
category": the switch ends with catch-all "default: break;".

> In my opinion there is not enough explanation of this in the form of
> comments, although what has already been explained here would be enough
> for someone who looks at this code.

What can be done to improve the situation? For example, would adding a
comment next to the new assertions recently added to
ExecIndexScanReInitializeDSM and ExecIndexOnlyScanReInitializeDSM be
an improvement? And if so, what would the comment say?

--
Peter Geoghegan


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-17 22:51:25
Message-ID: CAEze2WhiSkqifvZqrJQNgqktejJMb_-7pTWQUETN1w8a=h_-Yg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 11 Mar 2025 at 16:53, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Sat, Mar 8, 2025 at 11:43 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > I plan on committing this one soon. It's obviously pretty pointless to
> > make the BTMaxItemSize operate off of a page header, and not requiring
> > it is more flexible.
>
> Committed. And committed a revised version of "Show index search count
> in EXPLAIN ANALYZE" that addresses the issues with non-parallel-aware
> index scan executor nodes that run from a parallel worker.
>
> Attached is v28. This is just to keep the patch series applying
> cleanly -- no real changes here.

You asked off-list for my review of 0003. I'd already reviewed 0001
before that, so that review is also included. I'll see if I can spend
some time on the other patches too, but for 0003 I think I got some
good consistent feedback.

0001:

> src/backend/access/nbtree/nbtsearch.c
> _bt_readpage

This hasn't changed meaningfully in this patch, but I noticed that
pstate.finaltup is never set for the final page of the scan direction
(i.e. P_RIGHTMOST or P_LEFTMOST for forward or backward,
respectively). If it isn't used more than once after the first element
of non-P_RIGHTMOST/LEFTMOST pages, why is it in pstate? Or, if it is
used more than once, why shouldn't it be used in

Apart from that, 0001 looks good to me.

0003:

> _bt_readpage

In forward scan mode, recovery from forcenonrequired happens after the
main loop over all page items. In backward mode, it's in the loop:

> + if (offnum == minoff && pstate.forcenonrequired)
> + {
> + Assert(so->skipScan);

I think there's a comment missing that details _why_ we do this;
probably something like:

/*
* We're about to process the final item on the page.
* Un-set forcenonrequired, so the next _bt_checkkeys will
* evaluate required scankeys and signal an end to this
* primitive scan if we've reached a stopping point.
*/

In line with that, could you explain a bit more about the
pstate.forcenonrequired optimization? I _think_ it's got something to
do with "required" scankeys adding some overhead per scankey, which
can be significant with skipscan evaluations and ignoring the
requiredness can thus save some cycles, but the exact method doesn't
seem to be very well articulated.

> _bt_skip_ikeyprefix

I _think_ it's worth special-casing firstchangingattnum=1, as in that
case we know in advance there is no (immediate) common ground between
the index tuples and thus any additional work we do towards parsing
the scankeys would be wasted - except for matching inequality bounds
for firstchangingatt, or matching "open" skip arrays for a prefix of
attributes starting at firstchangingattnum (as per the
array->null_elem case).

I also noticed some other missed opportunities for optimizing
page accesses:

> + if (key->sk_strategy != BTEqualStrategyNumber)

The code halts optimizing "prefix prechecks" when we notice a
non-equality key. It seems to me that we can do the precheck on shared
prefixes with non-equality keys just the same as with equality keys;
and it'd improve performance in those cases, too.

> + if (!(key->sk_flags & SK_SEARCHARRAY))
> + if (key->sk_attno < firstchangingattnum)
> + {
> + if (result == 0)
> + continue; /* safe, = key satisfied by every tuple */
> + }
> + break; /* pstate.ikey to be set to scalar key's ikey */

This code finds out that no tuple on the page can possibly match the
scankey (idxtup=scalar returns non-0 value) but doesn't (can't) use it
to exit the scan. I think that's a missed opportunity for
optimization; now we have to figure that out for every tuple in the
scan. The same applies to the SAOP array case (i.e. non-skiparray).

Thank you for working on this.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-18 10:54:48
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers


On 12.03.2025 23:50, Peter Geoghegan wrote:
> On Wed, Mar 12, 2025 at 4:28 PM Alena Rybakina
> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> Thank you for the explanation!
>>
>> Now I see why these changes were made.
>>
>> After your additional explanations, everything really became clear and I
>> fully agree with the current code regarding this part.
> Cool.
>
>> However, I did not see an explanation of this in the commit message, nor
>> a comment next to the assert and the parallel_aware check explaining
>> why BitmapIndexScanState was added in ExecParallelReInitializeDSM.
> I added BitmapIndexScanState to the switch statement in
> ExecParallelReInitializeDSM because it is in the category of
> planstates that never need their shared memory reinitialized -- that's
> just how we represent such a plan state there.
>
> I think that this is supposed to serve as a kind of documentation,
> since it doesn't really affect how things behave. That is, it wouldn't
> actually affect anything if I had forgotten to add
> BitmapIndexScanState to the ExecParallelReInitializeDSM switch
> statement "case" that represents that it is in this "plan state
> category": the switch ends with catch-all "default: break;".
Agree.
>> In my opinion there is not enough explanation of this in the form of
>> comments, although what has already been explained here would be enough
>> for someone who looks at this code.
> What can be done to improve the situation? For example, would adding a
> comment next to the new assertions recently added to
> ExecIndexScanReInitializeDSM and ExecIndexOnlyScanReInitializeDSM be
> an improvement? And if so, what would the comment say?
>
After reviewing the logic again, I realized that I was confused
precisely about the reinitialization of memory for IndexScanState and
IndexOnlyScanState.

As far as I can see, either the assert is not needed here, because the
functions ExecIndexScanReInitializeDSM and
ExecIndexOnlyScanReInitializeDSM can be called only if parallel_aware
is true, or reinitialization is needed only if parallel_aware is true,
in which case the condition noted above is not needed. According to
your letter [0], the check should be removed there too, but I got
confused by the comment. We do not need to reinitialize memory because
the DSM is instrumentation state only, but it turns out that we are
reinitializing the memory -- or do we not do it at all?

I attached a diff file to the letter with the comment.

[0]
https://www.postgresql.org/message-id/CAH2-WzkMpFsE_hM9-5tecF22jVJSGtKMFMsYqMa-uo73MOxsWw%40mail.gmail.com

--
Regards,
Alena Rybakina
Postgres Professional

Attachment Content-Type Size
for_reinitialize_memory.diff.no-cfbot text/plain 946 bytes

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-18 13:37:33
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers


On 18.03.2025 13:54, Alena Rybakina wrote:
>
>
> On 12.03.2025 23:50, Peter Geoghegan wrote:
>> On Wed, Mar 12, 2025 at 4:28 PM Alena Rybakina
>> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>>> Thank you for the explanation!
>>>
>>> Now I see why these changes were made.
>>>
>>> After your additional explanations, everything really became clear and I
>>> fully agree with the current code regarding this part.
>> Cool.
>>
>>> However I did not see an explanation to the commit regarding this place,
>>> as well as a comment next to the assert and the parallel_aware check and
>>> why BitmapIndexScanState was added in the ExecParallelReInitializeDSM.
>> I added BitmapIndexScanState to the switch statement in
>> ExecParallelReInitializeDSM because it is in the category of
>> planstates that never need their shared memory reinitialized -- that's
>> just how we represent such a plan state there.
>>
>> I think that this is supposed to serve as a kind of documentation,
>> since it doesn't really affect how things behave. That is, it wouldn't
>> actually affect anything if I had forgotten to add
>> BitmapIndexScanState to the ExecParallelReInitializeDSM switch
>> statement "case" that represents that it is in this "plan state
>> category": the switch ends with catch-all "default: break;".
> Agree.
>>> In my opinion, there is not enough additional explanation about this in
>>> the form of comments, although I think that it has already been
>>> explained here enough for someone who will look at this code.
>> What can be done to improve the situation? For example, would adding a
>> comment next to the new assertions recently added to
>> ExecIndexScanReInitializeDSM and ExecIndexOnlyScanReInitializeDSM be
>> an improvement? And if so, what would the comment say?
>>
> After reviewing the logic again, I realized that I was confused
> precisely by the reinitialization of memory for IndexScanState and
> IndexOnlyScanState.
>
> As far as I can see, either the assert is not needed here -- the
> functions ExecIndexScanReInitializeDSM and
> ExecIndexOnlyScanReInitializeDSM can only be called when
> parallel_aware is true -- or reinitialization only makes sense when
> parallel_aware is true, in which case the condition noted above is
> not needed. According to your letter [0], the check should be removed
> there too, but I got confused by the comment. We do not need to
> reinitialize memory because the DSM is instrumentation state only,
> but it turns out that we are reinitializing the memory, so do we not
> do it at all?
>
> I attached a diff file to the letter with the comment.
>
> [0]
> https://www.postgresql.org/message-id/CAH2-WzkMpFsE_hM9-5tecF22jVJSGtKMFMsYqMa-uo73MOxsWw%40mail.gmail.com
>
>
Sorry, I figured it out. The Assert was added to avoid misuse of the
functions that reinitialize memory, and to ensure that reinitialization
only happens when parallel_aware is true.

--
Regards,
Alena Rybakina
Postgres Professional


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-18 16:41:18
Message-ID: CAH2-Wznbf4X-rJWGReTCKtA8U0+j0A6mkjx1oRU5FrmYN00p2A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 18, 2025 at 9:37 AM Alena Rybakina
<a(dot)rybakina(at)postgrespro(dot)ru> wrote:
> Sorry, I figured it out. The Assert was added to avoid misuse of the functions that reinitialize memory, and to ensure that reinitialization only happens when parallel_aware is true.

Yeah. The assertion is supposed to suggest "don't worry, I didn't
forget to test parallel_aware in this function
[ExecIndexScanReInitializeDSM or ExecIndexOnlyScanReInitializeDSM],
it's just that I expect to not be called unless parallel_aware is
true".

An alternative approach would have been to explicitly handle
parallel_aware=false in ExecIndexScanReInitializeDSM and
ExecIndexOnlyScanReInitializeDSM: they could just return early. That
would make ExecIndexScanReInitializeDSM and
ExecIndexOnlyScanReInitializeDSM more consistent with other nearby
functions (in nodeIndexScan.c and nodeIndexOnlyScan.c), at the cost of
making the calling code (in execParallel.c) less consistent. I really
don't think that that alternative is any better than what I actually
did (I actually considered doing things this way myself at one point),
though it also doesn't seem worse. So I'm inclined to do nothing about
it now.
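
For reference, here's a minimal sketch of the shape of the function
with the assertion in place (illustrative only -- consult the tree for
the committed version):

#include "postgres.h"

#include "access/genam.h"
#include "access/parallel.h"
#include "nodes/execnodes.h"

void
ExecIndexScanReInitializeDSM(IndexScanState *node, ParallelContext *pcxt)
{
    /*
     * execParallel.c only calls here for parallel-aware index scans, so
     * document that expectation with an assertion, rather than testing
     * parallel_aware at runtime.
     */
    Assert(node->ss.ps.plan->parallel_aware);

    /* reset the parallel index scan's shared state */
    index_parallelrescan(node->iss_ScanDesc);
}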

--
Peter Geoghegan


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-18 17:01:00
Message-ID: CAEze2WjgrLPBt9YBCCYY1p=9TSjQgL5xoaeH3ob3gmekTD3XOw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 17 Mar 2025 at 23:51, Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
>
> On Tue, 11 Mar 2025 at 16:53, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> >
> > On Sat, Mar 8, 2025 at 11:43 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > > I plan on committing this one soon. It's obviously pretty pointless to
> > > make the BTMaxItemSize operate off of a page header, and not requiring
> > > it is more flexible.
> >
> > Committed. And committed a revised version of "Show index search count
> > in EXPLAIN ANALYZE" that addresses the issues with non-parallel-aware
> > index scan executor nodes that run from a parallel worker.
> >
> > Attached is v28. This is just to keep the patch series applying
> > cleanly -- no real changes here.
>
> You asked off-list for my review of 0003. I'd already reviewed 0001
> before that, so that review also included. I'll see if I can spend
> some time on the other patches too

My comments on 0004:

> _bt_skiparray_strat_decrement
> _bt_skiparray_strat_increment

In both functions the generated value isn't used when the
increment/decrement overflows (and thus invalidates the qual), or when
the opclass somehow doesn't have a <= or >= operator, respectively.
For by-value types that's not much of an issue, but for by-reference
types (such as uuid, or bigint on 32-bit systems) that's not great, as
btree explicitly disallows leaks in the increment/decrement functions,
and yet now we use those functions and leak the values.

Additionally, the code is essentially duplicated between the two
functions, with the only differences being which sksup function to
call, which operator strategies to check, and where to retrieve/store
the value. It's only 2 instances total, but if you can figure out how
to make a nice single function from the two, that'd be appreciated, as
it reduces duplication and the chances of divergence.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-18 19:15:20
Message-ID: CAH2-Wz=G+amASrOh3FJ+T1N0B_2bhAKbB5aq-f3YUCXk-sVVLQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 17, 2025 at 6:51 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> This hasn't changed meaningfully in this patch, but I noticed that
> pstate.finaltup is never set for the final page of the scan direction
> (i.e. P_RIGHTMOST or P_LEFTMOST for forward or backward,
> respectively). If it isn't used more than once after the first element
> of non-P_RIGHTMOST/LEFTMOST pages, why is it in pstate? Or, if it is
> used more than once, why shouldn't it be used in

We don't set pstate.finaltup on either the leftmost or rightmost page
(nothing new, this is all from the Postgres 17 SAOP patch) because we
cannot possibly need to start another primitive index scan from that
point. We only use pstate.finaltup to determine if we should start a
new primitive index scan, every time the scan's array keys advance
(except during "sktrig_required=false" array advancement, and barring
the case where we don't have to check pstate.finaltup because we see
that the new set of array keys are already an exact match for the
tuple passed to _bt_advance_array_keys). In short, pstate.finaltup is
used during a significant fraction of all calls to
_bt_advance_array_keys, and there is no useful limit on the number of
times that pstate.finaltup can be used per _bt_readpage.

Although you didn't say it in your email (you said it to me on our
most recent call), I believe that you're asking about pstate.finaltup
for reasons that are more related to 0003-* than to 0001-*. As I
recall, you asked me about why it was that the 0003-* patch avoids
_bt_readpage's call to _bt_skip_ikeyprefix on the leftmost or
rightmost leaf page (i.e. a page that lacks a pstate.finaltup). You
were skeptical about that -- understandably so.

To recap, 0003-* avoids calling _bt_skip_ikeyprefix on the rightmost
(and leftmost) page because it reasons that activating that
optimization is only okay when we know that we'll be able to "recover"
afterwards, during the finaltup _bt_checkkeys call. By "recover", I
mean restore the invariant for required array keys: that they track
our progress through the key space. It felt safer and easier for
0003-* to just not call _bt_skip_ikeyprefix on a page that has no
finaltup -- that way we won't have anything that we need to recover
from, when we lack the pstate.finaltup that we generally expect to use
for this purpose. This approach allowed me to add the following
assertions a little bit later on, right at the end of _bt_readpage
(quoting 0003-* here):

@@ -1993,6 +2031,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
         so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
     }

+ /*
+ * As far as our caller is concerned, the scan's arrays always track its
+ * progress through the index's key space.
+ *
+ * If _bt_skip_ikeyprefix told us to temporarily treat all scan keys as
+ * nonrequired (during a skip scan), then we must recover afterwards by
+ * advancing our arrays using finaltup (with !pstate.forcenonrequired).
+ */
+ Assert(pstate.ikey == 0 && !pstate.forcenonrequired);
+
return (so->currPos.firstItem <= so->currPos.lastItem);
}

I've actually come around to your point of view on this (or what I
thought was your PoV from our call). That is, I now *think* that it
would be better if the code added by 0003-* called
_bt_skip_ikeyprefix, without regard for whether or not we'll have a
finaltup _bt_checkkeys call to "recover" (i.e. whether we're on the
leftmost or rightmost page shouldn't matter).

My change in perspective on this question is related to another change
of perspective, on the question of whether we actually need to call
_bt_start_array_keys as part of "recovering/restoring the array
invariant", just ahead of the finaltup _bt_checkkeys call. As you
know, 0003-* calls _bt_start_array_keys in this way, but that now
seems like overkill. It can have undesirable side-effects when the
array keys spuriously appear to advance, when in fact they were
restarted via the _bt_start_array_keys call, only to have their
original values restored via the finaltup call that immediately
follows.

Here's what I mean about side-effects:

We shouldn't allow _bt_advance_array_keys' primitive index scan
scheduling logic to falsely believe that the array keys have advanced,
when they didn't really advance in any practical sense. The commit
message of 0003-* promises that its optimization cannot influence
primitive index scan scheduling at all, which seems like a good
general principle for us to follow. But I recently discovered that
that promise was subtly broken, which I tied to the way that 0003-*
calls _bt_start_array_keys (this didn't produce wrong answers, and
didn't even really hurt performance, but it seems wonky and hard to
test). So I now think that I need to expect more from
_bt_skip_ikeyprefix and its pstate.forcenonrequired optimization, so
that my promise about primscan scheduling is honored.

Tying it back to your concern, once I do that (once I stop calling
_bt_start_array_keys in 0003-* to "hard reset" the arrays), I can also
stop caring about finaltup being set on the rightmost page, at the
point where we decide if _bt_skip_ikeyprefix should be called.

Here's how I think that this will be safe:

Obviously, we can't expect _bt_skip_ikeyprefix/pstate.forcenonrequired
mode to maintain the scan's required arrays in the usual way -- the
whole idea in 0003-* is to stop properly maintaining the arrays, until
right at the end of the _bt_readpage call, so as to save cycles.
However, it should be possible to teach _bt_skip_ikeyprefix to not
leave the array keys in an irredeemably bad state -- even when no
finaltup call is possible during the same _bt_readpage. And, even when
the scan direction changes at an inconvenient time. The next revision
(v29) is likely to strengthen the guarantees that _bt_skip_ikeyprefix
makes about the state that it'll leave the scan's array keys in, so
that its _bt_readpage caller can be laxer about "fully undoing" its
call to _bt_skip_ikeyprefix. The invariants we need to restore only
really apply when the scan needs to continue in the same scan
direction for at least one more page.

In short, as long as the array keys can never "go backwards" (relative
to the current scan direction), then we'll be able to recover during
the next conventional call to _bt_checkkeys (meaning the next
!pstate.forcenonrequired call) -- even if that next call should happen
on some other page (in practice it is very unlikely that it will, but
we can still be prepared for that). While it's true that
_bt_advance_array_keys (with sktrig_required=true) always promises to
advance the array keys to the maximum possible extent that it can know
to be safe, based on the caller's tuple alone, that in itself doesn't
obligate _bt_readpage to make sure that _bt_advance_array_keys will be
called when the top-level scan is over. We have never expected
_bt_advance_array_keys to *reliably* reach the final set of array keys
at the end of the scan (e.g., this won't happen when the index is
completely empty, since we'll never call _bt_readpage in the first
place).

> In forward scan mode, recovery from forcenonrequired happens after the
> main loop over all page items. In backward mode, it's in the loop:
>
> > + if (offnum == minoff && pstate.forcenonrequired)
> > + {
> > + Assert(so->skipScan);
>
> I think there's a comment missing that details _why_ we do this;
> probably something like:
>
> /*
> * We're about to process the final item on the page.
> * Un-set forcenonrequired, so the next _bt_checkkeys will
> * evaluate required scankeys and signal an end to this
> * primitive scan if we've reached a stopping point.
> */

I think that the right place to talk about this is above
_bt_skip_ikeyprefix itself.

> In line with that, could you explain a bit more about the
> pstate.forcenonrequired optimization? I _think_ it's got something to
> do with "required" scankeys adding some overhead per scankey, which
> can be significant with skipscan evaluations and ignoring the
> requiredness can thus save some cycles, but the exact method doesn't
> seem to be very well articulated.

The main benefit of the _bt_skip_ikeyprefix optimization is that it
allows us to skip a pstate.ikey prefix of scan keys in many important
cases. But that is not compatible with certain existing
_bt_advance_array_keys "sktrig_required=true" optimizations.

Most notably, we cannot assume that the array keys perfectly track our
progress through the index's key space when calling
_bt_advance_array_keys with "sktrig_required=false". In particular, it
would be wrong to allow the SAOP array binary search
cur_elem_trig=true optimization (see _bt_binsrch_array_skey) to be
used. We also don't want to attempt to end the primscan during
"sktrig_required=false" calls to _bt_advance_array_keys (nothing about
that is new here, it's just that _bt_skip_ikeyprefix now temporarily
forces the scan to behave this way, dynamically, rather than it being
static behavior that is fixed for the whole scan).

The underlying problem is that "sktrig_required=false" array
advancement cannot precisely reason about the relationship between the
scan's progress and the current required array key positions. For
example, with a query "WHERE a BETWEEN 0 AND 100 AND b in (42, 44)",
on a page whose "a" attribute values all satisfy the range qual on "a"
(i.e. with pstate.ikey = 1), our "a" skip array won't advance at all
(if we didn't use the _bt_skip_ikeyprefix optimization then we'd only
ever do "sktrig_required=false" advancement, and the skip array might
advance several times within a page, but we're ignoring "a" here). We
cannot reuse work across "b" SAOP binary searches, because in general
we're not paying attention to "a" at all -- and so we won't even try
to roll over "b" when the value of "a" advances (we're just looking at
"b", never "a").

> > _bt_skip_ikeyprefix
>
> I _think_ it's worth special-casing firstchangingattnum=1, as in that
> case we know in advance there is no (immediate) common ground between
> the index tuples and thus any additional work we do towards parsing
> the scankeys would be wasted - except for matching inequality bounds
> for firstchangingatt, or matching "open" skip arrays for a prefix of
> attributes starting at firstchangingattnum (as per the
> array->null_elem case).

Not sure what you mean by "special-casing firstchangingattnum=1"? What
"additional work we do towards parsing the scankeys" are you concerned
about?

It's fairly common for firstchangingattnum=1, even when the
_bt_skip_ikeyprefix optimization is working well. For example, that's
what'd happen with the example query I just gave (a query "WHERE a
BETWEEN 0 AND 100 AND b in (42, 44)" can skip "a" by setting
pstate.ikey=1, provided all of the "a" attribute values on the page
are within the range of the skip array).
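
Schematically, that page-level test amounts to something like this (a
hypothetical sketch, with invented names and plain ints standing in
for datums -- not the patch's actual code):

static int
page_start_ikey(int page_min_a, int page_max_a,
                int skiparray_low, int skiparray_high)
{
    /*
     * Leaf pages are ordered, so if the page's minimum and maximum "a"
     * values both fall inside the range skip array's bounds, every
     * tuple on the page must satisfy the "a" key.
     */
    if (page_min_a >= skiparray_low && page_max_a <= skiparray_high)
        return 1;           /* start at the "b" key (pstate.ikey = 1) */

    return 0;               /* must evaluate the "a" key per tuple */
}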

> I also noticed some other missed opportunities for optimizing
> page accesses:
>
> > + if (key->sk_strategy != BTEqualStrategyNumber)
>
> The code halts optimizing "prefix prechecks" when we notice a
> non-equality key. It seems to me that we can do the precheck on shared
> prefixes with non-equality keys just the same as with equality keys;
> and it'd improve performance in those cases, too.

Yeah, I was thinking of doing this already (though not for RowCompare
inequalities, which would be hard to evaluate from here). It makes
sense because it's exactly the same case as the range skip array case,
really -- why not just do it the same way?

> > + if (!(key->sk_flags & SK_SEARCHARRAY))
> > + if (key->sk_attno < firstchangingattnum)
> > + {
> > + if (result == 0)
> > + continue; /* safe, = key satisfied by every tuple */
> > + }
> > + break; /* pstate.ikey to be set to scalar key's ikey */
>
> This code finds out that no tuple on the page can possibly match the
> scankey (idxtup=scalar returns non-0 value) but doesn't (can't) use it
> to exit the scan. I think that's a missed opportunity for
> optimization; now we have to figure that out for every tuple in the
> scan. Same applies to the SAOP array case (i.e. non-skip-array).

Maybe, but it's not much of a missed opportunity. It doesn't guarantee
that the scan can end in the case of a SAOP (the very next leaf page
could satisfy the same scan key, given a SAOP array with "gaps" in the
elements). So it can only really end the scan with a scalar = key --
though never when it is preceded by a skip array (doesn't matter if
the skip array is definitely satisfied/has only one distinct attribute
value on the page). Is this idea related to your previous idea
involving "special-casing firstchangingattnum=1"?

If I was going to do something like this, I think that it'd work by
backing out of applying the optimization entirely. Right now, 0003-*
applies the optimization whenever _bt_readpage decides to call
_bt_skip_ikeyprefix, regardless of the details after that (it would be
easy to teach _bt_skip_ikeyprefix to decide against applying its
optimization entirely, but testing seems to show that it always makes
sense to go ahead when _bt_skip_ikeyprefix is called, regardless of
what _bt_skip_ikeyprefix sees on the page).

> Thank you for working on this.

Thanks for the review!

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-18 19:31:41
Message-ID: CAH2-Wz=bPktdXB6+VrD6T=n4ic-Czv6zHH4WP4jAKBfOHkRivg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 18, 2025 at 1:01 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> My comments on 0004:
>
> > _bt_skiparray_strat_decrement
> > _bt_skiparray_strat_increment
>
> In both functions the generated value isn't used when the
> increment/decrement overflows (and thus invalidates the qual), or
> when the opclass somehow doesn't have a <= or >= operator,
> respectively.
> For by-value types that's not much of an issue, but for by-reference
> types (such as uuid, or bigint on 32-bit systems) that's not great,
> as btree explicitly disallows leaks in the increment/decrement
> functions, and yet now we use those functions and leak the values.

We don't leak any memory here. We temporarily switch over to using
so->arrayContext, for the duration of these calls. And so the memory
will be freed on rescan -- there's a MemoryContextReset call at the
top of _bt_preprocess_array_keys to take care of this, which is hit on
rescan.
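
In other words, it's the usual memory context dance. A hypothetical
helper, purely for illustration (not the patch's actual code):

#include "postgres.h"

#include "access/nbtree.h"

/*
 * Run an increment/decrement callback with so->arrayContext as the
 * current memory context.  Anything it pallocs is then reclaimed by
 * the MemoryContextReset() at the top of _bt_preprocess_array_keys,
 * which runs on rescan -- so nothing leaks.
 */
static Datum
call_in_array_context(BTScanOpaque so, Datum (*incdec) (Datum), Datum arg)
{
    MemoryContext oldcontext = MemoryContextSwitchTo(so->arrayContext);
    Datum         result = incdec(arg);

    MemoryContextSwitchTo(oldcontext);

    return result;
}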

> Additionally, the code is essentially duplicated between the two
> functions, with the only differences being which sksup function to
> call, which operator strategies to check, and where to retrieve/store
> the value. It's only 2 instances total, but if you can figure out how
> to make a nice single function from the two, that'd be appreciated,
> as it reduces duplication and the chances of divergence.

I'll see if I can do something like that for the next revision, but I
think that it might be more awkward than it's worth.

The duplication isn't quite duplication, so much as it's two functions
that are mirror images of each other. The details/direction of things
are flipped in a number of places. The strategy number differs, even
though the same function is called.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-19 21:08:38
Message-ID: CAH2-Wz=E-mMtFZLq-y7C0+EfaAKtypm=Fm28dYBZHtVRmJrH3g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 18, 2025 at 3:15 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I've actually come around to your point of view on this (or what I
> thought was your PoV from our call). That is, I now *think* that it
> would be better if the code added by 0003-* called
> _bt_skip_ikeyprefix, without regard for whether or not we'll have a
> finaltup _bt_checkkeys call to "recover" (i.e. whether we're on the
> leftmost or rightmost page shouldn't matter).
>
> My change in perspective on this question is related to another change
> of perspective, on the question of whether we actually need to call
> _bt_start_array_keys as part of "recovering/restoring the array
> invariant", just ahead of the finaltup _bt_checkkeys call. As you
> know, 0003-* calls _bt_start_array_keys in this way, but that now
> seems like overkill. It can have undesirable side-effects when the
> array keys spuriously appear to advance, when in fact they were
> restarted via the _bt_start_array_keys call, only to have their
> original values restored via the finaltup call that immediately
> follows.

> Tying it back to your concern, once I do that (once I stop calling
> _bt_start_array_keys in 0003-* to "hard reset" the arrays), I can also
> stop caring about finaltup being set on the rightmost page, at the
> point where we decide if _bt_skip_ikeyprefix should be called.

Attached is v29, which teaches _bt_skip_ikeyprefix to do things along
these lines (_bt_skip_ikeyprefix is added by 0003-*, same as last
time). That's the first notable change for v29.

> > The code halts optimizing "prefix prechecks" when we notice a
> > non-equality key. It seems to me that we can do the precheck on shared
> > prefixes with non-equality keys just the same as with equality keys;
> > and it'd improve performance in those cases, too.
>
> Yeah, I was thinking of doing this already (though not for RowCompare
> inequalities, which would be hard to evaluate from here). It makes
> sense because it's exactly the same case as the range skip array case,
> really -- why not just do it the same way?

Second notable change for v29:

v29 also teaches 0003-* to handle plain inequalities (not just range
skip arrays) within _bt_skip_ikeyprefix, as I outlined to Matthias
here.

FWIW this doesn't really help much in practice, because scans that
benefit only do so to a limited degree, under fairly narrow
circumstances (I can explain what I mean by that if you're interested,
but it's not all that interesting). Even still, as I said the other
day, I think that it's worth doing this on consistency grounds. The
underlying rules that make this safe are literally identical to the
rules for range skip arrays (where determining if a key can be skipped
in _bt_skip_ikeyprefix tends to be much more important), so not doing
it with simple lower-order inequalities could confuse the reader.

Third notable change for v29:

Quite a few small improvements have been made to our new cost model,
in selfuncs.c/btcostestimate (see 0002-*, the main commit/patch that
introduces skip scan). The code is much better commented, and is more
idiomatic.

Most notably, I'm now using clauselist_selectivity() to adjust
ndistinct, as part of estimating the scan's num_sa_scans (I'm no
longer doing that in an ad-hoc way). The control flow is a lot easier
to understand. There are now a couple of new cases where we
conservatively fall back on using the old costing (i.e. we cost the
scan as if it was a Postgres 17 full index scan) for lack of a better
idea about what to do. We do this when we detect a default/generic
ndistinct estimate (we chicken out there), and we do it when we see a
range of values/a set of RestrictInfos against a single column whose
selectivity is already relatively high (specifically, under
DEFAULT_RANGE_INEQ_SEL, meaning a selectivity that's under half a
percentage point).
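
Schematically, the fallback rule looks something like this
(hypothetical names, illustration only -- see btcostestimate in
selfuncs.c for the real thing):

#include "postgres.h"

#include "utils/selfuncs.h"

/*
 * Cost the scan as a Postgres 17-style full index scan whenever we
 * can't trust the inputs to the skip scan cost model.
 */
static bool
use_skip_scan_costing(bool ndistinct_is_default, Selectivity column_sel)
{
    if (ndistinct_is_default)
        return false;       /* chicken out on a default/generic estimate */

    if (column_sel < DEFAULT_RANGE_INEQ_SEL)
        return false;       /* column quals already highly selective */

    return true;
}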

Here are my plans around committing the patches:

* 0001-* can be committed within a matter of days. Probably before the
week is out.

It is committable now, and is independently useful work.

* The remaining patches in the patch series (0002-* through to 0004-*)
are very close to being committable, but are still not quite ready.

I'm going to hold off on committing the big patches until late next
week, at the earliest. My current best estimate is that the bulk of
this work (0002-* through to 0004-*) will be committed together, on or
about April 2 (I suppose that I could commit 0004-* a few days later,
just to give things time to settle, but it wouldn't make sense to
commit 0002-* without also committing 0003-* immediately afterwards).

The tricky logic added by 0003-* is my single biggest outstanding
concern about the patch series. Particularly its potential to confuse
the existing invariants for required arrays. And particularly in light
of the changes that I made to 0003-* in the past few days. In short, I
need to think long and hard about the stuff I described under "Here's
how I think that this will be safe:" in my email to Matthias from
yesterday. The precondition and postcondition assertions in
_bt_advance_array_keys (added to Postgres 17) remain effective, which
gives me a certain amount of confidence in the changes made by 0003-*.

Thanks
--
Peter Geoghegan

Attachment Content-Type Size
v29-0001-Improve-nbtree-array-primitive-scan-scheduling.patch application/octet-stream 26.8 KB
v29-0004-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/octet-stream 11.7 KB
v29-0002-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 179.0 KB
v29-0003-Lower-nbtree-skip-array-maintenance-overhead.patch application/octet-stream 35.0 KB
v29-0005-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 5.4 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-21 15:36:18
Message-ID: CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 19, 2025 at 5:08 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The tricky logic added by 0003-* is my single biggest outstanding
> concern about the patch series. Particularly its potential to confuse
> the existing invariants for required arrays. And particularly in light
> of the changes that I made to 0003-* in the past few days.

A big part of the concern here is with the existing pstate.prechecked
optimization (the one added to Postgres 17 by Alexander Korotkov's
commit e0b1ee17). It now seems quite redundant -- the new
_bt_skip_ikeyprefix mechanism added by my 0003-* patch does the same
thing, but does it better (especially since I taught
_bt_skip_ikeyprefix to deal with simple inequalities in v29). I now
think that it makes most sense to totally replace pstate.prechecked
with _bt_skip_ikeyprefix -- we should use _bt_skip_ikeyprefix during
every scan (not just during skip scans, not just during scans with
SAOP array keys), and be done with it.

To recap, right now the so->scanBehind flag serves two roles:

1. It lets us know that it is unsafe to apply the pstate.prechecked
optimization when _bt_readpage gets to the next page.

2. It lets us know that a recheck of the finaltup tuple is required on
the next page (right now that is limited to forwards scans that
encounter a truncated high key, but I intend to expand it in the
0001-* patch to other cases where we want to move onto the next page
without being 100% sure that it's the right thing to do).

Role #1 is related to the fact that pstate.prechecked works in a way
that doesn't really know anything about SAOP array keys. My Postgres
17 commit 5bf748b8 added role #1, to work around the problem of the
precheck being confused in cases where the scan's array keys are
advanced before we reach the final tuple on the page (the precheck
tuple). That way, scans with array keys were still able to safely use
the pstate.prechecked optimization, at least in simpler cases.

The new, similar _bt_skip_ikeyprefix mechanism has no need to check
scanBehind like this. That's because it is directly aware of both SAOP
arrays and skip arrays (it doesn't actually examine the sk_argument
from a scan key associated with an array, it just looks at the array
itself). That's a big advantage -- especially in light of the
scheduling improvements from 0001-*, which will make many more
_bt_readpage calls see the so->scanBehind flag as set (making us not
use pstate.prechecked). The new scheduling stuff from 0001-* affects
scans with skip arrays and SAOP arrays in just the same way, and yet
right now there's an unhelpful disconnect between what each case will
do once it reaches the next page. This is probably the single biggest
point that makes pstate.prechecked seem like it clashes with
_bt_skip_ikeyprefix: it undermines the principle that array type (SAOP
array vs skip array) should not matter anywhere outside of the lowest
level nbtutils.c code, which is a principle that now seems important.

Another key advantage of the _bt_skip_ikeyprefix mechanism is that it
can plausibly work in many cases that pstate.prechecked currently
can't handle at all. The design of pstate.prechecked makes it an "all
or nothing" thing -- it either works for every scan key required in
the scan direction, or it works for none at all. Not so for
_bt_skip_ikeyprefix; it can set pstate.ikey to skip a prefix of = scan
keys (a prefix of keys known to satisfy all tuples on pstate.page),
without the prefix necessarily including every = scan key. That
flexibility could matter a lot.

Furthermore, it can do this (set pstate.ikey to a value greater than
0) *without* needing to set pstate.forcenonrequired to make it safe
(pstate.forcenonrequired must be set during scans with array keys to
set pstate.ikey to a non-zero value, but otherwise doesn't need to be
set). In other words, it's possible for most scans to use
_bt_skip_ikeyprefix without having to commit themselves to scan all of
the tuples on the page (unlike pstate.prechecked).
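
To make the payoff concrete, here's a hypothetical sketch of the
per-tuple key evaluation loop (evaluate_key is an invented stand-in
for _bt_check_compare's per-key work; this is not the patch's code):

#include "postgres.h"

#include "access/itup.h"
#include "access/skey.h"

/* invented stand-in for _bt_check_compare's per-key evaluation */
static bool evaluate_key(ScanKey key, IndexTuple itup, TupleDesc tupdesc);

/*
 * Scan keys before startikey (pstate.ikey) were already proven to be
 * satisfied by every tuple on the page, so per-tuple evaluation begins
 * partway into the scan key array.
 */
static bool
tuple_satisfies_remaining_keys(ScanKey keys, int nkeys, int startikey,
                               IndexTuple itup, TupleDesc tupdesc)
{
    for (int ikey = startikey; ikey < nkeys; ikey++)
    {
        if (!evaluate_key(&keys[ikey], itup, tupdesc))
            return false;
    }

    return true;
}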

I'm pretty sure that this is the right direction already, for the
reasons given, but also because I already have a draft version that
passes all tests. It significantly improves performance in many of my
most important microbenchmarks: unsympathetic cases that previously
had a 5%-10% regression in query execution time end up with only 2%-5%
regressions. I think that that's because the draft patch completely
removes the "prechecked" code path from _bt_check_compare, which is a
very hot function -- especially during these adversarial
microbenchmarks.

Under this new scheme, so->scanBehind is strictly a flag that
indicates that a recheck is scheduled, to be performed once the scan
calls _bt_readpage for the next page. It no longer serves role #1,
only role #2. That seems significantly simpler.

Note that I'm *not* proposing to remove/replace the similar
pstate.firstmatch optimization (also added by Alexander Korotkov's
commit e0b1ee17). That's still independently useful (it'll complement
_bt_skip_ikeyprefix in just the same way as it complemented
pstate.prechecked).

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-22 17:47:41
Message-ID: CAH2-WzmnWZmwJjKBV0XYRGsYsXYqfOhqu4_+gKpQgV8dv=59Pg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 21, 2025 at 11:36 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> A big part of the concern here is with the existing pstate.prechecked
> optimization (the one added to Postgres 17 by Alexander Korotkov's
> commit e0b1ee17). It now seems quite redundant -- the new
> _bt_skip_ikeyprefix mechanism added by my 0003-* patch does the same
> thing, but does it better (especially since I taught
> _bt_skip_ikeyprefix to deal with simple inequalities in v29). I now
> think that it makes most sense to totally replace pstate.prechecked
> with _bt_skip_ikeyprefix -- we should use _bt_skip_ikeyprefix during
> every scan (not just during skip scans, not just during scans with
> SAOP array keys), and be done with it.

I just committed "Improve nbtree array primitive scan scheduling".

Attached is v30, which fully replaces the pstate.prechecked
optimization with the new _bt_skip_ikeyprefix optimization (which now
appears in v30-0002-Lower-nbtree-skip-array-maintenance-overhead.patch,
and not in 0003-*, due to my committing the primscan scheduling patch
just now).

I'm now absolutely convinced that fully generalizing
_bt_skip_ikeyprefix (as described in yesterday's email) is the right
direction to take things in. It seems to have no possible downside.

> Under this new scheme, so->scanBehind is strictly a flag that
> indicates that a recheck is scheduled, to be performed once the scan
> calls _bt_readpage for the next page. It no longer serves role #1,
> only role #2. That seems significantly simpler.

I especially like this about the new _bt_skip_ikeyprefix scheme.
Having so->scanBehind strictly be a flag (that tracks if we need a
recheck at the start of reading the next page) substantially lowers
the cognitive burden for somebody trying to understand how the
primitive scan scheduling stuff works.

The newly expanded _bt_skip_ikeyprefix needs quite a bit more testing
and polishing to be committable. I didn't even update the relevant
commit message for v30. Plus I'm not completely sure what to do about
RowCompare keys just yet, which have some funny rules when dealing
with NULLs.

--
Peter Geoghegan

Attachment Content-Type Size
v30-0004-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 5.4 KB
v30-0003-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/octet-stream 11.6 KB
v30-0002-Lower-nbtree-skip-array-maintenance-overhead.patch application/octet-stream 39.0 KB
v30-0001-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 178.9 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-25 23:45:16
Message-ID: CAH2-Wzk68UGN_9hyOmeAB6BbEagJhe6kP+PmxNC5orkeDA9hGQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 22, 2025 at 1:47 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v30, which fully replaces the pstate.prechecked
> optimization with the new _bt_skip_ikeyprefix optimization (which now
> appears in v30-0002-Lower-nbtree-skip-array-maintenance-overhead.patch,
> and not in 0003-*, due to my committing the primscan scheduling patch
> just now).

Attached is v31, which has a much-improved _bt_skip_ikeyprefix (which
I've once again renamed, this time to _bt_set_startikey).

There are some bug fixes here (nothing very interesting), and the
heuristics have been tuned to take into account the requirements of
conventional SAOP scans with dense/contiguous array keys (we should
mostly just be preserving existing performance characteristics here).

> The newly expanded _bt_skip_ikeyprefix needs quite a bit more testing
> and polishing to be committable. I didn't even update the relevant
> commit message for v30. Plus I'm not completely sure what to do about
> RowCompare keys just yet, which have some funny rules when dealing
> with NULLs.

v31 fixes all these problems, too.

Most notably, v31 differs from v30 in that it removes *both* of the
optimizations added to Postgres 17 by Alexander Korotkov's commit
e0b1ee17 -- including the pstate.firstmatch optimization. I didn't
expect things to go this way. I recently said that pstate.firstmatch
is something that can and should be kept (and v30 only removed
pstate.prechecked). Obviously, I've since changed my mind.

I changed my mind because lots of things are noticeably faster this
way, across most of my microbenchmarks. These performance validation
tests/microbenchmarks are really sensitive to the number of branches
added to _bt_check_compare; removing anything nonessential from that
hot code path really matters here.

Notably, removing the pstate.firstmatch optimization (along with
removing pstate.prechecked) is enough to fully eliminate what I've
long considered to be the most important microbenchmark regressions. I
refer to the microbenchmark suite originally written by Masahiro
Ikeda, and later enhanced/expanded by me to use a wider variety of
data cardinalities and datatypes. For the last several months, I
thought we'd need to live with a 5% - 10% regression with such cases
(those were the numbers I've thrown around when giving a high-level
summary of the extent of the regressions in unfavorable cases). Now
these microbenchmarks show that the queries are all about ~2% *faster*
instead. What's more, there may even be a similar small improvement
for important standard benchmarks (not microbenchmarks), such as the
standard pgbench SELECT benchmark. (I think simple pgbench is that
much faster, which is enough to matter, but not enough that it's easy
to prove under time pressure.)

There is at least a theoretical downside to replacing
pstate.firstmatch with the new _bt_set_startikey path, that I must
acknowledge: we only actually call _bt_set_startikey on the second or
subsequent leaf page, so that's the earliest possible point that it
can help speed things up (exactly like pstate.prechecked). Whereas
pstate.firstmatch is usually effective right away, on the first page
(effective at allowing us to avoid evaluating > or >= keys in the
common case where we're scanning the index forwards). It's fair to
wonder how much we'd be missing out, by giving up on that advantage.
It's very difficult to actually see any benefit that can be tied to
that theoretical advantage for pstate.firstmatch, though.

What I actually see when I run my range microbenchmark suite with v31
(not the aforementioned main microbenchmark suite) is that simple
cases involving no skip arrays (only simple scalar inequalities), with
quals like "WHERE col BETWEEN 0 and 1_000_000" are now *also* about 2%
faster. Any slowdown is more than made up for. Granted, if I worked
hard enough I might find a counter-example, where it actually is
slower overall, but I just can't see that ever mattering enough to
make me reconsider getting rid of pstate.firstmatch.

--
Peter Geoghegan

Attachment Content-Type Size
v31-0002-Enhance-nbtree-tuple-scan-key-optimizations.patch application/octet-stream 45.4 KB
v31-0003-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/octet-stream 11.7 KB
v31-0004-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 5.4 KB
v31-0001-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 179.9 KB

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-27 22:03:53
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On 26.03.2025 02:45, Peter Geoghegan wrote:
> On Sat, Mar 22, 2025 at 1:47 PM Peter Geoghegan<pg(at)bowt(dot)ie> wrote:
>> Attached is v30, which fully replaces the pstate.prechecked
>> optimization with the new _bt_skip_ikeyprefix optimization (which now
>> appears in v30-0002-Lower-nbtree-skip-array-maintenance-overhead.patch,
>> and not in 0003-*, due to my committing the primscan scheduling patch
>> just now).
> Attached is v31, which has a much-improved _bt_skip_ikeyprefix (which
> I've once again renamed, this time to _bt_set_startikey).
>
I reviewed your first patch and noticed that you added the ability to
define new quals when the first column isn't used in the query.

I tried an example like this:

CREATE TABLE sales (
    id SERIAL PRIMARY KEY,
    region TEXT,
    product TEXT,
    year INT
);

INSERT INTO sales (region, product, year)
SELECT
    CASE WHEN i % 4 <= 1 THEN 'North'
         WHEN i % 4 <= 2 THEN 'South'
         WHEN i % 4 <= 3 THEN 'East'
         ELSE 'West' END,
    CASE WHEN random() > 0.5 THEN 'Laptop' ELSE 'Phone' END,
    2023
FROM generate_series(1, 100000) AS i;

vacuum analyze;

CREATE INDEX sales_idx ON sales (region, product, year);

EXPLAIN ANALYZE SELECT * FROM sales WHERE product = 'Laptop' AND year = 2023;

master gives the query plan with SeqScan:

                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on sales  (cost=0.00..2137.00 rows=49703 width=19) (actual time=0.035..31.438 rows=50212.00 loops=1)
   Filter: ((product = 'Laptop'::text) AND (year = 2023))
   Rows Removed by Filter: 49788
   Buffers: shared hit=637
 Planning:
   Buffers: shared hit=35
 Planning Time: 0.695 ms
 Execution Time: 33.942 ms
(8 rows)

Your patch sets fair costs for it, which helps the planner take the
index scan path into account, and in my opinion it is perfect!)

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sales  (cost=652.46..2031.86 rows=49493 width=19) (actual time=18.039..33.723 rows=49767.00 loops=1)
   Recheck Cond: ((product = 'Laptop'::text) AND (year = 2023))
   Heap Blocks: exact=637
   Buffers: shared hit=642 read=50
   ->  Bitmap Index Scan on sales_idx  (cost=0.00..640.09 rows=49493 width=0) (actual time=17.756..17.756 rows=49767.00 loops=1)
         Index Cond: ((product = 'Laptop'::text) AND (year = 2023))
         Index Searches: 4
         Buffers: shared hit=5 read=50
 Planning:
   Buffers: shared hit=55 read=1
 Planning Time: 0.984 ms
 Execution Time: 36.940 ms

I think it would be useful to show that we used an index scan while
skipping the "region" column, and I assume we should also output how
many distinct values the "region" column had.

For example, it could look like this: "Skip Scan on region (4 distinct
values)":

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sales  (cost=652.46..2031.86 rows=49493 width=19) (actual time=18.039..33.723 rows=49767.00 loops=1)
   Recheck Cond: ((product = 'Laptop'::text) AND (year = 2023))
   Heap Blocks: exact=637
   Buffers: shared hit=642 read=50
   ->  Bitmap Index Scan on sales_idx  (cost=0.00..640.09 rows=49493 width=0) (actual time=17.756..17.756 rows=49767.00 loops=1)
         Index Cond: ((product = 'Laptop'::text) AND (year = 2023))
         Skip Scan on region (4 distinct values)
         Index Searches: 4
         Buffers: shared hit=5 read=50
 Planning:
   Buffers: shared hit=55 read=1
 Planning Time: 0.984 ms
 Execution Time: 36.940 ms

What do you think?

I didn't see any regression tests. Maybe we should add some tests? To be
honest, I didn't see them mentioned in the commit message, but I might
have missed something.

--
Regards,
Alena Rybakina
Postgres Professional


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-28 20:15:19
Message-ID: CAH2-WzmL_Ash+pfBPO29k3GqL1wJ7j9=WttSy2KHdngegkp1Ew@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 27, 2025 at 6:03 PM Alena Rybakina
<a(dot)rybakina(at)postgrespro(dot)ru> wrote:
> I replied an example like this:

This example shows costs that are dominated by heap access costs. Both
the sequential scan and the bitmap heap scan must access 637 heap
blocks. So I don't think that this is such a great example -- the heap
accesses are irrelevant from the point of view of assessing how well
we're modelling index scan related costs.

> I think it would be useful to show that we used an index scan while skipping the "region" column, and I assume we should also output how many distinct values the "region" column had.
>
> For example, it could look like this: "Skip Scan on region (4 distinct values)":

> What do you think?

As I said on our call today, I think that we should keep the output
for EXPLAIN ANALYZE simple. While I'm sympathetic to the idea that we
should show more information about how quals can be applied in index
scan node output, that seems like it should be largely independent
work to me.

Masahiro Ikeda wrote a patch that aimed to improve matters in this
area some months back. I'm supportive of that (there is definitely
value in signalling to users that the index might actually look quite
different to how they imagine it looks, say by having an
omitted-by-query prefix attribute). I don't exactly know what the most
useful kind of information to show is with skip scan in place, since
skip scan makes the general nature of quals (whether a given qual is
what Oracle calls an "access predicate", or what Oracle calls a
"filter predicate") squishy/dynamic, in a way that is new.

The relationship between the number of values that a skip array ever
uses, and the number of primitive index scans is quite complicated.
Sometimes it is actually as simple as your example query, but that's
often not true. "Index Searches: N" can be affected by:

* The use of SAOP arrays, which also influence primitive scan
scheduling, in the same way as they have since Postgres 17 -- and can
be mixed freely with skip arrays.

* The availability of opclass skipsupport, which makes skip arrays
generate their element values by addition/subtraction from the current
array element, rather than using NEXT/PRIOR sentinel keys.

The sentinel keys act as probes that find the next real (non-sentinel)
value that we need to look up. Whereas skip support can often
successfully guess that (for example) the next value in the index
after 268 is 269, saving a primitive scan each time (this might not
happen at all, or it might work only some of the time, or it might
work all of the time). See the sketch after this list.

* Various primitive index scan scheduling heuristics.
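
To make the skip support point concrete, here's a hypothetical int4
sketch (the real skip support API differs in its details; this only
illustrates the idea):

#include "postgres.h"

static bool
int4_skip_increment(int32 existing, int32 *next)
{
    /*
     * Derive the next candidate array element directly (e.g. 269
     * follows 268), rather than probing the index with a NEXT sentinel
     * key to discover it.
     */
    if (existing == PG_INT32_MAX)
        return false;       /* overflow: no next value exists */

    *next = existing + 1;
    return true;
}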

Another concern here is that I don't want to invent a special kind of
"index search" just for skip scan. We're going to show an "Index
Searches: N" that's greater than 1 with SAOP array keys, too -- which
don't use skip scan at all (nothing new about that, except for the
fact that we report the number of searches directly from EXPLAIN
ANALYZE in Postgres 18). There really is almost no difference between
a scan with a skip array and a scan of the same index with a similar
SAOP array (when each array "contains the same elements", and is used
to scan the same index, in the same way). That's why the cost model is
as similar as possible to the Postgres 17 costing of SAOP array scans
-- it's really the same access method. Reusing the cost model makes
sense because actual execution times are almost identical when we
compare a skip array to a SAOP array in the way that I described.

The only advantage that I see from putting something about "skip scan"
in EXPLAIN ANALYZE is that it is more googleable that way. But it
seems like "Index Searches: N" is almost as good, most of the time. In
any case, the fact that we don't need a separate optimizer index
path/executor node for this is something that I see as a key
advantage, and something that I'd like EXPLAIN ANALYZE to preserve.

The problem with advertising that an index scan node is a skip scan
is: what if it just never skips? Never skipping like this isn't
necessarily unexpected. And even if it is unexpected, it's not
necessarily a problem.

> I didn't see any regression tests. Maybe we should add some tests? To be honest I didn't see it mentioned in the commit message but I might have missed something.

There are definitely new regression tests -- I specifically tried to
keep the test coverage high, using gcov html reports (like the ones
from coverage.postgresql.org). The test updates appear towards the end
of the big patch file, though. Maybe you're just not used to seeing
tests appear last like this?

I use "git config diff.orderfile ... " to get this behavior. I find it
useful to put the important changes (particularly header file changes)
first, and less important changes (like tests) much later.

Thanks for taking a look at my patch!
--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-28 21:59:22
Message-ID: CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 25, 2025 at 7:45 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v31, which has a much-improved _bt_skip_ikeyprefix (which
> I've once again renamed, this time to _bt_set_startikey).

Attached is v32, which has very few changes, but does add a new patch:
a patch that adds skip-array-specific primitive index scan heuristics
to _bt_advance_array_keys (this is
v32-0003-Improve-skip-scan-primitive-scan-scheduling.patch).

I feel compelled to do something about cases like the ones highlighted
by the attached test case, heikki-testcase-variant.sql. This shows how
the new heuristic added by my recent commit 9a2e2a28 can still
sometimes fail to ever be reached, in cases rather like the ones that
that was supposed to address [1]. This test case I'm posting now is
based on Heikki's adversarial test case (the one he posted back in
January). I've pushed it a bit further, demonstrating a remaining need
to tweak the primscan scheduling heuristics in light of skip scan. If
you run heikki-testcase-variant.sql against a patched server, without
v32-0003-Improve-skip-scan-primitive-scan-scheduling.patch, you'll see
exactly what I'm concerned about.

I am somewhat breaking my own rule about not inventing heuristics that
specifically care about which type of array (skip vs SAOP array) is
involved here. In my defense, the new heuristic isn't particularly
likely to influence primscan scheduling. It will seldom be needed or
have any noticeable influence, but just might be crucial with cases
like the one from the test case. It seems a little too hard for skip
scans to actually get the behavior from commit 9a2e2a28 -- which is
something that we really shouldn't leave to chance.

The plan around committing this work hasn't changed: I still intend to
commit everything next Wednesday or Thursday. I hope to get a review of
the new scan heuristic before then, but it's a small adjunct to what I
did in commit 9a2e2a28, so I'm not too concerned about it adding risk.

Thanks

[1] https://postgr.es/m/[email protected]

--
Peter Geoghegan

Attachment Content-Type Size
heikki-testcase-variant.sql application/sql 1.5 KB
v32-0004-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/x-patch 11.7 KB
v32-0003-Improve-skip-scan-primitive-scan-scheduling.patch application/x-patch 7.1 KB
v32-0002-Enhance-nbtree-tuple-scan-key-optimizations.patch application/x-patch 45.5 KB
v32-0001-Add-nbtree-skip-scan-optimizations.patch application/x-patch 182.1 KB
v32-0005-DEBUG-Add-skip-scan-disable-GUCs.patch application/x-patch 5.4 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 02:02:25
Message-ID: CAH2-WznK55udrFQm4umLjOmxcBh8k_5Ybxmt1_rXSYW9N8j64A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 28, 2025 at 5:59 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v32, which has very few changes, but does add a new patch:
> a patch that adds skip-array-specific primitive index scan heuristics
> to _bt_advance_array_keys (this is
> v32-0003-Improve-skip-scan-primitive-scan-scheduling.patch).

Attached is v33, which:

* Polishes the recently introduced
0003-Improve-skip-scan-primitive-scan-scheduling.patch logic, and
tweaks related code in _bt_advance_array_keys.

This includes one bug fix affecting backwards scans, which no longer
reuse pstate.skip to stop reading tuples when so->scanBehind was set.
Now _bt_readpage's backwards scan loop tests so->scanBehind directly
and explicitly instead (there's no such change to its forwards scan
loop, though).

In v32 we could accidentally set pstate.skip (which is mostly intended
to be used by the "look-ahead" optimization added by the Postgres 17
SAOP project) to 0 (InvalidOffsetNumber), which actually represents
"don't skip at all" -- that was wrong. This happened when the
"pstate.skip = pstate.minoffnum - 1" statement gave us
InvalidOffsetNumber because pstate.minoffnum was already 1
(FirstOffsetNumber).

(FWIW, this only affected backwards scans at the point that they read
the index's rightmost leaf page, since any other leaf page has to have
a high key at FirstOffsetNumber, which implies a pstate.minoffnum of
FirstOffsetNumber+1, which meant we set pstate.skip = 1
(FirstOffsetNumber) by subtraction, which accidentally failed to
fail).
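
To make the arithmetic concrete, here's a standalone sketch of the
off-by-one (the typedef and macros mirror the usual PostgreSQL
definitions, but this is only an illustration, not the actual nbtree
code):

```
#include <assert.h>

typedef unsigned short OffsetNumber;    /* uint16, as in PostgreSQL's off.h */

#define InvalidOffsetNumber ((OffsetNumber) 0)
#define FirstOffsetNumber   ((OffsetNumber) 1)

int
main(void)
{
    /* non-rightmost leaf page: high key at offset 1, so minoffnum is 2 */
    OffsetNumber minoffnum = FirstOffsetNumber + 1;

    assert(minoffnum - 1 == FirstOffsetNumber); /* "fails to fail" */

    /* rightmost leaf page: no high key, so minoffnum is 1 */
    minoffnum = FirstOffsetNumber;

    /* the v32 bug: this yields InvalidOffsetNumber, "don't skip at all" */
    assert(minoffnum - 1 == InvalidOffsetNumber);

    return 0;
}
```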

* v33 also makes small tweaks and comment clean-ups to the logic in
and around _bt_set_startikey, added by 0002-*, with the goal of
simplifying the code, and in particular making the possible impact of
pstate.forcenonrequired on maintenance of the scan's arrays clearer.

We must not break the rule established by the Postgres 17 SAOP work:
the scan's array keys should always track the scan's progress through
the index's key space (I refer to the rule explained by the 17 SAOP
commit's message, see commit
5bf748b86bc6786a3fc57fc7ce296c37da6564b0). While we do "temporarily
stop following that rule" within a given pstate.forcenonrequired call
to _bt_readpage (to save some cycles), that should never have lasting
side-effects; there should be no observable effect outside of
_bt_readpage itself. In other words, the use of
pstate.forcenonrequired by _bt_readpage should leave the scan's arrays
in exactly the same state as _bt_readpage would have left them had it
never used pstate.forcenonrequired mode to begin with.

(FWIW, I have no reason to believe that v32 had any bugs pertaining to
this. AFAICT it didn't actually break "the general rule established by
the Postgres 17 SAOP work", but the explanation for why that was so
was needlessly complicated.)

I'm now very close to committing everything. Though I do still want
another pair of eyes on the newer
0003-Improve-skip-scan-primitive-scan-scheduling.patch stuff before
committing (since I still intend to commit all the remaining patches
together).

--
Peter Geoghegan

Attachment Content-Type Size
v33-0003-Improve-skip-scan-primitive-scan-scheduling.patch application/octet-stream 11.1 KB
v33-0004-Apply-low-order-skip-key-in-_bt_first-more-often.patch application/octet-stream 11.7 KB
v33-0005-DEBUG-Add-skip-scan-disable-GUCs.patch application/octet-stream 5.4 KB
v33-0001-Add-nbtree-skip-scan-optimizations.patch application/octet-stream 182.6 KB
v33-0002-Enhance-nbtree-tuple-scan-key-optimizations.patch application/octet-stream 44.7 KB

From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 13:26:13
Message-ID: CAJ7c6TNArZC2tAG5NTzahT88cdsQadm6qOVtuG17QGJD7QF5Qw@mail.gmail.com
Lists: pgsql-hackers

Hi Peter,

> I'm now very close to committing everything. Though I do still want
> another pair of eyes on the newer
> 0003-Improve-skip-scan-primitive-scan-scheduling.patch stuff before
> committing (since I still intend to commit all the remaining patches
> together).

Can you think of any tests specifically for 0003, or is relying on the
added Asserts() the best we can do? Same question for 0002.

I can confirm that v33 applies and passes the tests.

0002 adds _bt_set_startikey() to nbtutils.c, but it is not well covered
by tests; many branches of the new code are never executed.

```
@@ -2554,9 +2865,20 @@ _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
*/
if (requiredSameDir)
*continuescan = false;
+ else if (unlikely(key->sk_flags & SK_BT_SKIP))
+ {
+ /*
+ * If we're treating scan keys as nonrequired, and encounter a
+ * skip array scan key whose current element is NULL, then it
+ * must be a non-range skip array
+ */
+ Assert(forcenonrequired && *ikey > 0);
+ return _bt_advance_array_keys(scan, NULL, tuple, tupnatts,
+ tupdesc, *ikey, false);
+ }
```

This branch is also never executed during the test run.

In 0003:

```
@@ -2006,6 +2008,10 @@ _bt_advance_array_keys(IndexScanDesc scan,
BTReadPageState *pstate,
else if (has_required_opposite_direction_only && pstate->finaltup &&
unlikely(!_bt_oppodir_checkkeys(scan, dir, pstate->finaltup)))
{
+ /*
+ * Make sure that any SAOP arrays that were not marked required by
+ * preprocessing are reset to their first element for this direction
+ */
_bt_rewind_nonrequired_arrays(scan, dir);
goto new_prim_scan;
}
```

This branch is never executed either. That said, technically there
is no new code here.

For your convenience I uploaded a complete HTML code coverage report
(~36 MB) [1].

[1]: https://drive.google.com/file/d/1breVpHapvJLtw8AlFAdXDAbK8ZZytY6v/view?usp=sharing

--
Best regards,
Aleksander Alekseev


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 14:37:50
Message-ID: CAH2-WzmrpzzxXmRVsHQB=Y4x_gh8Y0ed24_xRq=djgBv+HT+vA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 1, 2025 at 9:26 AM Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
> Can you think of any tests specifically for 0003, or is relying on the
> added Asserts() the best we can do? Same question for 0002.

There are many more tests than those that are included in the patch. I
wrote and debugged all patches using TDD. This is also how I wrote the
Postgres 17 SAOP patch. (My test suite is actually a greatly expanded
version of the earlier SAOP patch's test suite, since this project is
a direct follow-up.)

These tests are not nearly stable enough to commit; they have
something like a 5% chance of spuriously failing when I run them,
generally due to concurrent autoanalyze activity that prevents VACUUM
from setting all VM bits. Almost all of the tests expect specific
index page accesses, which are captured by "Buffers:" output. I find
that the full test suite takes over 10 seconds to run with a debug
build. So they're really not too maintainable.

Every little behavioral detail has been memorialized by some test.
Many of the tests present some adversarial scenario where
_bt_advance_array_keys runs out of array keys before reaching the end
of a given page, where it must then decide how to get to the next leaf
page (the page whose key space matches the next distinct set of array
keys). We want to consistently make good decisions about whether we
should step to the next sibling page, or whether starting another
primscan to get to the next relevant leaf page makes more sense. That
comes up again and again (only a minority of these tests were written
to test general correctness).

I did write almost all of these tests before writing the code that
they test, but most of the tests never failed by giving wrong answers
to queries. They initially failed by not meeting some expectation that
I had in mind about how best to schedule primitive index scans.

What "the right decision" means when scheduling primitive index scans
is somewhat open to interpretation -- I do sometimes revise my opinion
in light of new information, or new requirements (SAOP scans have
basically the same issues as skip scans, but in practice skip scans
are more sensitive to these details). It's inherently necessary to
manage the uncertainty around which approach is best, in any given
situation, on any given page. Having a large and varied set of
scenarios seems to help to keep things in balance (it avoids unduly
favoring one type of scan or index over another).

> I can confirm that v33 applies and passes the test.
>
> 0002 adds _bt_set_startikey() to nbtutils.c, but it is not well covered
> by tests; many branches of the new code are never executed.

It's hard to get test coverage for these cases into the standard
regression tests, since the function is only called on the second or
subsequent page of a given primscan -- you need quite a few relatively
big indexes. There are quite a few far-removed implementation details
that any test will inevitably have to rely on, such as BLCKSZ and the
influence of suffix truncation.

Just having test coverage of these branches wouldn't test much on its
own. In practice it's very likely that not testing the key directly
(incorrectly assuming that the _bt_keep_natts_fast precheck is enough,
just removing the uncovered branches) won't break queries at all. The
_bt_keep_natts_fast precheck tells us which columns have no more than
one distinct value on the page. If there's only one value, and if we
just left a page that definitely satisfied its keys, it's quite likely
that those same keys will also be satisfied on this new page (it's
just not certain, which is why _bt_set_startikey can't just rely on
the precheck; it has to test each key separately).
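
To illustrate the precheck's job with a toy model (int arrays stand in
for index tuples here; this is only a sketch of the idea, not the real
_bt_keep_natts_fast, which works with IndexTuples and opclass
comparisons):

```
#include <stdio.h>

/*
 * Toy precheck: return the 1-based attribute number of the first column
 * where the page's first and last tuples differ.  Every column before
 * that point has exactly one distinct value on the page.
 */
static int
first_changing_attnum(const int *firsttup, const int *lasttup, int natts)
{
    for (int att = 0; att < natts; att++)
    {
        if (firsttup[att] != lasttup[att])
            return att + 1;
    }
    return natts + 1;           /* every column is equal across the page */
}

int
main(void)
{
    int         firsttup[] = {5, 1, 7};
    int         lasttup[] = {5, 9, 2};

    /* prints 2: only the first column is known to have one distinct value */
    printf("firstchangingattnum = %d\n",
           first_changing_attnum(firsttup, lasttup, 3));
    return 0;
}
```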

> ```
> @@ -2554,9 +2865,20 @@ _bt_check_compare(IndexScanDesc scan, ScanDirection dir,
> */
> if (requiredSameDir)
> *continuescan = false;
> + else if (unlikely(key->sk_flags & SK_BT_SKIP))
> + {
> + /*
> + * If we're treating scan keys as nonrequired, and encounter a
> + * skip array scan key whose current element is NULL, then it
> + * must be a non-range skip array
> + */
> + Assert(forcenonrequired && *ikey > 0);
> + return _bt_advance_array_keys(scan, NULL, tuple, tupnatts,
> + tupdesc, *ikey, false);
> + }
> ```
>
> This branch is also never executed during the test run.

FWIW I have a test case that breaks when this particular code is
removed, giving a wrong answer.

This is just another way that _bt_check_compare can find that it needs
to advance the array keys in the context of forcenonrequired mode --
it's just like the other calls to _bt_advance_array_keys from
_bt_check_compare. This particular code path is only hit when a skip
array's current element is NULL, which is unlikely to coincide with
the use of forcenonrequired mode (NULL is just another array element).
We just need another _bt_advance_array_keys callsite because of how
things are laid out in _bt_check_compare, which deals with
ISNULL-marked keys separately.

> For your convenience I uploaded a complete HTML code coverage report
> (~36 Mb) [1].

Thanks. I actually generated a similar coverage report myself
yesterday. I'm keeping an eye on this, but I don't think that it's
worth aiming for very high test coverage for these code paths. Writing
a failing test case for that one ISNULL _bt_advance_array_keys code
path was quite difficult, and required quite a big index. That isn't
going to be acceptable within the standard regression tests.

--
Peter Geoghegan


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 14:39:47
Message-ID: CAEze2Wi2qjHLd+mGnhcBLr_KYhkWAMMvGvriR3Eype13P_BROg@mail.gmail.com
Lists: pgsql-hackers

On Fri, 28 Mar 2025 at 22:59, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Tue, Mar 25, 2025 at 7:45 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Attached is v31, which has a much-improved _bt_skip_ikeyprefix (which
> > I've once again renamed, this time to _bt_set_startikey).
>
> Attached is v32

Thanks!

The review below was started for v31, then adapted to v32 when that
arrived. I haven't cross-checked this with v33.

Review for 0001:

I have some comments on the commit message, following below. _Note:
For smaller patches I would let small things like this go, but given
the complexity of the feature I think it is important to make sure
there can be no misunderstanding about how it works and why it's
correct. Hence, these comments._

> Teach nbtree composite index scans to opportunistically skip over
> irrelevant sections of composite indexes given a query with an omitted
> prefix column.

AFAIK, this is only the second reference to "composite" indexes, after
a single mention in a comment in nbtsplitloc.c's _bt_strategy. IIUC,
this refers to multi-column indexes, a term that is used more
frequently and more consistently across the code base.

FYI, I think "composite index" would be more likely to refer to GIN,
given its double-btree structure. If we are about to use this term
more often, an entry in the glossary would be in order.

> When nbtree is passed input scan keys derived from a
> query predicate "WHERE b = 5", new nbtree preprocessing steps now output
> "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.

[...] new nbtree preprocessing steps now output +the equivalent of+ [...]

> Preprocessing can freely add a skip array before or after any input
> ScalarArrayOp arrays.

This implies skip arrays won't be generated for WHERE b = 5 (a
non-SAOP scankey) or WHERE b < 3 (not SAOP either), which is probably
not intentional, so a rewording would be appreciated.

// nbtpreprocesskeys.c

> +static bool _bt_s**parray_shrink

I'd like to understand the "shrink" here, as it's not entirely clear to me.
The functions are exclusively called through dispatch in
_bt_compare_array_scankey_args, and I'd expected that naming to be
extended to these functions.

> + * This qual now becomes "WHERE x = ANY('{every possible x value}') and y = 4"

I understand what you're going for, but a reference that indexed NULLs
are still handled correctly (rather than removed by the filter) would
be appreciated, as the indicated substitution would remove NULL
values. (This doesn't have to be much more than a footnote.)

> + * Also sets *numSkipArrayKeys to # of skip arrays _bt_preprocess_array_keys
> + * caller must add to the scan keys it'll output. Caller must add this many
> + * skip arrays to each of the most significant attributes lacking any keys

I don't think that's a good description; as any value other than 0 or
1 would mean more than one skip array per such attribute, which is
obviously incorrect. I think I'd word it like:

+ * Also sets *numSkipArrayKeys to # of skip arrays _bt_preprocess_array_keys
+ * caller must add to the scan keys it'll output. Caller will have to add
+ * this many skip arrays: one for each of the most significant attributes
+ * lacking any keys that use the = strategy [...]

> +#if 0
> + /* Could be a redundant input scan key, so can't do this: */
> + Assert(inkey->sk_strategy == BTEqualStrategyNumber ||
> + (inkey->sk_flags & SK_SEARCHNULL));
> +#endif

I think this should be removed?

> +#ifdef DEBUG_DISABLE_SKIP_SCAN

I noticed this one and only reference to DEBUG_DISABLE_SKIP_SCAN. Are
you planning on keeping that around, or is this a leftover?

> + ScanKey inkey = scan->keyData + i;
> +
> + /*
> + * Backfill skip arrays for any wholly omitted attributes prior to
> + * attno_inkey
> + */
> + while (attno_skip < attno_inkey)

I don't understand why we're adding skip keys before looking at the
contents of this scankey, nor why we're backfilling skip keys based on
a now-stale value of attno_inkey. Please add some comments on how/why
this is the right approach.

> prev_numSkipArrayKeys, *numSkipArrayKeys

I'm a bit confused why we primarily operate on *numSkipArrayKeys,
rather than a local variable that we store in *numSkipArrayKeys once
we know we can generate skip keys. I.e., I'd expected something more
in line with the following snippet (incomplete code blocks):

+ if (!OidIsValid(skip_eq_ops[attno_skip - 1]))
+ {
+ /*
+ * Cannot generate a skip array for this or later attributes
+ * (input opclass lacks an equality strategy operator)
+ */
+ return numArrayKeys + *numSkipArrayKeys;
+ }
+
+ /* plan on adding a backfill skip array for this attribute */
+ loc_numSkipArrayKeys++;
+ attno_skip++;
+ }
+
+ *numSkipArrayKeys = loc_numSkipArrayKeys;

I think this is easier for the compiler to push the store operation
out of the loop (assuming it's optimizable at all; but even if it
isn't it removes the load of *numSkipArrayKeys from the hot path).

// utils/skipsupport.h, nbtutils.c

I think the increment/decrement callbacks for skipsupport should
explicitly check (by e.g. Assert) for NULL (or alternatively: same
value) returns on overflow, and the API definition should make this
explicit. The current system is quite easy to build a leaking
implementation for. Sure, there are other easy ways to break this, but
I think it's an easy API modification that makes things just that bit
safer.

A renewed review for 0002+ will arrive at a later point.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 17:31:59
Message-ID: [email protected]
Lists: pgsql-hackers

On 01.04.2025 17:39, Matthias van de Meent wrote:
> On Fri, 28 Mar 2025 at 22:59, Peter Geoghegan<pg(at)bowt(dot)ie> wrote:
>> On Tue, Mar 25, 2025 at 7:45 PM Peter Geoghegan<pg(at)bowt(dot)ie> wrote:
>>> Attached is v31, which has a much-improved _bt_skip_ikeyprefix (which
>>> I've once again renamed, this time to _bt_set_startikey).
>> Attached is v32
>> +static bool _bt_s**parray_shrink
> I'd like to understand the "shrink" here, as it's not entirely clear to me.
> The functions are exclusively called through dispatch in
> _bt_compare_array_scankey_args, and I'd expected that naming to be
> extended to these functions.
I understood _bt_skiparray_shrink() as the function that refines the
range of values that a skip array will consider, by interpreting
existing scalar inequality conditions and applying them to limit the
bounds of the skip scan. That is, I understood "shrink" to mean
narrowing the range of values that the skip array will consider during
the index scan.

--
Regards,
Alena Rybakina
Postgres Professional


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 19:01:38
Message-ID: CAH2-Wzk9gBuPxUPD6KwiDeTvBtBOXk9pQAj0Lxi4Zg8uVLTt=w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 1, 2025 at 10:40 AM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> The review below was started for v31, then adapted to v32 when that
> arrived. I haven't cross-checked this with v33.

There have been hardly any changes to 0001- in quite a while, so that's fine.

> > Teach nbtree composite index scans to opportunistically skip over
> > irrelevant sections of composite indexes given a query with an omitted
> > prefix column.
>
> AFAIK, this is only the second reference to "composite" indexes, after
> a single mention in a comment in nbtsplitloc.c's _bt_strategy. IIUC,
> this refers to multi-column indexes, which is more frequently and more
> consistently used across the code base.

I don't think it makes much difference, but sure, I'll use the
multi-column index terminology. In both code comments, and in commit
messages.

> > When nbtree is passed input scan keys derived from a
> > query predicate "WHERE b = 5", new nbtree preprocessing steps now output
> > "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
>
> [...] new nbtree preprocessing steps now output +the equivalent of+ [...]
>
> > Preprocessing can freely add a skip array before or after any input
> > ScalarArrayOp arrays.
>
> This implies skip arrays won't be generated for WHERE b = 5 (a
> non-SAOP scankey) or WHERE b < 3 (not SAOP either), which is probably
> not intentional, so a rewording would be appreciated.

I don't agree. Yes, that sentence (taken in isolation) does not make
it 100% unambiguous. But, would anybody ever actually be misled? Even
once, ever? The misinterpretation of my words that you're concerned
about is directly contradicted by the whole opening paragraph. Plus,
it just doesn't make any sense. Obviously, you yourself never actually
interpreted it that way. Right?

The paragraph that this sentence appears in is all about the various
ways in which SAOP arrays and skip arrays are the same thing, except
at the very lowest level of the _bt_advance_array_keys code. I think
that that context makes it particularly unlikely that anybody would
ever think that I mean to imply something about the ways in which
non-array keys can be composed alongside skip arrays.

I'm pushing back here because I think that there's a real cost to
using overly defensive language, aimed at some imaginary person. The
problem with catering to such a person is that, overall, the
clarifications do more harm than good. What seems to end up happening
(I speak from experience with writing overly defensive comments) is
that the superfluous clarifications are read (by actual readers) the
wrong way -- they're read as if we must have meant quite a lot more
than what we actually intended.

More generally, I feel that it's a mistake to try to interpret your
words on behalf of the reader. While it's definitely useful to try to
anticipate the ways in which your reader might misunderstand, you
cannot reasonably do the work for them. It's usually (though not
always) best to deal with anticipated points of confusion by subtly
constructing examples that suggest that some plausible wrong
interpretation is in fact wrong, without drawing attention to it.
Coming right out and telling the reader what to not think *is* an
important tool, but it should be reserved for cases where it's truly
necessary.

> // nbtpreprocesskeys.c
>
> > +static bool _bt_s**parray_shrink
>
> I'd like to understand the "shrink" here, as it's not entirely clear to me.
> The functions are exclusively called through dispatch in
> _bt_compare_array_scankey_args, and I'd expected that naming to be
> extended to these functions.

I don't see the problem? As Alena pointed out, we're shrinking the
arrays here (or are likely to), meaning that we're usually going to
eliminate some subset of array elements. It's possible that this will
happen more than once for a given array (since there could be more
than one "contradictory" key on input). An array can only shrink
within _bt_compare_array_scankey_args -- it can never have new array
elements added.

> > + * This qual now becomes "WHERE x = ANY('{every possible x value}') and y = 4"
>
> I understand what you're going for, but a reference that indexed NULLs
> are still handled correctly (rather than removed by the filter) would
> be appreciated, as the indicated substitution would remove NULL
> values. (This doesn't have to be much more than a footnote.)

Why, though? Obviously, '{every possible x value}' isn't a valid array
literal. Doesn't that establish that this isn't a 100% literal
statement of fact?

There are a handful of places where I make a similar statement (the
commit message is another one, as is selfuncs.c). I do make this same
point about NULLs being just another value in selfuncs.c, though only
because it's relevant there. I don't want to talk about NULLs here,
because they just aren't relevant to this high-level overview at the
top of _bt_preprocess_keys. We do talk about the issue of skip arrays
and IS NOT NULL constraints elsewhere in nbtree: we talk about those
issues in _bt_first (shortly after _bt_preprocess_keys is called).

Again, it comes down to what the reader might actually be confused by,
in the real world. Is it really plausible that I could ever commit a
skip scan patch that wholly forgot to deal with NULLs sensibly? Would
you personally ever think that I could make such an obvious blunder in
a committed patch? And if you did, how long would it take you to
figure out that there was no such oversight?

> > + * Also sets *numSkipArrayKeys to # of skip arrays _bt_preprocess_array_keys
> > + * caller must add to the scan keys it'll output. Caller must add this many
> > + * skip arrays to each of the most significant attributes lacking any keys
>
> I don't think that's a good description; as any value other than 0 or
> 1 would mean more than one skip array per such attribute, which is
> obviously incorrect. I think I'd word it like:
>
> + * Also sets *numSkipArrayKeys to # of skip arrays _bt_preprocess_array_keys
> + * caller must add to the scan keys it'll output. Caller will have to add
> + * this many skip arrays: one for each of the most significant attributes
> + * lacking any keys that use the = strategy [...]

I agree that it's better your way. Will fix.

> > +#if 0
> > + /* Could be a redundant input scan key, so can't do this: */
> > + Assert(inkey->sk_strategy == BTEqualStrategyNumber ||
> > + (inkey->sk_flags & SK_SEARCHNULL));
> > +#endif
>
> I think this should be removed?

Why? This is me expressing that I wish I could write this assertion,
but it won't quite work. I cannot rule out rare corner-cases involving
a contradictory pair of input keys, only one of which is a = key (the
other might be some kind of inequality, which makes this would-be
assertion not quite work). (You'll see similar "almost assertions"
from time to time, in different parts of the codebase.)

> > +#ifdef DEBUG_DISABLE_SKIP_SCAN
>
> I noticed this one and only reference to DEBUG_DISABLE_SKIP_SCAN. Are
> you planning on keeping that around, or is this a leftover?

I deliberately left this in place, just in case somebody wants to see
what happens when preprocessing stops generating skip arrays entirely.
Without this, it's not too obvious that it can just be turned off by
forcing _bt_num_array_keys to return early.

> > + ScanKey inkey = scan->keyData + i;
> > +
> > + /*
> > + * Backfill skip arrays for any wholly omitted attributes prior to
> > + * attno_inkey
> > + */
> > + while (attno_skip < attno_inkey)
>
> I don't understand why we're adding skip keys before looking at the
> contents of this scankey, nor why we're backfilling skip keys based on
> a now-stale value of attno_inkey. Please add some comments on how/why
> this is the right approach.

_bt_num_array_keys should add exactly the minimum number of skip
arrays that will allow the standard _bt_preprocess_keys logic to mark
every input scan key (copied from scan->keyData[]) as required to
continue the scan on output (when output to so->keyData[]). It just
makes sense to structure the loop in a way that adds skip arrays just
before moving on to some input scan key on some never-before-seen
index column -- that's just how this needs to work.

Very early versions of the patch added skip arrays in cases involving
inequalities that could already be marked required. That approach
could probably work, but has no advantages, and some downsides. Now we
avoid adding skip arrays given a simple case like "WHERE a BETWEEN 1
AND 10". We only want to do it in cases like "WHERE a BETWEEN 1 AND 10
AND b = 42" -- since adding a skip array on "a" is strictly necessary
to be able to mark the = key on "b" as required. In the latter case,
the loop inside _bt_num_array_keys will add a skip array on "a" once
it gets past the final "a" input key (i.e. once it sees that the next
key is the = key on "b"). In the former case, there is no key on "b",
and so we don't add any skip arrays at all (which is the correct
behavior).
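
Here's a toy model of that rule, just to make the shape of the loop
concrete (the names and representation are made up for illustration;
the real code works with scan keys, strategy numbers, and opclass
details that this ignores):

```
#include <stdio.h>

/* one entry per index column: no key, an "=" key, or only inequalities */
typedef enum
{
    KEY_NONE,
    KEY_EQ,
    KEY_INEQ
} KeyKind;

/*
 * Add the minimum number of skip arrays needed to mark every "=" key
 * required: backfill arrays for earlier columns only once a later "="
 * key is seen that needs them.
 */
static int
num_skip_arrays(const KeyKind *keys, int ncols)
{
    int         nskip = 0;
    int         pending = 0;    /* columns without "=" keys seen so far */

    for (int col = 0; col < ncols; col++)
    {
        if (keys[col] == KEY_EQ)
        {
            nskip += pending;   /* backfill for all earlier such columns */
            pending = 0;
        }
        else
            pending++;          /* only needed if a later "=" key appears */
    }
    return nskip;               /* trailing non-"=" columns get nothing */
}

int
main(void)
{
    KeyKind     between_only[] = {KEY_INEQ, KEY_NONE};  /* a BETWEEN 1 AND 10 */
    KeyKind     between_eq[] = {KEY_INEQ, KEY_EQ};      /* ... AND b = 42 */
    KeyKind     omitted_eq[] = {KEY_NONE, KEY_EQ};      /* b = 5 */

    printf("%d %d %d\n",
           num_skip_arrays(between_only, 2),    /* 0 */
           num_skip_arrays(between_eq, 2),      /* 1: range skip array on a */
           num_skip_arrays(omitted_eq, 2));     /* 1: skip array on a */
    return 0;
}
```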

BTW, this _bt_num_array_keys code was primarily tested by writing lots
of complicated cases that tickled every edge-case I could think of. I
wrote tests that relied on my nbtree scan instrumentation patch, which
can print the details of both input and output keys -- I didn't rely
on testing any runtime behavior for this (that wouldn't have worked as
well). This includes any key markings (markings such as
SK_BT_REQFWD/SK_BT_REQBKWD), which is what I expected to see on every
scan key output (barring a couple of special cases, such as the
RowCompare case). Again, that is all that the "where do we add skip
arrays?" logic in _bt_num_array_keys is concerned with.

> > prev_numSkipArrayKeys, *numSkipArrayKeys
>
> I'm a bit confused why we primarily operate on *numSkipArrayKeys,
> rather than a local variable that we store in *numSkipArrayKeys once
> we know we can generate skip keys. I.e., I'd expected something more
> in line with the following snippet (incomplete code blocks):
> I think this is easier for the compiler to push the store operation
> out of the loop (assuming it's optimizable at all; but even if it
> isn't it removes the load of *numSkipArrayKeys from the hot path).

What if I just had a local copy of numSkipArrayKeys, and copied back
into caller's arg when the function returns? We'll still need a
prev_numSkipArrayKeys under this scheme, but we won't have to set the
caller's pointer until right at the end anymore (which, I agree, seems
like it might be worth avoiding).

> I think the increment/decrement callbacks for skipsupport should
> explicitly check (by e.g. Assert) for NULL (or alternatively: same
> value) returns on overflow, and the API definition should make this
> explicit.

But the API definition *does* specifically address the opclass
author's responsibilities around NULLs? It specifically says that it's
not up to the opclass author to think about them at all.

> The current system is quite easy to build a leaking
> implementation for. Sure, there are other easy ways to break this, but
> I think it's an easy API modification that makes things just that bit
> safer.

How can I do that? The callbacks themselves (the callbacks for
functions that are called as the scan progresses) don't use the fmgr
interface.

They're simple C function pointers (rather like sort support
callbacks), and do not have any args related to NULLs. They accept a
raw Datum, which can never be a NULL. The convention followed by
similar functions that are passed a Datum that might just be a NULL is
for the function to also accept a separate "bool isnull" argument.
(Just not having such a separate bool arg is another existing
convention, and the one that I've followed here.)
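
To make that concrete, here is a hypothetical int4 increment callback
written in that style (the exact signature and overflow convention
shown here are illustrative only, not the patch's actual API):

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

/*
 * A plain C function pointer target: takes a raw Datum (which can never
 * be a NULL), and has no separate "bool isnull" argument.  NULL handling
 * is entirely up to the nbtree code, never the opclass author.
 */
static Datum
int4_increment(Datum existing, bool *overflow)
{
    int32_t     val = (int32_t) existing;

    if (val == INT32_MAX)
    {
        *overflow = true;       /* fall back on a NEXT sentinel key */
        return (Datum) 0;
    }
    *overflow = false;
    return (Datum) (uint32_t) (val + 1);
}

int
main(void)
{
    bool        overflow;
    Datum       next = int4_increment((Datum) 268, &overflow);

    /* prints 269 -- the guess that can save a whole primitive scan */
    printf("%d (overflow: %d)\n", (int) (int32_t) next, (int) overflow);
    return 0;
}
```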

> A renewed review for 0002+ will arrive at a later point.

Thanks for the review!

--
Peter Geoghegan


From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 19:08:26
Message-ID: [email protected]
Lists: pgsql-hackers

Hi! Sorry for my late feedback; I didn't have enough time because of
work and a conference over the past two days.

On 28.03.2025 23:15, Peter Geoghegan wrote:
> On Thu, Mar 27, 2025 at 6:03 PM Alena Rybakina
> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> I replied an example like this:
> This example shows costs that are dominated by heap access costs. Both
> the sequential scan and the bitmap heap scan must access 637 heap
> blocks. So I don't think that this is such a great example -- the heap
> accesses are irrelevant from the point of view of assessing how well
> we're modelling index scan related costs.
You are right, the example was not very good; to be honest, I
couldn't find a better one.
>> I think it would be useful to show information that we used an index scan but at the same time we skipped the "region" column and I assume we should output how many distinct values the "region" column had.
>>
>> For example it will look like this "Skip Scan on region (4 distinct values)":
>> What do you think?
> As I said on our call today, I think that we should keep the output
> for EXPLAIN ANALYZE simple. While I'm sympathetic to the idea that we
> should show more information about how quals can be applied in index
> scan node output, that seems like it should be largely independent
> work to me.
>
> Masahiro Ikeda wrote a patch that aimed to improve matters in this
> area some months back. I'm supportive of that (there is definitely
> value in signalling to users that the index might actually look quite
> different to how they imagine it looks, say by having an
> omitted-by-query prefix attribute).
Thank you for your insights and explanation. I agree that this is a
separate piece of work, and I’ll definitely take a closer
look at Masahiro Ikeda’s contributions.
> I don't exactly know what the most
> useful kind of information to show is with skip scan in place, since
> skip scan makes the general nature of quals (whether a given qual is
> what Oracle calls an "access predicate", or what Oracle calls a
> "filter predicate") squishy/dynamic, in a way that is new.
>
> The relationship between the number of values that a skip array ever
> uses, and the number of primitive index scans is quite complicated.
> Sometimes it is actually as simple as your example query, but that's
> often not true. "Index Searches: N" can be affected by:
>
> * The use of SAOP arrays, which also influence primitive scan
> scheduling, in the same way as they have since Postgres 17 -- and can
> be mixed freely with skip arrays.
I see that this work is really voluminous, and yes, I agree with you
that displaying information about how skip scan changes the quals (if
it changes them at all) could even follow Oracle's example. For
instance, if a new range like ANY(<every possible 'a' value>) is
introduced because the first column in the index key was skipped, it
would be useful for users to know about it. The same goes for showing
any new range that the skip array will consider during the index scan.
There are other related things too, not specific to your patch, such
as conflicting conditions, or scan boundaries derived during index
scanning -- like when "a > 1 AND a > 10" means we only need to scan
"a > 10".

You introduced that kind of optimization earlier, even before the skip
scan patch, but I think it also lacks some output in EXPLAIN.

> * The availability of opclass skipsupport, which makes skip arrays
> generate their element values by addition/subtraction from the current
> array element, rather than using NEXT/PRIOR sentinel keys.
>
> The sentinel keys act as probes that get the next real (non-sentinel)
> value that we need to look up next. Whereas skip support can often
> successfully guess that (for example) the next value in the index
> after 268 is 269, saving a primitive scan each time (this might not
> happen at all, or it might work only some of the time, or it might
> work all of the time).

To be honest, I missed your point here. If possible, could you explain
it in more detail?

> * Various primitive index scan scheduling heuristics.
>
> Another concern here is that I don't want to invent a special kind of
> "index search" just for skip scan. We're going to show an "Index
> Searches: N" that's greater than 1 with SAOP array keys, too -- which
> don't use skip scan at all (nothing new about that, except for the
> fact that we report the number of searches directly from EXPLAIN
> ANALYZE in Postgres 18).

I agree with you that this is an improvement. "Index Searches: N"
shows the number of individual index searches, but on its own it is
still not clear enough. You can additionally work out what happened
from the information about the number of scanned pages, but with large
amounts of data this is difficult.

By the way, are you planning to commit additional EXPLAIN output about
skipped pages? Together with the existing output, I think the
information about index scanning would then be sufficient for
analysis.

I am still learning to interpret this information in EXPLAIN
correctly, though. By the way, I have long wanted to ask: can you
recommend anything else to read on this topic? Maybe not in this
thread, so as not to overload it.
> There really is almost no difference between
> a scan with a skip array and a scan of the same index with a similar
> SAOP array (when each array "contains the same elements", and is used
> to scan the same index, in the same way). That's why the cost model is
> as similar as possible to the Postgres 17 costing of SAOP array scans
> -- it's really the same access method. Reusing the cost model makes
> sense because actual execution times are almost identical when we
> compare a skip array to a SAOP array in the way that I described.

Yes, I agree with that.

> The only advantage that I see from putting something about "skip scan"
> in EXPLAIN ANALYZE is that it is more googleable that way. But it
> seems like "Index Searches: N" is almost as good, most of the time. In
> any case, the fact that we don't need a separate optimizer index
> path/executor node for this is something that I see as a key
> advantage, and something that I'd like EXPLAIN ANALYZE to preserve.
I think it is worth adding "skip scan" information; without it, it is
difficult (in my opinion) to evaluate whether this index is effective
in comparison with another, looking only at the information on scanned
blocks or Index Searches. Or have I missed something?

I'm not sure that I correctly understood the point about a separate
optimizer index path/executor node. Do you mean that it's better to do
all the optimizations during scan execution rather than in the
planner? I just remember you mentioned the Goetz Graefe interview on
the call, and this was precisely his point of view, with which you
agree. Is that what you mean?

> The problem with advertising that an index scan node is a skip scan
> is: what if it just never skips? Never skipping like this isn't
> necessarily unexpected. And even if it is unexpected, it's not
> necessarily a problem.

I agree that skipping may never end up being used, but it is worth
showing information about it when it does happen.

On the other hand, we would also need to be able to determine when it
was necessary to perform the skip scan optimization but it didn't
happen. I'm not sure that it is possible to display this in EXPLAIN at
all -- it may only be visible when analyzing the resulting query plan,
namely the buffer statistics.

>> I didn't see any regression tests. Maybe we should add some tests? To be honest I didn't see it mentioned in the commit message but I might have missed something.
> There are definitely new regression tests -- I specifically tried to
> keep the test coverage high, using gcov html reports (like the ones
> from coverage.postgresql.org) The test updates appear towards the end
> of the big patch file, though. Maybe you're just not used to seeing
> tests appear last like this?
>
> I use "git config diff.orderfile ... " to get this behavior. I find it
> useful to put the important changes (particularly header file changes)
> first, and less important changes (like tests) much later.
>
> Thanks for taking a look at my patch!

Yes, indeed, everything is in place -- sorry, I didn't notice it right
away. I'll be more attentive, and I'll take note of your advice about
git; I haven't used such methods before.

--
Regards,
Alena Rybakina
Postgres Professional


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 20:16:15
Message-ID: CAEze2WgDec5gu+MQqmhhL0vxMWQT6OGXXCXxJmY3w0O6ifAGSQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 1 Apr 2025 at 21:02, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Tue, Apr 1, 2025 at 10:40 AM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > > When nbtree is passed input scan keys derived from a
> > > query predicate "WHERE b = 5", new nbtree preprocessing steps now output
> > > "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
> >
> > [...] new nbtree preprocessing steps now output +the equivalent of+ [...]
> >
> > > Preprocessing can freely add a skip array before or after any input
> > > ScalarArrayOp arrays.
> >
> > This implies skip arrays won't be generated for WHERE b = 5 (a
> > non-SAOP scankey) or WHERE b < 3 (not SAOP either), which is probably
> > not intentional, so a rewording would be appreciated.
>
> I don't agree. Yes, that sentence (taken in isolation) does not make
> it 100% unambiguous. But, would anybody ever actually be misled? Even
> once, ever? The misinterpretation of my words that you're concerned
> about is directly contradicted by the whole opening paragraph. Plus,
> it just doesn't make any sense. Obviously, you yourself never actually
> interpreted it that way. Right?

That's built on my knowledge about the internals of this patch ahead
of reading the message, not on a clean-slate interpretation.

> The paragraph that this sentence appears in is all about the various
> ways in which SAOP arrays and skip arrays are the same thing, except
> at the very lowest level of the _bt_advance_array_keys code. I think
> that that context makes it particularly unlikely that anybody would
> ever think that I mean to imply something about the ways in which
> non-array keys can be composed alongside skip arrays.
>
> I'm pushing back here because I think that there's a real cost to
> using overly defensive language, aimed at some imaginary person.

While I agree that there is such a cost, I don't think that this is
too far-fetched. Skip arrays are not just added when we have SAOP
scankeys, and I think the non-SAOP case is one of the most important
areas where this patch improves performance. Yes, this patch improves
performance for queries with only SAOPs on non-first keys, but I've
seen more non-SAOP queries where this patch would improve performance
than similar queries with SAOPs.

> > // nbtpreprocesskeys.c
> >
> > > +static bool _bt_s**parray_shrink
> >
> > I'd like to understand the "shrink" here, as it's not entirely clear to me.
> > The functions are exclusively called through dispatch in
> > _bt_compare_array_scankey_args, and I'd expected that naming to be
> > extended to these functions.
>
> I don't see the problem?

It's mostly an observation (and not a problem per se) that "compare"
(which sounds like a pure function that doesn't modify anything, e.g.
_bt_compare) is used to dispatch to "shrink" (which does sound like
it'd modify something).

> > > + * This qual now becomes "WHERE x = ANY('{every possible x value}') and y = 4"
> >
> > I understand what you're going for, but a reference that indexed NULLs
> > are still handled correctly (rather than removed by the filter) would
> > be appreciated, as the indicated substitution would remove NULL
> > values. (This doesn't have to be much more than a footnote.)

> Again, it comes down to what the reader might actually be confused by,
> in the real world. Is it really plausible that I could ever commit a
> skip scan patch that wholly forgot to deal with NULLs sensibly? Would
> you personally ever think that I could make such an obvious blunder in
> a committed patch? And if you did, how long would it take you to
> figure out that there was no such oversight?

My comment is not about your potential future actions, but rather what
any future developer or committer working on this code might think and
worry about when reading this. = ANY {} constructs *always* have NOT
NULL behaviour, just like any other operator clause that isn't "IS
NULL", so clarifying that this is only similar -and does not behave
the same in important edge cases- is IMO important. Not specifically
for you, but for any other developer trying to get a correct
understanding of how this works and why it is correct.

> > > +#if 0
> > > + /* Could be a redundant input scan key, so can't do this: */
> > > + Assert(inkey->sk_strategy == BTEqualStrategyNumber ||
> > > + (inkey->sk_flags & SK_SEARCHNULL));
> > > +#endif
> >
> > I think this should be removed?
>
> Why? This is me expressing that I wish I could write this assertion,
> but it won't quite work. I cannot rule out rare corner-cases involving
> a contradictory pair of input keys, only one of which is a = key (the
> other might be some kind of inequality, which makes this would-be
> assertion not quite work). (You'll see similar "almost assertions"
> from time to time, in different parts of the codebase.)

It is my understanding that those are mostly historical artifacts of
the code base, and not systems in active development. Their rarity
makes it difficult to put numbers on this, but IIRC at least one such
case was recently removed due to bitrot and apparent lack of use for
years.

> > > +#ifdef DEBUG_DISABLE_SKIP_SCAN
> >
> > I noticed this one and only reference to DEBUG_DISABLE_SKIP_SCAN. Are
> > you planning on keeping that around, or is this a leftover?
>
> I deliberately left this in place, just in case somebody wants to see
> what happens when preprocessing stops generating skip arrays entirely.
> Without this, it's not too obvious that it can just be turned off by
> forcing _bt_num_array_keys to return early.

Ah, I see.

> > > prev_numSkipArrayKeys, *numSkipArrayKeys
> >
> > I'm a bit confused why we primarily operate on *numSkipArrayKeys,
> > rather than a local variable that we store in *numSkipArrayKeys once
> > we know we can generate skip keys. I.e., I'd expected something more
> > in line with the following snippet (incomplete code blocks):
> > I think this is easier for the compiler to push the store operation
> > out of the loop (assuming it's optimizable at all; but even if it
> > isn't it removes the load of *numSkipArrayKeys from the hot path).
>
> What if I just had a local copy of numSkipArrayKeys, and copied back
> into caller's arg when the function returns? We'll still need a
> prev_numSkipArrayKeys under this scheme, but we won't have to set the
> caller's pointer until right at the end anymore (which, I agree, seems
> like it might be worth avoiding).

That's a nice alternative too, indeed.

> > I think the increment/decrement callbacks for skipsupport should
> > explicitly check (by e.g. Assert) for NULL (or alternatively: same
> > value) returns on overflow, and the API definition should make this
> > explicit.
>
> But the API definition *does* specifically address the opclass
> author's responsibilities around NULLs? It specifically says that it's
> not up to the opclass author to think about them at all.

Yes. What I'm suggesting is to make the contract enforceable to a
degree. If it were defined to "return (Datum) 0 on overflow", then
that could be Assert()ed; code that leaks memory could then be
detected by static analysis tools at function scope, because the
allocation didn't get returned. With the current definition, a
returned allocation is never detected, because that's not part of the
contract.

> > The current system is quite easy to build a leaking
> > implementation for. Sure, there are other easy ways to break this, but
> > I think it's an easy API modification that makes things just that bit
> > safer.
>
> How can I do that? The callbacks themselves (the callbacks for
> functions that are called as the scan progresses) don't use the fmgr
> interface.

You could Assert(inc_sk_argument == (Datum) 0) in the overflow branch?
I'm certain that (Datum) 0 is an invalid representation of a pointer,
so we know that no allocated value is returned (be it new or
pre-existing).
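
Roughly like this, callsite-side (a sketch only; the callback type and
names here are placeholders, not the actual skip support API):

```
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

typedef Datum (*IncDecFunc) (Datum existing, bool *overflow);

/*
 * With a "return (Datum) 0 on overflow" contract, the nbtree-side
 * caller can enforce the rule, and an allocation that leaks via the
 * overflow branch becomes visible (nothing may escape via the return).
 */
static Datum
call_increment(IncDecFunc increment, Datum existing, bool *overflow)
{
    Datum       next = increment(existing, overflow);

    if (*overflow)
        assert(next == (Datum) 0);      /* the enforceable part */
    return next;
}

/* well-behaved stub: pretends every value is the type's maximum */
static Datum
stub_increment(Datum existing, bool *overflow)
{
    (void) existing;
    *overflow = true;
    return (Datum) 0;
}

int
main(void)
{
    bool        overflow;

    call_increment(stub_increment, (Datum) 42, &overflow);
    return 0;
}
```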

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 21:56:13
Message-ID: CAEze2WjDKjtsxpVaZnHDKT0de5m1C4CbWR1Q_BtVwvpkVziTEQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 1 Apr 2025 at 04:02, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Fri, Mar 28, 2025 at 5:59 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > Attached is v32, which has very few changes, but does add a new patch:
> > a patch that adds skip-array-specific primitive index scan heuristics
> > to _bt_advance_array_keys (this is
> > v32-0003-Improve-skip-scan-primitive-scan-scheduling.patch).
>
> Attached is v33

0001:

I just realised we never check whether skip keys' high/low_compare
values generate valid quals, like what you'd see generated for
WHERE a > 5 AND a < 3 AND b = 2;

This causes a performance regression in the patched version:

-> Index Only Scan using test_a_b_idx on test (cost=0.14..8.16
rows=1 width=0) (actual time=0.240..0.241 rows=0.00 loops=1)
Index Cond: ((a > 5) AND (a < 3) AND (b = 2))
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=1

As you can see in this EXPLAIN, we're doing an index search, whereas
before this patch the Index Searches attribute would've been 0 due to
the conflicting conditions.

(This came up while reviewing 0004, when I thought about doing this
key consistency check after the increment/decrement optimization of
that patch, and after looking couldn't find the skip scan bounds
consistency check at all.)

0002:

// nbtutils.c

> + * (safe even when "key->sk_attno <= firstchangingattnum")

Typo: should be "key->sk_attno >= firstchangingattnum".

I'd also refactor the final segment to something like the following,
to remove a redundant compare when the attribute we're checking is
equal between firsttup and lasttup:

+ firstdatum = index_getattr(firsttup, key->sk_attno, tupdesc,
&firstnull);
+
+ /* Test firsttup */
+ _bt_binsrch_skiparray_skey(false, ForwardScanDirection,
+ firstdatum, firstnull, array, key,
+ &result);
+ if (result != 0)
+ break; /* unsafe */
+
+ /* both attributes are equal, so no need to check lasttup */
+ if (key->sk_attno < firstchangingattnum)
+ continue;
+
+ lastdatum = index_getattr(lasttup, key->sk_attno, tupdesc, &lastnull);
+
+ /* Test lasttup */
+ _bt_binsrch_skiparray_skey(false, ForwardScanDirection,
+ lastdatum, lastnull, array, key,
+ &result);
+ if (result != 0)
+ break; /* unsafe */
+
+ /* Safe, range skip array satisfied by every tuple */

0003: LGTM

0004: LGTM, but note the current bug in 0001, which is probably best
solved with a fix that keeps this optimization in mind, too.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To:
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 22:08:55
Message-ID: CAEze2WixF=em9aja83-9gjjtLFnaLwH+06JteEWeukoiRrjU_A@mail.gmail.com
Lists: pgsql-hackers

On Tue, 1 Apr 2025 at 23:56, Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
>
> On Tue, 1 Apr 2025 at 04:02, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> >
> > On Fri, Mar 28, 2025 at 5:59 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > > Attached is v32, which has very few changes, but does add a new patch:
> > > a patch that adds skip-array-specific primitive index scan heuristics
> > > to _bt_advance_array_keys (this is
> > > v32-0003-Improve-skip-scan-primitive-scan-scheduling.patch).
> >
> > Attached is v33
>
> 0001:
>
> I just realised we never check whether skip keys' high/low_compare
> values generate valid quals, like what you'd see generated for
> WHERE a > 5 AND a < 3 AND b = 2;
>
> This causes a performance regression in the patched version:

Apparently it's not a regression, as we don't have this check in place
in the master branch. So, that's an optimization for PG19.

Sorry for the noise.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-01 23:02:54
Message-ID: CAH2-Wz=bM_PwAP926_KKhP7LTTS=TWE=jKNm1Tkp5PMvxoodqQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 1, 2025 at 5:56 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> 0002:
>
> // nbtutils.c
>
> > + * (safe even when "key->sk_attno <= firstchangingattnum")
>
> Typo: should be "key->sk_attno >= firstchangingattnum".

Good catch!

Though I think it should be "safe even when key->sk_attno >
firstchangingattnum", to highlight that the rule here is
significantly more permissive than even the nearby range skip array
case, which is still safe when (key->sk_attno == firstchangingattnum).

As I'm sure you realize, SAOP = keys and regular = keys are only safe
when "key->sk_attno < firstchangingattnum". So there are a total of 3
distinct rules about how firstchangingattnum affects whether it's safe
to advance pstate.startikey past a scan key (which of the 3 rules we
apply depends solely on the type of scan key).

In summary, simple = keys have the strictest firstchangingattnum rule,
range skip arrays/scalar inequalities have a somewhat less restrictive
rule, and non-range skip arrays have the least restrictive/most
permissive rule. As I think you understand already, it is generally
safe to set pstate.startikey to an offset that's past several earlier
simple skip arrays (against several prefix columns, all omitted from
the query) -- even when firstchangingattnum is the lowest possible
value (which is attnum 1).

> I'd also refactor the final segment to something like the following,
> to remove a redundant compare when the attribute we're checking is
> equal between firsttup and lasttup:

At one point the code did look like that, but I concluded that the
extra code wasn't really worth it. We can only save cycles within
_bt_set_startikey itself this way, which doesn't add up to much.
_bt_set_startikey is only called once per page.

Besides, in general it's fairly likely that a range skip array that
_bt_set_startikey sees won't be against a column that we already know
(from the _bt_keep_natts_fast precheck, which returns
firstchangingattnum) to only have one distinct value among all tuples
on the page.

> 0003: LGTM
>
> 0004: LGTM

Great, thanks!

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-02 23:15:33
Message-ID: CAH2-WznbeZA+59gJHzRz1GnpDRqK7G=n83ribXs4iTx=QSfeuA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 1, 2025 at 4:16 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> While I agree that there is such a cost, I don't think that this is
> too far-fetched. They are not just added when we have SAOP scankeys,
> and I think the non-SAOP case is one of the most important areas where
> this patch improves performance. Yes, this patch improves performance
> for queries with only SAOP on non-first keys, but I've seen more
> non-SAOP queries where this patch would improve performance than
> similar queries but with SAOP.

That's all likely to be true. I just don't think that the commit
message for the big commit (the part that you took issue with) said
anything that suggests otherwise.

To recap, the sentence in question says "Preprocessing can freely add
a skip array before or after any input ScalarArrayOp arrays". This is
mostly just making a high level point about the design itself -- so I
just don't get what you mean.

The "everything is an array" design is what allows skip arrays to work
with a qual like "WHERE b IN (1, 2, 3)", as you say. It's also what
allows things like "WHERE a IN (100, 500) AND c = 55" to work
efficiently, without introducing any special cases -- it works both
ways. And, a pair of skip arrays can also be composed together, in
just the same way as a pair of SAOP arrays. This all works in the same
way; _bt_advance_array_keys concepts like array roll over continue to
apply, with essentially zero changes to the high level design,
relative to Postgres 17. That's the core idea that the paragraph in
question conveys.
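
As a concrete sketch of quals that are all handled by this same
machinery (the table and index names here are made up):

create table orders (a int, b int, c int);
create index orders_a_b_c_idx on orders (a, b, c);

-- skip array added on "a", conventional SAOP array on "b":
select * from orders where b in (1, 2, 3);

-- SAOP array on "a", skip array added on "b", "=" key on "c":
select * from orders where a in (100, 500) and c = 55;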

Recall that _bt_advance_array_keys likes to think of simple scalar =
keys as a degenerate single-value array. They are the same thing, for
the purposes of rolling over the scan's arrays. We need to use a 3-way
ORDER proc for scalar scan keys for this reason.

> It's mostly as an observation (and not problem per se) that "compare"
> (which sounds like a pure function that doesn't modify anything, e.g.
> _bt_compare) is used to dispatch to "shrink" (which does sound like
> it'd modify something).

It sounds like it's modifying something because (as you must know) it
does just that. This has been the case since the Postgres 17 SAOP
patch, of course (only the use of the term "shrink" in a helper
function is new here).

I don't want to rename _bt_compare_scankey_args now (that name is well
over 20 years old). That would be what it would take to make this
consistent in the way you expect. I just don't think it matters very
much.

> My comment is not about your potential future actions, but rather what
> any future developer or committer working on this code might think and
> worry about when reading this. = ANY {} constructs *always* have NOT
> NULL behaviour, just like any other operator clause that isn't "IS
> NULL", so clarifying that this is only similar -and does not behave
> the same in important edge cases- is IMO important.

> Not specifically for you, but for any other developer trying to get a correct
> understanding of how this works and why it is correct.

How many times does it have to be clarified, though? Do I have to put
something about it anywhere I give a brief high-level description of
how skip arrays work, where it's natural to compare them to a
traditional SAOP that generates all possible matching elements?
Explaining the concepts in question is hard enough, without having to
always list all of the ways that my analogy isn't the full and literal
truth of the matter. It's already extremely obvious that it must be
far from a literal account of what happens.

> It is my understanding that those are mostly historical artifacts of
> the code base, and not systems in active development. Their rarity
> makes it difficult to put numbers on, but IIRC at least one such case
> was recently removed for bitrot and apparent lack of use in years.

It's effectively a comment (nobody is expected to ever uncomment it by
removing the "#ifdef 0"). Sometimes, comments become obsolete. It's a
trade-off.

> > What if I just had a local copy of numSkipArrayKeys, and copied back
> > into caller's arg when the function returns? We'll still need a
> > prev_numSkipArrayKeys under this scheme, but we won't have to set the
> > caller's pointer until right at the end anymore (which, I agree, seems
> > like it might be worth avoiding).
>
> That's a nice alternative too, indeed.

I'll do it that way in the committed patch. That's probably not going
to happen until Friday morning EST, to give me another full day to
work some more on the docs.

I don't see much point in posting another version of the patchset to the list.

> > But the API definition *does* specifically address the opclass
> > author's responsibilities around NULLs? It specifically says that it's
> > not up to the opclass author to think about them at all.
>
> Yes. What I'm suggesting is to make the contract enforceable to a
> degree. If it was defined to "return (Datum) 0 on overflow", then that
> could be Assert()ed; and code that does leak memory could get detected
> by static analysis tools in the function scope because the allocation
> didn't get returned, but with this definition returning an allocation
> is never detected because that's not part of the contract.

No B-Tree support function is allowed to leak memory. The same goes
for all operators. This will be far from the only time that we expect
opclass authors to get that right. This mostly works just fine,
probably because an opclass that leaked memory like this would visibly
break quite quickly.

> You could Assert(inc_sk_argument == (Datum) 0) in the oflow branch?
> I'm certain that (Datum) 0 is an invalid representation of a pointer,
> so we know that no allocated value is returned (be it new or
> pre-existing).

I just don't see what the point would be. Nothing would stop a
decrement/increment callback that needs to allocate memory from
returning 0 and then leaking memory anyway.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-02 23:32:24
Message-ID: CAH2-Wznyiub+1mdf_bk+GqMaYV25v4jpUv=FhAUk6wApP37uTg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 1, 2025 at 3:08 PM Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
> I think it would be useful to show information that we used an index scan but at the same time we skipped the "region" column and I assume we should output how many distinct values the "region" column had.
>
> For example it will look like this "Skip Scan on region (4 distinct values)":
>
> What do you think?

I don't see much value in that. We can sometimes have data skew that
makes the number of distinct values far from representative of how
many index searches were required. We can have 3 distinct prefix
column values within 90% of all leaf pages, while the remaining 10%
all have unique values. Skip scan will work quite well here (at least
compared to a traditional full index scan), but the number of distinct
values makes it look really bad.
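
As an illustration of that kind of skew (the table, column names, and
data distribution here are all made up):

create table skewed (region int, id int);
insert into skewed
select case when i % 10 < 9 then i % 3 else i end, i
from generate_series(1, 1000000) i;
create index skewed_region_id_idx on skewed (region, id);

-- ~90% of rows (and so ~90% of leaf pages) share just 3 "region"
-- values, while the remaining ~10% are all distinct. ndistinct for
-- "region" is huge, but a skip scan for a query like this one should
-- still beat a full index scan:
select count(*) from skewed where id = 12345;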

> I see that this work is really voluminous, and yes, I agree with you that displaying information
> about changed quals for skip index scans, if any, could even follow Oracle's example.
> For example, if you introduce a new range ANY(<every possible 'a' value>) due to skipping the first column
> in the index key, it will be useful for users to know. Or if you define a new range that the skip array will consider
> during the index scan. And other things related to this, not specific to your patch: for example,
> conflicting conditions, or defining boundaries during an index scan -- when a>1 and a>10,
> we need to scan only a>10.
>
> This optimization was also introduced by you earlier even before the patch on skip optimization, but I think it also lacks
> some output in the explain.

I would also like this kind of stuff to appear in EXPLAIN. It's partly
hard because so much stuff happens outside of planning.

> * The availability of opclass skipsupport, which makes skip arrays
> generate their element values by addition/subtraction from the current
> array element, rather than using NEXT/PRIOR sentinel keys.
>
> The sentinel keys act as probes that get the next real (non-sentinel)
> value that we need to look up next. Whereas skip support can often
> successfully guess that (for example) the next value in the index
> after 268 is 269, saving a primitive scan each time (this might not
> happen at all, or it might work only some of the time, or it might
> work all of the time).
>
> To be honest, I missed your point here. If possible, could you explain it in more detail?

So, many types don't (and probably can't) offer skip support. Skip
scan still works there. The most common example of this is "text".

We're still using skip arrays when skipping using (say) a text column.
Conceptually, it's still "WHERE a = ANY(<every possible 'a' value>)",
even with these continuous data types. It is useful to "pretend that
we're using a discrete data type", so that everything can work in the
usual way (remember, I hate special cases). We need to invent another
way to "increment" a text datum, that works in the same way (but
doesn't really require understanding the semantics of text, or
whatever the data type may be).
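
For instance (a sketch, with made-up names -- "text" lacks skip
support, but skip scan still applies):

create table t (a text, b int);
create index t_a_b_idx on t (a, b);

-- still conceptually "WHERE a = ANY(<every possible 'a' value>) AND
-- b = 5", but the skip array on "a" advances via NEXT/PRIOR sentinel
-- probes rather than by incrementing the current array element:
select * from t where b = 5;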

See my explanation about this here:

https://postgr.es/m/CAH2-WznKyHq_W7heu87z80EHyZepQeWbGuAZebcxZHvOXWCU-w@mail.gmail.com

See the part of my email that begins with "I think that you're
probably still a bit confused because the terminology in this area is
a little confusing. There are two ways of explaining the situation
with types like text and numeric (types that lack skip support)...."

> I agree with you, this is an improvement. "Index Searches: N" shows the number of individual index searches, but
> it is still not clear enough. Here you can additionally determine what happened based on the information about
> the number of scanned pages, but with large amounts of data this is difficult.

The benefit of using skip scan comes from all of the pages that we
*aren't* reading, that we'd usually have to read. Hard to show that.

> By the way, are you planning to commit additional output to the explain about skipped pages? I think, together with
> the previous ones, the information about index scanning would be sufficient for analysis.

Not for Postgres 18.

> Although I am still learning to understand correctly this information in the explain.
> By the way, I have long wanted to ask, maybe you can advise something else to read on this topic?
> Maybe not in this thread, so as not to overload this.

Let's talk about it off list.

> I think it is worth adding "skip scan" information, without it it is difficult in my opinion to evaluate
> whether this index is effective in comparison with another, looking only at the information on
> scanned blocks or Index search or I missed something?

I think that it's particularly worth adding something to EXPLAIN
ANALYZE that makes it obvious that the index in question might not be
what the user thinks that it is. It might be an index that does some
skipping, but is far from optimal. They might have simply overlooked
the fact that there is an "extra" column between the columns that
their query predicate actually uses, which makes the scan far slower
than what would be possible with a separate index that omits that column.

Basically, I think that the most important goal should be to try to
help the user to understand when they have completely the wrong idea
about the index. I think that it's much less important to help users
to understand exactly how well a skip scan performs, relative to some
theoretical ideal. The theoretical ideal is just too complicated.

> I'm not sure that I correctly understood about "separation optimizer index path/executor stage". Do you mean that it's better to do all the optimizations during index execution rather than during planning?

Yes. In general it's good if we can delay decisions about the scan's
behavior until runtime, when we have a more complete picture of what
the index actually looks like.

--
Peter Geoghegan


From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-04 17:44:49
Message-ID: [email protected]
Lists: pgsql-hackers

On 03.04.2025 02:32, Peter Geoghegan wrote:
> On Tue, Apr 1, 2025 at 3:08 PM Alena Rybakina<a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>> I think it would be useful to show information that we used an index scan but at the same time we skipped the "region" column and I assume we should output how many distinct values the "region" column had.
>>
>> For example it will look like this "Skip Scan on region (4 distinct values)":
>>
>> What do you think?
> I don't see much value in that. We can sometimes have data skew that
> makes the number of distinct values far from representative of how
> many index searches were required. We can have 3 distinct prefix
> column values within 90% of all leaf pages, while the remaining 10%
> all have unique values. Skip scan will work quite well here (at least
> compared to a traditional full index scan), but the number of distinct
> values makes it look really bad.
Yes, I agree — this could give a misleading impression when trying to
evaluate the effectiveness of the feature.
I’ve spent quite some time thinking about whether there’s a better way
to present this information, but I haven’t come up with a solid alternative.

To be honest, I’m starting to think that simply displaying the name of
the skipped column might be sufficient.
>> * The availability of opclass skipsupport, which makes skip arrays
>> generate their element values by addition/subtraction from the current
>> array element, rather than using NEXT/PRIOR sentinel keys.
>>
>> The sentinel keys act as probes that get the next real (non-sentinel)
>> value that we need to look up next. Whereas skip support can often
>> successfully guess that (for example) the next value in the index
>> after 268 is 269, saving a primitive scan each time (this might not
>> happen at all, or it might work only some of the time, or it might
>> work all of the time).
>>
>> To be honest, I missed your point here. If possible, could you explain it in more detail?
> So, many types don't (and probably can't) offer skip support. Skip
> scan still works there. The most common example of this is "text".
>
> We're still using skip arrays when skipping using (say) a text column.
> Conceptually, it's still "WHERE a = ANY(<every possible 'a' value>)",
> even with these continuous data types. It is useful to "pretend that
> we're using a discrete data type", so that everything can work in the
> usual way (remember, I hate special cases). We need to invent another
> way to "increment" a text datum, that works in the same way (but
> doesn't really require understanding the semantics of text, or
> whatever the data type may be).
>
> See my explanation about this here:
>
> https://postgr.es/m/CAH2-WznKyHq_W7heu87z80EHyZepQeWbGuAZebcxZHvOXWCU-w@mail.gmail.com
>
> See the part of my email that begins with "I think that you're
> probably still a bit confused because the terminology in this area is
> a little confusing. There are two ways of explaining the situation
> with types like text and numeric (types that lack skip support)...."

I understand it. I agree with you that it should be extended to other
types but I'm not sure how.

Maybe we can add an abstract iterator that helps to get the next
distinct value, adapted to the type, or
similar functions need to be added for each type. I think this topic
is also for a separate thread)

>> I agree with you, this is an improvement. "Index Searches: N" shows the number of individual index searches, but
>> it is still not clear enough. Here you can additionally determine what happened based on the information about
>> the number of scanned pages, but with large amounts of data this is difficult.
> The benefit of using skip scan comes from all of the pages that we
> *aren't* reading, that we'd usually have to read. Hard to show that.
I noticed statistics on the number of hit and read buffers, for
example "Buffers: shared hit=3 read=52" --
are you talking about this?
>> Although I am still learning to understand correctly this information in the explain.
>> By the way, I have long wanted to ask, maybe you can advise something else to read on this topic?
>> Maybe not in this thread, so as not to overload this.
> Let's talk about it off list.
Okay. Thank you)
>> I think it is worth adding "skip scan" information, without it it is difficult in my opinion to evaluate
>> whether this index is effective in comparison with another, looking only at the information on
>> scanned blocks or Index search or I missed something?
> I think that it's particularly worth adding something to EXPLAIN
> ANALYZE that makes it obvious that the index in question might not be
> what the user thinks that it is. It might be an index that does some
> skipping, but is far from optimal. They might have simply overlooked
> the fact that there is an "extra" column between the columns that
> their query predicate actually uses, which is far slower than what is
> possible with a separate index that also omits that column.
>
> Basically, I think that the most important goal should be to try to
> help the user to understand when they have completely the wrong idea
> about the index. I think that it's much less important to help users
> to understand exactly how well a skip scan performs, relative to some
> theoretical ideal. The theoretical ideal is just too complicated.
Yes, I agree — this information can be valuable, especially for those
investigating query performance issues.
In particular, for example, it helps in cases where the optimizer
chooses a suboptimal index, and it's not obvious why.

I believe it’s important to display not just what expressions were used
during planning, but also what was actually used during execution.
That information might be important when analyzing data skew, deciding
whether extended statistics are needed, and
understanding how the planner's assumptions played out.

But this is a separate thread for discussion.

--
Regards,
Alena Rybakina
Postgres Professional


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-04-29 22:51:38
Message-ID: CAH2-Wz=2WgQwyM=S_+e_oKEqNcUuZoriw+mwNUp++KsawMnS4g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 1, 2025 at 7:02 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Though I think it should be "" safe even when "key->sk_attno >
> firstchangingattnum" "", to highlight that the rule here is
> significantly more permissive than even the nearby range skip array
> case, which is still safe when (key->sk_attno == firstchangingattnum).

Mark Dilger reported a bug in commit 8a510275 on Saturday, which I
fixed in commit b75fedca from Monday. Mark's repro was a little bit
complicated, though.

Attached is a Python script that performs fuzz testing of nbtree skip
scan. It is capable of quickly finding the same bug as the one that
Mark reported. The script generates random, complicated multi-column
index scans on an (a, b, c, d) index on a test table, and verifies that
each query gives the same answer as an equivalent sequential scan
plan. This works quite well as a general smoke test. I find that if I
deliberately add somewhat plausible bugs to the code in
_bt_set_startikey, the fuzz testing script is usually able to identify
wrong answers to queries in under a minute.
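
The cross-check amounts to something like the following for each
generated predicate (just a sketch -- the script's actual table
definition and query generation are more elaborate):

create table fuzz (a int, b int, c int, d int);
create index fuzz_a_b_c_d_idx on fuzz (a, b, c, d);

-- run the query with an index scan plan...
set enable_seqscan = off;
select count(*) from fuzz where a >= 3 and b in (1, 4) and d < 7;

-- ...then with an equivalent sequential scan plan, and compare:
set enable_seqscan = on;
set enable_indexscan = off;
set enable_indexonlyscan = off;
select count(*) from fuzz where a >= 3 and b in (1, 4) and d < 7;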

I don't expect that this script will actually discover any real bugs
-- I ran it for long enough to get the sense that that was unlikely.
But it seemed like a worthwhile exercise.

--
Peter Geoghegan

Attachment Content-Type Size
fuzz_skip_scan.py text/x-python-script 20.9 KB

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-01 01:12:33
Message-ID: CAHgHdKtQMhHy5qcB3KqCcGiW-Rp8P7KzUFRa9ZMKUiv6zen7LQ@mail.gmail.com
Lists: pgsql-hackers

Peter,

moving the conversation here from "pgsql: Improve nbtree skip scan
primitive scan scheduling" thread on -committers. Attached is another
regression test for your consideration, which gives rise to the following
assertion:

TRAP: failed Assert("numSkipArrayKeys == 0"), File: "nbtpreprocesskeys.c", Line: 1859, PID: 7210
0   postgres  0x00000001032f30e0  ExceptionalCondition + 108
1   postgres  0x0000000102e83100  _bt_preprocess_keys + 6036
2   postgres  0x0000000102e87338  _bt_first + 168
3   postgres  0x0000000102e84680  btgettuple + 196
4   postgres  0x0000000102e75cdc  index_getnext_tid + 68
5   postgres  0x00000001030017a0  IndexOnlyNext + 228
6   postgres  0x0000000102fe5b2c  ExecScan + 228
7   postgres  0x0000000103011088  ExecSort + 536
8   postgres  0x0000000102fdbc68  standard_ExecutorRun + 312
9   postgres  0x00000001031bdfb8  PortalRunSelect + 236
10  postgres  0x00000001031bdbd4  PortalRun + 492
11  postgres  0x00000001031bcb18  exec_simple_query + 1292
12  postgres  0x00000001031b9d1c  PostgresMain + 1388
13  postgres  0x00000001031b59a8  BackendInitialize + 0
14  postgres  0x000000010310edd8  postmaster_child_launch + 372
15  postgres  0x000000010311303c  ServerLoop + 4948
16  postgres  0x0000000103111360  InitProcessGlobals + 0
17  postgres  0x0000000103030c00  help + 0
18  dyld      0x000000018eb17154  start + 2476

This looks sufficiently different from the assertion mentioned on the other
thread to be worth posting here.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
_skipscan2.sql application/sql 3.0 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-01 03:17:05
Message-ID: CAH2-WznDbVUkN5DzjawMrvX1+bu6c1FmrNm1c7YXduLM4K3t2g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 30, 2025 at 9:12 PM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> TRAP: failed Assert("numSkipArrayKeys == 0"), File: "nbtpreprocesskeys.c", Line: 1859, PID: 7210
> 0 postgres 0x00000001032f30e0 ExceptionalCondition + 108
> 1 postgres 0x0000000102e83100 _bt_preprocess_keys + 6036

> This looks sufficiently different from the assertion mentioned on the other thread to be worth posting here.

It is a completely unrelated issue. Fortunately, this one is harmless.
The assertion merely failed to account for the case where we
completely end the scan during preprocessing due to it having an
unsatisfiable array qual.

Pushed a fix for this just now.

Thanks

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-02 18:22:07
Message-ID: CAH2-WzmT1cKPXQCEJA4F8FrE0GAdfbXtgN2nmKPo0Cr4invGxg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 29, 2025 at 6:51 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I don't expect that this script will actually discover any real bugs
> -- I ran it for long enough to get the sense that that was unlikely.
> But it seemed like a worthwhile exercise.

A slight variant of my fuzzing Python script did in fact go on to
detect a couple of bugs.

I'm attaching a compressed SQL file with repros for 2 different bugs.
The first bug was independently detected by some kind of fuzzing
performed by Mark Dilger, reported elsewhere [1].

I'm not sure if this message will be held up in moderation (the file
is 1MB in size once compressed), so I will explain these test cases in
the next mail to the list. And, I'll post fixes for both bugs.

[1] https://www.postgresql.org/message-id/CAHgHdKsn2W=gPBmj7p6MjQFvxB+zZDBkwTSg0o3f5Hh8rkRrsA@mail.gmail.com
--
Peter Geoghegan

Attachment Content-Type Size
repro_forcenonrequired_bugs.tar.bz2 application/x-bzip2 993.8 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-02 19:04:22
Message-ID: CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 2, 2025 at 2:22 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> A slight variant of my fuzzing Python script did in fact go on to
> detect a couple of bugs.
>
> I'm attaching a compressed SQL file with repros for 2 different bugs.
> The first bug was independently detected by some kind of fuzzing
> performed by Mark Dilger, reported elsewhere [1].

Picking up from the email with the big attachment...

Both bugs are from commit 8a510275, "Further optimize nbtree search
scan key comparisons" (not the main skip scan commit). I actually
wrote code very much like the code from these patches that appeared in
certain versions of the skip scan patch series -- it was originally
supposed to be defensive hardening. This so-called hardening wasn't
kept in the final committed version because I incorrectly believed
that it wasn't necessary.

I would like to commit the first patch later today, ahead of shipping
beta1. But the second patch won't make it into beta1.

In practice the bug fixed by the first patch is more likely to affect
users, and (unlike the bug fixed by the second patch), it involves a
hard crash. The first patch prevents us from dereferencing a NULL
pointer (pstate) within _bt_advance_array_keys (unless on an
assert-enabled build, where we get an assertion failure first). It
would also be possible to fix the issue by testing if pstate itself is
not a NULL pointer in the usual defensive style, but I find the
approach taken in the first patch slightly more natural.

The second patch is more complicated, and seems like something that
I'll need to spend more time thinking about before proceeding with
commit. It has subtle behavioral implications, in that it causes the
pstate.forcenonrequired mechanism to influence when and how
_bt_advance_array_keys schedules primitive index scans in a tiny
minority of forward scan cases. I know of only 3 queries where this
happens, 2 of which are from my repro -- it's actually really hard to
find an example of this, even if you go out of your way to.

Allowing pstate.forcenonrequired to affect primscan scheduling like
this is something that I have been avoiding up until now, since avoiding
it keeps things cleaner -- but I'm starting to think that that goal isn't
important enough to force the second patch to be significantly more
complicated than what I came up with here. It's not like the
behavioral differences represent a clear regression; they're just
slightly different to what we see in cases where
pstate.forcenonrequired/pstate.ikey is forcibly not applied (e.g., by
commenting-out the calls to _bt_set_startikey made by _bt_readpage).

My approach in the second patch is to simply call _bt_start_array_keys
ahead of the finaltup call to _bt_checkkeys when
pstate.forcenonrequired, which has the merit of being relatively
simple (it's likely the simplest possible approach). I'm unwilling to
pay too much of a cost in implementation complexity just to avoid
side-effects in _bt_advance_array_keys/primscan scheduling, but maybe
I'll find that the cost isn't too high.

--
Peter Geoghegan

Attachment Content-Type Size
v1-0001-Avoid-treating-nonrequired-nbtree-keys-as-require.patch application/octet-stream 1.4 KB
v1-0002-Prevent-prematurely-nbtree-array-advancement.patch application/octet-stream 2.2 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-06 22:00:37
Message-ID: CAH2-WzkXzJajgyW-pCQ7vaDPhaT3huU+Zw_j448rpCBEsu2YOQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 2, 2025 at 3:04 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I would like to commit the first patch later today, ahead of shipping
> beta1. But the second patch won't make it into beta1.

Committed the first patch last Friday.

Attached is v2, whose 0002- bugfix patch is essentially unchanged
compared to v1 -- there are now comments explaining why RowCompare
keys cannot safely use the pstate.forcenonrequired behavior (in the
presence of a higher-order array). There is also a new 0001- patch
(not to be confused with the prior 0001- patch that I committed last
week).

I plan to commit everything in the next couple of days, barring any objections.

> The second patch is more complicated, and seems like something that
> I'll need to spend more time thinking about before proceeding with
> commit. It has subtle behavioral implications, in that it causes the
> pstate.forcenonrequired mechanism to influence when and how
> _bt_advance_array_keys schedules primitive index scans in a tiny
> minority of forward scan cases. I know of only 3 queries where this
> happens, 2 of which are from my repro -- it's actually really hard to
> find an example of this, even if you go out of your way to.

The new 0001- patch addresses these concerns of mine about
pstate.forcenonrequired affecting primscan scheduling.

It turned out that this unprincipled behavioral inconsistency was only
possible because of an inconsistency in how the recheck within
_bt_scanbehind_checkkeys works in the presence of relevant truncated
high key attributes -- an inconsistency compared to similar code
within _bt_advance_array_keys. Now (with the new 0001- patch applied),
we won't allow a scan with an "almost matching" set of array keys to
continue with reading the next page in the case where the keys merely
satisfy the next page's high key's untruncated attribute prefix values
-- we won't accept it when there's uncertainty due to other arrays
pertaining to attributes that are truncated within the next page's
high key/finaltup.

There is no reason to believe that this matters on correctness grounds,
or even on performance grounds, but it does seem like the principled
approach. We shouldn't cross more than one leaf page boundary before
resolving our uncertainty about whether or not stepping to the next
leaf page (i.e. not starting another primscan) is the right thing to
do.

Note again that this is a very narrow issue: it could only happen when
we advanced the array keys on a page to values that just so happened
be exact matches for the *next* page's high key, when there were
additional keys corresponding to truncated suffix attributes in that
same next page high key. The chances of things lining up like that are
very slim indeed. But, I find the behavioral inconsistency distracting
and unprincipled, and it's easy enough to just eliminate it
altogether. I haven't formally promised that calling _bt_set_startikey
must never affect the total number of primscans (relative to an
equivalent query/scan where we just don't call it), but that feels
like a good goal.

--
Peter Geoghegan

Attachment Content-Type Size
v2-0002-Prevent-premature-nbtree-array-advancement.patch application/octet-stream 5.4 KB
v2-0001-nbtree-tighten-up-array-recheck-rules.patch application/octet-stream 2.0 KB

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-07 00:11:36
Message-ID: CAHgHdKtz0oHcdKTerpJ74CYCB4VC0Sc_rv=4DEtjoC+epMZ8-Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 6, 2025 at 3:01 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

>
> I plan to commit everything in the next couple of days, barring any
> objections.
>

I have been using your two patches, one committed by you and the other
committed locally to my working directory, since you posted them a few days
ago. I have not encountered any of the problems that I had been
encountering previously. That's not proof, but I was hitting these
failures pretty consistently before the patches.

I have no objection.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-07 19:21:38
Message-ID: CAH2-Wzm_ArFzt9ch4qMND1EM=HmMesDiEYkcSSw9XSyO_uKBvw@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 6, 2025 at 8:11 PM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
> I have been using your two patches, one committed by you and the other committed locally to my working directory, since you posted them a few days ago. I have not encountered any of the problems that I had been encountering previously.

Cool.

Pushed both patches from v2 just now.

Thanks
--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 12:58:34
Message-ID: [email protected]
Lists: pgsql-hackers

Hi,

While doing some benchmarks to compare 17 vs. 18, I ran into a
regression that I ultimately tracked to commit 92fe23d93aa.

commit 92fe23d93aa3bbbc40fca669cabc4a4d7975e327
Author: Peter Geoghegan <pg(at)bowt(dot)ie>
Date: Fri Apr 4 12:27:04 2025 -0400

Add nbtree skip scan optimization.

The workload is very simple - pgbench scale 1 with 100 partitions, an
extra index and a custom select script (same as the other regression I
just reported, but with low client counts):

pg_ctl -D data init
pg_ctl -D data -l pg.log start

createdb test

psql test -c 'create index on pgbench_accounts(bid)'

and a custom script with a single query:

select count(*) from pgbench_accounts where bid = 0

and then simply run this for a couple client counts:

for m in simple prepared; do
for c in 1 4 32; do
pgbench -n -f select.sql -M $m -T 10 -c $c -j $c test | grep tps;
done;
done;

And the results for 92fe23d93aa and 3ba2cdaa454 (the commit prior to the
skip scan one) look like this:

mode #c 3ba2cdaa454 92fe23d93aa diff
-------------------------------------------------------
simple 1 2617 1832 70%
4 8332 6260 75%
32 11603 7110 61%
------------------------------------------------------
prepared 1 11113 3646 33%
4 25379 11375 45%
32 37319 14097 38%

The numbers are throughput, as reported by pgbench, and for this
workload, we're often losing ~50% of throughput with 92fe23d93aa.

Despite that, I'm not entirely sure how serious this is. This was meant
to be a micro-benchmark stressing the locking, but maybe it's considered
unrealistic in practice. Not sure.

I'm also not sure about the root cause, but while investigating it one
of the experiments I tried was tweaking the glibc malloc by setting

export MALLOC_TOP_PAD_=$((64*1024*1024))

which keeps a 64MB "buffer" in glibc, to reduce the amount of malloc
syscalls. And with that, the results change to this:

mode #c 3ba2cdaa454 92fe23d93aa diff
-------------------------------------------------------
simple 1 3168 3153 100%
4 9172 9171 100%
32 12425 13248 107%
------------------------------------------------------
prepared 1 11104 11460 103%
4 25481 25737 101%
32 36795 38372 104%

So the difference disappears -- what remains is essentially run-to-run
variability. The throughput actually improves a little bit for 3ba2cd.

My conclusion from this is that 92fe23d93aa ends up doing a lot of
malloc calls, and this is what causes the regression. Otherwise
setting the MALLOC_TOP_PAD_ would not help like this. But I haven't
looked at the code, and I wouldn't have guessed the query to have
anything to do with skip scan ...

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 13:33:11
Message-ID: CAH2-Wz=SABjX0AEedf7Md2aXKi0KkVVdKSTDozrpsF3Z0WV2ZQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 9, 2025 at 8:58 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> My conclusion from this is that 92fe23d93aa ends up doing a lot of
> malloc calls, and this is what causes the regression. Otherwise
> setting the MALLOC_TOP_PAD_ would not help like this. But I haven't
> looked at the code, and I wouldn't have guessed the query to have
> anything to do with skip scan ...

While it's just about plausible that added nbtree preprocessing
allocation overhead could account for this, I don't think it's
actually possible here, since, as you said, this particular query
simply isn't eligible to use a skip scan. It's impossible for any
single column index to do so.

The best guess I have is that the skip scan commit inadvertently added
planner cycles, even though this query isn't eligible to use skip scan
-- I thought that I'd avoided that, but perhaps I overlooked some
subtlety.

Can you try it again, with prepared statements? Alternatively, you
could selectively revert the changes that commit 92fe23d93aa made to
utils/adt/selfuncs.c, and then retest.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 13:42:57
Message-ID: CAH2-Wzm5jSpdh+E1Yhp=L8N04_rrOQvwk8AD_=fTgaHvRFf=yg@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 9, 2025 at 9:33 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Can you try it again, with prepared statements? Alternatively, you
> could selectively revert the changes that commit 92fe23d93aa made to
> utils/adt/selfuncs.c, and then retest.

Oh, wait, you already did try it with prepared statements.

I'm rather puzzled as to why this happens, then. I expect that nbtree
preprocessing will be able to use its usual single index column/index
key fast path here -- the "We can short-circuit most of the work if
there's just one key" path in _bt_preprocess_keys (and I expect that
_bt_num_array_keys() quickly determines that no skip arrays should be
added, preventing array preprocessing from ever really starting).

As far as I'm aware, none of the skip scan commits introduced new
palloc() overhead that would affect simple index scans like this. In
general that should only happen with index scans of a multicolumn
index that has at least one prefix column with no "=" condition. Most
individual index scans (including index scans that just use
inequalities) don't look like that.
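
For example (made-up names, two-column index):

create table t2 (a int, b int);
create index t2_a_b_idx on t2 (a, b);

-- gets a skip array on "a" (prefix column with no "=" condition):
select * from t2 where b = 5;

-- no skip array is added, since "a" already has an "=" condition:
select * from t2 where a = 1 and b = 5;

-- and a single-column index scan, like the pgbench query here, can
-- never need one:
select count(*) from pgbench_accounts where bid = 0;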

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 13:59:09
Message-ID: CAH2-WzkGcY9vJ_hEJUAzegex=9Js46wrPOVw_-GgA22+ABaWmg@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 9, 2025 at 9:42 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I'm rather puzzled as to why this happens, then. I expect that nbtree
> preprocessing will be able to use its usual single index column/index
> key fast path here -- the "We can short-circuit most of the work if
> there's just one key" path in _bt_preprocess_keys (and I expect that
> _bt_num_array_keys() quickly determines that no skip arrays should be
> added, preventing array preprocessing from ever really starting).

You've been testing commit 92fe23d9 ("Add nbtree skip scan
optimization") here, but I think you should test commit 8a510275
("Further optimize nbtree search scan key comparisons") instead. The
former commit's commit message says that there are big regressions,
which the latter commit should go on to fix. Note that these two commits
were pushed together, as a unit. All of my performance validation work
was for the patch series as a whole, not for any individual commit.

I don't actually think that this kind of scan would have been affected
by those known regressions -- since they don't use array keys. But it
is definitely true that the queries that you're looking at very much
rely on the optimization from commit 8a510275 (or its predecessor
optimization, the "pstate.prechecked" optimization). As I said, my
performance validation didn't target individual commits.

--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 14:15:03
Message-ID: [email protected]
Lists: pgsql-hackers

On 5/9/25 15:59, Peter Geoghegan wrote:
> On Fri, May 9, 2025 at 9:42 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I'm rather puzzled as to why this happens, then. I expect that nbtree
>> preprocessing will be able to use its usual single index column/index
>> key fast path here -- the "We can short-circuit most of the work if
>> there's just one key" path in _bt_preprocess_keys (and I expect that
>> _bt_num_array_keys() quickly determines that no skip arrays should be
>> added, preventing array preprocessing from ever really starting).
>
> You've been testing commit 92fe23d9 ("Add nbtree skip scan
> optimization") here, but I think you should test commit 8a510275
> ("Further optimize nbtree search scan key comparisons") instead. The
> former commit's commit message says that there are big regressions,
> which the latter commit should go on to fix. Note that these two commits
> were pushed together, as a unit. All of my performance validation work
> was for the patch series as a whole, not for any individual commit.
>
> I don't actually think that this kind of scan would have been affected
> by those known regressions -- since they don't use array keys. But it
> is definitely true that the queries that you're looking at very much
> rely on the optimization from commit 8a510275 (or its predecessor
> optimization, the "pstate.prechecked" optimization). As I said, my
> performance validation didn't target individual commits.
>

I initially compared 17 to current master, but after discovering the
regression I bisected to the actual commit. That's how I ended up with
92fe23d93aa. This is how it looks for current master (bc35adee8d7):

mode #c 3ba2cdaa454 92fe23d93aa bc35adee8d7
-------------------------------------------------------------
simple 1 2617 1832 1899
4 8332 6260 6143
32 11603 7110 7193
-------------------------------------------------------------
prepared 1 11113 3646 3655
4 25379 11375 11342
32 37319 14097 13911

There's almost no difference between bc35adee8d7 and 92fe23d93aa.

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 14:17:39
Message-ID: CAH2-Wz=XfWkSJ7+HU938j6Hfg2EXie4Sw0VvMPu_TQnxkQgSFg@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 9, 2025 at 8:58 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I'm also not sure about the root cause, but while investigating it one
> of the experiments I tried was tweaking the glibc malloc by setting
>
> export MALLOC_TOP_PAD_=$((64*1024*1024))
>
> which keeps a 64MB "buffer" in glibc, to reduce the amount of malloc
> syscalls. And with that, the results change to this:

You're sure that the problem is an increase in the number of
malloc()s? If that's what this is, then it shouldn't be too hard to
debug.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 14:22:57
Message-ID: CAH2-WzmdOETpceTr3wsTTTpu8jvNkYitw86uaGmiRZxJpL=p0A@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 9, 2025 at 9:59 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I don't actually think that this kind of scan would have been affected
> by those known regressions -- since they don't use array keys. But it
> is definitely true that the queries that you're looking at very much
> rely on the optimization from commit 8a510275 (or its predecessor
> optimization, the "pstate.prechecked" optimization). As I said, my
> performance validation didn't target individual commits.

Wait, that's not it, either. Since the index scan that you use won't
find any matching tuples at all. It should land on the leftmost leaf
page, find that there are no tuples "WHERE bid = 0", ending the scan
before it ever really began.

--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 14:51:10
Message-ID: [email protected]
Lists: pgsql-hackers

On 5/9/25 16:17, Peter Geoghegan wrote:
> On Fri, May 9, 2025 at 8:58 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> I'm also not sure about the root cause, but while investigating it one
>> of the experiments I tried was tweaking the glibc malloc by setting
>>
>> export MALLOC_TOP_PAD_=$((64*1024*1024))
>>
>> which keeps a 64MB "buffer" in glibc, to reduce the amount of malloc
>> syscalls. And with that, the results change to this:
>
> You're sure that the problem is an increase in the number of
> malloc()s? If that's what this is, then it shouldn't be too hard to
> debug.
>

No, I'm not sure. I merely speculate based on the observation that
setting the environment variable makes the issue go away.

I've seen similar problems with btree before, as it allocates fairly
large chunks of memory for BTScanOpaque.

--
Tomas Vondra


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 14:57:36
Message-ID: [email protected]
Lists: pgsql-hackers

On 5/9/25 16:22, Peter Geoghegan wrote:
> On Fri, May 9, 2025 at 9:59 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I don't actually think that this kind of scan would have been affected
>> by those known regressions -- since they don't use array keys. But it
>> is definitely true that the queries that you're looking at very much
>> rely on the optimization from commit 8a510275 (or its predecessor
>> optimization, the "pstate.prechecked" optimization). As I said, my
>> performance validation didn't target individual commits.
>
> Wait, that's not it, either. Since the index scan that you use won't
> find any matching tuples at all. It should land on the leftmost leaf
> page, find that there are no tuples "WHERE bid = 0", ending the scan
> before it ever really began.
>

I see the regression even with variants that actually match some rows.
For example if I do this:

update pgbench_accounts set bid = aid;
vacuum full;

and change the query to search for "bid = 1", I get exactly the same
behavior. Even with

update pgbench_accounts set bid = aid / 100;
vacuum full;

so that the query matches 100 rows, I get the same behavior.

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 15:55:19
Message-ID: CAH2-Wzk0AduEKWBnPVWr2v2+RZ4ZB0Sbm7iMwJBhTX=BAjVAEw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 9, 2025 at 10:57 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I see the regression even with variants that actually match some rows.
> For example if I do this:

> so that the query matches 100 rows, I get the same behavior.

That's really weird, since the index scans will no longer be cheap.
And yet whatever the overhead is still seems to be plainly visible. I
would expect whatever the underlying problem is to be completely
drowned out once the index scan had to do real work.

I wonder if it could be due to the fact that I added a new support
function, support function #6/skip support? That would have increased
the size of things like RelationData.rd_support and
RelationData.rd_supportinfo.

Note that "sizeof(FmgrInfo)" is 48 bytes. Prior to skip scan,
RelationData.rd_supportinfo would have required 48*5=240 bytes. After
skip scan, it would have required 48*6=288 bytes. Maybe 256 bytes is
some kind of critical threshold, someplace?
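
To make the bucketing concrete, here's a toy standalone model of
aset.c's power-of-2 chunk rounding (my own sketch, not the actual
freelist code):

#include <stdio.h>

/*
 * Toy model of AllocSet's power-of-2 chunk rounding. The smallest
 * chunk is 8 bytes (ALLOC_MINBITS); requests below the chunk limit
 * are rounded up to the next power of 2.
 */
static size_t
chunk_bucket(size_t request)
{
    size_t bucket = 8;

    while (bucket < request)
        bucket <<= 1;
    return bucket;
}

int
main(void)
{
    /* 5 support procs vs. 6: 48 * 5 = 240 bytes, 48 * 6 = 288 bytes */
    printf("240 B request -> %zu B chunk\n", chunk_bucket(240)); /* 256 */
    printf("288 B request -> %zu B chunk\n", chunk_bucket(288)); /* 512 */
    return 0;
}

So the request doesn't just grow by 48 bytes -- the underlying chunk
doubles from 256 bytes to 512 bytes.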

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 16:09:01
Message-ID: CAH2-Wz=LRs1uoc3Nkg5C=asCN5_LeZrObiOTvp8DvQs_KqVwAg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 9, 2025 at 11:55 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Note that "sizeof(FmgrInfo)" is 48 bytes. Prior to skip scan,
> RelationData.rd_supportinfo would have required 48*5=240 bytes. After
> skip scan, it would have required 48*6=288 bytes. Maybe 256 bytes is
> some kind of critical threshold, someplace?

Can you try it with the attached patch?

The patch disables skip support entirely, in a way that should
eliminate whatever the inherent overhead of adding a sixth support
routine to nbtree was. It does not remove skip scan itself (that
should still work with queries that are actually eligible to use skip
scan, albeit slightly less efficiently with some opclasses).

--
Peter Geoghegan

Attachment Content-Type Size
0001-Remove-nbtree-support-routine-6.patch application/octet-stream 10.7 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 16:28:49
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 5/9/25 17:55, Peter Geoghegan wrote:
> On Fri, May 9, 2025 at 10:57 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> I see the regression even with variants that actually match some rows.
>> For example if I do this:
>
>> so that the query matches 100 rows, I get the same behavior.
>
> That's really weird, since the index scans will no longer be cheap.
> And yet whatever the overhead is still seems to be plainly visible. I
> would expect whatever the underlying problem is to be completely
> drowned out once the index scan had to do real work.
>

Not sure if it matters, but this uses index-only scans, and the pages
are all-visible, so maybe it's not much more expensive.

> I wonder if it could be due to the fact that I added a new support
> function, support function #6/skip support? That would have increased
> the size of things like RelationData.rd_support and
> RelationData.rd_supportinfo.
>
> Note that "sizeof(FmgrInfo)" is 48 bytes. Prior to skip scan,
> RelationData.rd_supportinfo would have required 48*5=240 bytes. After
> skip scan, it would have required 48*6=288 bytes. Maybe 256 bytes is
> some kind of critical threshold, someplace?
>

Not sure. I did multiple tests with different queries, and it'd be a
bit strange if this was the only one affected. Not impossible, though.

The theory about crossing the 256B threshold is interesting. I've been
thinking about ALLOC_CHUNK_LIMIT = 8KB, which is what makes the
BTScanOpaque allocation expensive. But there's also
ALLOC_CHUNK_FRACTION, which is 1/4. So maybe there's a context with
maxBlockSize = 1kB? But I think most places use the size macros, and
ALLOCSET_SMALL_MAXSIZE is 8KB.
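
To spell out what I'm referring to, here's a standalone paraphrase of
the allocChunkLimit clamping loop in aset.c (the 32 byte header size
is a simplified stand-in; the other constants match my reading of the
source):

#include <stdio.h>

#define ALLOC_CHUNK_LIMIT    8192  /* hard 8KB ceiling */
#define ALLOC_CHUNK_FRACTION 4     /* limit <= roughly maxBlockSize / 4 */
#define HDR_SZ               32    /* simplified stand-in for header sizes */

/* halve the limit until a limit-sized chunk fits comfortably */
static size_t
alloc_chunk_limit(size_t max_block_size)
{
    size_t limit = ALLOC_CHUNK_LIMIT;

    while (limit + HDR_SZ > (max_block_size - HDR_SZ) / ALLOC_CHUNK_FRACTION)
        limit >>= 1;
    return limit;
}

int
main(void)
{
    printf("maxBlockSize 8MB -> allocChunkLimit %zu\n",
           alloc_chunk_limit(8 * 1024 * 1024));  /* 8192 */
    printf("maxBlockSize 8KB -> allocChunkLimit %zu\n",
           alloc_chunk_limit(8 * 1024));         /* 1024 */
    printf("maxBlockSize 1KB -> allocChunkLimit %zu\n",
           alloc_chunk_limit(1024));             /* 128 */
    return 0;
}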

regards

--
Tomas Vondra


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 16:29:54
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 5/9/25 18:09, Peter Geoghegan wrote:
> On Fri, May 9, 2025 at 11:55 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> Note that "sizeof(FmgrInfo)" is 48 bytes. Prior to skip scan,
>> RelationData.rd_supportinfo would have required 48*5=240 bytes. After
>> skip scan, it would have required 48*6=288 bytes. Maybe 256 bytes is
>> some kind of critical threshold, someplace?
>
> Can you try it with the attached patch?
>
> The patch disables skip support entirely, in a way that should
> eliminate whatever the inherent overhead of adding a sixth support
> routine to nbtree was. It does not remove skip scan itself (that
> should still work with queries that are actually eligible to use skip
> scan, albeit slightly less efficiently with some opclasses).
>

Tried, doesn't seem to affect the results at all.

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 16:36:23
Message-ID: CAH2-Wzkw-6-nz7vChxZtfOwEXxm=JSqvB=7RVOrbdc1mmi1k5Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 9, 2025 at 12:28 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Not sure if it matters, but this uses index-only scans, and the pages
> are all-visible, so maybe it's not much more expensive.

You're still going to have to scan 85 full index pages on your
pgbench_accounts.bid index -- that's pretty expensive, even with an
index-only scan.

Even if there was some kind of really obvious regression in
preprocessing (which seems almost impossible), I'd still expect it to
be drowned out for queries such as these.

> Not sure, I did multiple tests with different queries, and it'd be a bit
> strange if this was the only one affected. Not impossible.

The only thing that substantially differs between this
pgbench_accounts.bid query and traditional pgbench SELECT queries (on
pgbench_accounts.aid) is 1.) this query is quite a bit more expensive
at execution time, and 2.) that it involves the use of partitioning.

I made sure to test pgbench SELECT quite thoroughly -- that certainly
wasn't regressed. So it seems very likely to have something to do with
partitioning.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 16:38:02
Message-ID: CAH2-WznofJHRdoXUSXUMi-As-y+vmRso=ek5z5RqV6WunV2Hhw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 9, 2025 at 12:29 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Tried, doesn't seem to affect the results at all.

OK, then. I don't think that we're going to figure it out this side of
pgConf.dev. I'm already well behind on talk preparation.

--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 17:19:31
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 5/9/25 18:36, Peter Geoghegan wrote:
> On Fri, May 9, 2025 at 12:28 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> Not sure if it matters, but this uses index-only scans, and the pages
>> are all-visible, so maybe it's not much more expensive.
>
> You're still going to have to scan 85 full index pages on your
> pgbench_accounts.bid index -- that's pretty expensive, even with an
> index-only scan.
>

Not sure I understand. Why would it need to scan 85 index pages? There's
only 100 matching tuples total, spread over the 100 partitions. We'll
need to scan maybe 1 page per partition.

> Even if there was some kind of really obvious regression in
> preprocessing (which seems almost impossible), I'd still expect it to
> be drowned out for queries such as these.
>
>> Not sure, I did multiple tests with different queries, and it'd be a bit
>> strange if this was the only one affected. Not impossible.
>
> The only thing that substantially differs between this
> pgbench_accounts.bid query and traditional pgbench SELECT queries (on
> pgbench_accounts.aid) is 1.) this query is quite a bit more expensive
> at execution time, and 2.) that it involves the use of partitioning.
>
> I made sure to test pgbench SELECT quite thoroughly -- that certainly
> wasn't regressed. So it seems very likely to have something to do with
> partitioning.
>

Yeah. This type of query amplifies any overhead in the index scan,
because it does one for every partition. I haven't seen the regression
without the partitioning.

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 17:22:33
Message-ID: CAH2-WzmhSe0u1R8xF75Asam6U3t01ZdEE1z7R0C3UMKiUiSTZQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 9, 2025 at 1:19 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Not sure I understand. Why would it need to scan 85 index pages? There's
> only 100 matching tuples total, spread over the 100 partitions. We'll
> need to scan maybe 1 page per partition.

I was unclear. The thing about 85 leaf pages only applies when
partitioning isn't in use. When it is in use, each individual
partition's index has only one index leaf page. So each individual
index scan is indeed fairly inexpensive, particularly relative to
startup cost/preprocessing cost.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 17:30:01
Message-ID: CAH2-WznrJBh1YdhMQr6ETEg+GtsB=7R9=vBCUQFN4VZY25zgbA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 9, 2025 at 8:58 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> select count(*) from pgbench_accounts where bid = 0

What kind of plan are you getting? Are you sure it's index-only scans?

With 100 partitions, I get a parallel sequential scan when I run
EXPLAIN ANALYZE with this query from psql -- though only with "bid =
1". With your original "bid = 0" query I do get index-only scans.

What ends up happening (when index-only scans are used) is that we
scan only one index leaf page per partition index scanned. The
individual index-only scans don't need to scan too much (even when the
"bid = 1" variant query is forced to use index-only similar scans), so
I guess it's plausible that something like a regression in
preprocessing could be to blame, after all. As I mentioned just now,
these indexes each have only one index leaf page (the thing about 85
leaf pages only applies when partitioning isn't in use).

I find that the execution times for index-only scans with "bid = 0"
with a warm cache are:

Planning Time: 0.720 ms
Serialization: time=0.001 ms output=1kB format=text
Execution Time: 0.311 ms

Whereas the execution times for index-only scans with "bid = 1" are:

Planning Time: 0.713 ms
Serialization: time=0.001 ms output=1kB format=text
Execution Time: 16.491 ms

So you can see why I'd find it so hard to believe that any underlying
regression wouldn't at least be well hidden (by all of the other
overhead) in the case of the "bid = 1" variant query. There's no
reason to expect the absolute number of cycles added by some
hypothetical regression in preprocessing to vary among these two
variants of your count(*) query.

--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 18:07:17
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 5/9/25 19:30, Peter Geoghegan wrote:
> On Fri, May 9, 2025 at 8:58 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> select count(*) from pgbench_accounts where bid = 0
>
> What kind of plan are you getting? Are you sure it's index-only scans?
>
> With 100 partitions, I get a parallel sequential scan when I run
> EXPLAIN ANALYZE with this query from psql -- though only with "bid =
> 1". With your original "bid = 0" query I do get index-only scans.
>
> What ends up happening (when index-only scans are used) is that we
> scan only one index leaf page per partition index scanned. The
> individual index-only scans don't need to scan too much (even when the
> "bid = 1" variant query is forced to use index-only similar scans), so
> I guess it's plausible that something like a regression in
> preprocessing could be to blame, after all. As I mentioned just now,
> these indexes each have only one index leaf page (the thing about 85
> leaf pages only applies when partitioning isn't in use).
>
> I find that the execution times for index-only scans with "bid = 0"
> with a warm cache are:
>
> Planning Time: 0.720 ms
> Serialization: time=0.001 ms output=1kB format=text
> Execution Time: 0.311 ms
>
> Whereas the execution times for index-only scans with "bid = 1" are:
>
> Planning Time: 0.713 ms
> Serialization: time=0.001 ms output=1kB format=text
> Execution Time: 16.491 ms
>
> So you can see why I'd find it so hard to believe that any underlying
> regression wouldn't at least be well hidden (by all of the other
> overhead) in the case of the "bid = 1" variant query. There's no
> reason to expect the absolute number of cycles added by some
> hypothetical regression in preprocessing to vary among these two
> variants of your count(*) query.
>

Yes, I'm sure it's doing an index-only scan - did you update "bid" or
did you leave it as generated by "pgbench -i"? Because then there's
only one value "1", and it'd make sense to use a seqscan. The exact
steps I did for the "bid = 1" case are:

update pgbench_accounts set bid = aid / 100;
vacuum full;
analyze;

and then I get the proper index-only scans, with pretty much the same
behavior as for bid=0.

Also, I did some profiling and the (attached) flamegraphs confirm this.
The "slow" is on master, "fast" is on 3ba2cdaa454. Both very clearly
show IndexOnlyScan callbacks, etc. And the "slow" flamegraph also shows
a lot of time spent in malloc(), unlike the fast one.

(AFAIK the profiles for bid=0 and bid=1 look exactly the same.)

In fact, all of the malloc() calls seem to happen in AllocSetAllocLarge,
which matches the guess that something tripped over allocChunkLimit. Not
sure what, though.

regards

--
Tomas Vondra

Attachment Content-Type Size
perf-fast.svg.gz application/gzip 80.9 KB
perf-slow.svg.gz application/gzip 104.1 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 18:37:58
Message-ID: CAH2-Wzm4Y9OyGf_Oojx0Y+FSh=gcAMc43hXtyFnxq6UMnJspjQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 9, 2025 at 2:04 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Yes, I'm sure it's doing an index-only scan

Looks that way, from the pair of flame graphs you sent. Thanks for that.

> did you update "bid" or did
> you leave it as generated by "pgbench -i"?

I didn't bother with updating, or running VACUUM FULL. I did run
VACUUM ANALYZE, though (can confirm no heap accesses for the
index-only scans).

> In fact, all of the malloc() calls seem to happen in AllocSetAllocLarge,
> which matches the guess that something tripped over allocChunkLimit. Not
> sure what, though.

While there are way too many AllocSetAllocLarge calls here, I don't
think that they can be blamed on the skip scan work. Note that commit
92fe23d9 didn't even touch the BTScanOpaqueData struct. Its size did
change a bit, in other nearby commits, but it was already so large
that I don't think that it could matter here. Besides, you said that
the problem clearly starts in commit 92fe23d9.

The AllocSetAllocLarge calls that I see from gdb look like the slow
ones from your flame graph. They're for the BTScanOpaqueData
allocation, and for the BLCKSZ * 2 used by index-only scans (though
not index scans). These allocations happen once per rescan/index scan.
So, again, too many large allocations take place here, but it doesn't
look like anything that can be attributed to skip scan.

The difference shown by your flame graph is absolutely enormous --
that's *very* surprising to me. btbeginscan and btrescan go from being
microscopic to being very prominent. But skip scan simply didn't touch
either function, at all, directly or indirectly. And neither function
has really changed in any significant way in recent years. So right
now I'm completely stumped.

--
Peter Geoghegan


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 21:30:28
Message-ID: CAEze2Wg2Dnt91VrSH68SF57qxnGEUA_VdYRO_h=Cosu56-L48A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 9 May 2025 at 20:38, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Fri, May 9, 2025 at 2:04 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > Yes, I'm sure it's doing an index-only scan
>
> Looks that way, from the pair of flame graphs you sent. Thanks for that.
>
> > did you update "bid" or did
> > you leave it as generated by "pgbench -i"?
>
> I didn't bother with updating, or running VACUUM FULL. I did run
> VACUUM ANALYZE, though (can confirm no heap accesses for the
> index-only scans).
>
> > In fact, all of the malloc() calls seem to happen in AllocSetAllocLarge,
> > which matches the guess that something tripped over allocChunkLimit. Not
> > sure what, though.

> The difference shown by your flame graph is absolutely enormous --
> that's *very* surprising to me. btbeginscan and btrescan go from being
> microscopic to being very prominent. But skip scan simply didn't touch
> either function, at all, directly or indirectly. And neither function
> has really changed in any significant way in recent years. So right
> now I'm completely stumped.

I see some 60.5% of the samples under PostgresMain (35% overall) in
the "bad" flamegraph have asm_exc_page_fault on the stack, indicating
the backend(s) are hit with a torrent of continued page faults.
Notably, this is not just in btree code: ExecInitIndexOnlyScan's
components (ExecAssignExprContext,
ExecConditionalAssignProjectionInfo, ExecIndexBuildScanKeys,
ExecInitQual, etc.) are also very much affected, and none of those
call into index code. Notably, this is before any btree code is
executed in the query.

In the "good" version, asm_exc_page_fault does not show up, at all;
nor does sysmalloc.

@Tomas
Given the impact of MALLOC_TOP_PAD_, have you tested with other values
of MALLOC_TOP_PAD_?

Also, have you checked the memory usage of the benchmarked backends
before and after 92fe23d93aa, e.g. by dumping
pg_backend_memory_contexts after preparing and executing the sample
query, or through pg_get_process_memory_contexts() from another
backend?

Kind regards,

Matthias van de Meent


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-09 22:54:10
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 5/9/25 23:30, Matthias van de Meent wrote:
> ...
>> The difference shown by your flame graph is absolutely enormous --
>> that's *very* surprising to me. btbeginscan and btrescan go from being
>> microscopic to being very prominent. But skip scan simply didn't touch
>> either function, at all, directly or indirectly. And neither function
>> has really changed in any significant way in recent years. So right
>> now I'm completely stumped.
>
> I see some 60.5% of the samples under PostgresMain (35% overall) in
> the "bad" flamegraph have asm_exc_page_fault on the stack, indicating
> the backend(s) are hit with a torrent of continued page faults.
> Notably, this is not just in btree code: ExecInitIndexOnlyScan's
> components (ExecAssignExprContext,
> ExecConditionalAssignProjectionInfo, ExecIndexBuildScanKeys,
> ExecInitQual, etc.) are also very much affected, and none of those
> call into index code. Notably, this is before any btree code is
> executed in the query.
>
> In the "good" version, asm_exc_page_fault does not show up, at all;
> nor does sysmalloc.
>

Yes. Have you tried reproducing the issue? It'd be good if someone else
reproduced this independently, to confirm I'm not hallucinating.

> @Tomas
> Given the impact of MALLOC_TOP_PAD_, have you tested with other values
> of MALLOC_TOP_PAD_?
>

I tried, and it seems 4MB is sufficient for the overhead to disappear.
Perhaps some other mallopt parameters would help too, but my point was
merely to demonstrate this is malloc-related.

> Also, have you checked the memory usage of the benchmarked backends
> before and after 92fe23d93aa, e.g. by dumping
> pg_backend_memory_contexts after preparing and executing the sample
> query, or through pg_get_process_memory_contexts() from another
> backend?
>

I haven't noticed any elevated memory usage in top, but the queries are
very short, so I'm not sure how reliable that is. But if adding 4MB is
enough to make this go away, I doubt I'd notice a difference.

regards

--
Tomas Vondra


From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-10 11:14:34
Message-ID: CAEze2Wi7tDidbDVJhu=Pstb2hbUXDCxx_VAZnKSqbTMf7k8+uQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 10 May 2025 at 00:54, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 5/9/25 23:30, Matthias van de Meent wrote:
> > ...
> >> The difference shown by your flame graph is absolutely enormous --
> >> that's *very* surprising to me. btbeginscan and btrescan go from being
> >> microscopic to being very prominent. But skip scan simply didn't touch
> >> either function, at all, directly or indirectly. And neither function
> >> has really changed in any significant way in recent years. So right
> >> now I'm completely stumped.
> >
> > I see some 60.5% of the samples under PostgresMain (35% overall) in
> > the "bad" flamegraph have asm_exc_page_fault on the stack, indicating
> > the backend(s) are hit with a torrent of continued page faults.
> > Notably, this is not just in btree code: ExecInitIndexOnlyScan's
> > components (ExecAssignExprContext,
> > ExecConditionalAssignProjectionInfo, ExecIndexBuildScanKeys,
> > ExecInitQual, etc.) are also very much affected, and none of those
> > call into index code. Notably, this is before any btree code is
> > executed in the query.
> >
> > In the "good" version, asm_exc_page_fault does not show up, at all;
> > nor does sysmalloc.
> >
>
> Yes. Have you tried reproducing the issue? It'd be good if someone else
> reproduced this independently, to confirm I'm not hallucinating.
>
> > @Tomas
> > Given the impact of MALLOC_TOP_PAD_, have you tested with other values
> > of MALLOC_TOP_PAD_?
> >
>
> I tried, and it seems 4MB is sufficient for the overhead to disappear.
> Perhaps some other mallopt parameters would help too, but my point was
> merely to demonstrate this is malloc-related.
>
> > Also, have you checked the memory usage of the benchmarked backends
> > before and after 92fe23d93aa, e.g. by dumping
> > pg_backend_memory_contexts after preparing and executing the sample
> > query, or through pg_get_process_memory_contexts() from another
> > backend?
> >
>
> I haven't noticed any elevated memory usage in top, but the queries are
> very short, so I'm not sure how reliable that is. But if adding 4MB is
> enough to make this go away, I doubt I'd notice a difference.

I think I may have pinned it down, based on memory context checks and
some introspection. What follows is a bit of a ramble, with
garden-path sentences and some data tables to back it up:

Up to PG17 (and 3ba2cdaa454), the amount of data allocated in "index
info" was just small enough that a good portion of our indexes only
required one memory context block. With the increased size of btree's
per-attribute amsupportinfo, the requirements for even a single index
attribute no longer fit in this first block, requiring at least a
second mctx block. As each mctx block for "index info" is at least
1KiB large, this adds at least 30KiB of additional memory.

See the table below for an example btree index with one column:

| type (PG17)     | size  | alignment | size bucket | total + chunkhdr | remaining | mctx blocks |
|-----------------|-------|-----------|-------------|------------------|-----------|-------------|
| AllocSetContext | 200 B | 0 B       | n/a         | 200 B            | 824 B     | 1           |
| Chunk hdr       | 8 B   | 0 B       | n/a         | 8 B              | 816 B     | 1           |
| IndexAmRoutine  | 248 B | 0 B       | 256 B       | 264 B            | 552 B     | 1           |
| rd_opfamily     | 4 B   | 4 B       | 8 B         | 16 B             | 536 B     | 1           |
| rd_opcintype    | 4 B   | 4 B       | 8 B         | 16 B             | 520 B     | 1           |
| rd_support      | 4 B   | 4 B       | 8 B         | 16 B             | 504 B     | 1           |
| rd_supportinfo  | 240 B | 0 B       | 256 B       | 264 B            | 240 B     | 1           |
| rd_indcollation | 4 B   | 4 B       | 8 B         | 16 B             | 224 B     | 1           |
| rd_indoption    | 2 B   | 6 B       | 8 B         | 16 B             | 206 B     | 1           |

| type (skips)    | size  | alignment | size bucket | total + chunkhdr | remaining | mctx blocks |
|-----------------|-------|-----------|-------------|------------------|-----------|-------------|
| AllocSetContext | 200 B | 0 B       | n/a         | 200 B            | 824 B     | 1           |
| Block hdr       | 8 B   | 0 B       | n/a         | 8 B              | 816 B     | 1           |
| IndexAmRoutine  | 248 B | 0 B       | 256 B       | 264 B            | 552 B     | 1           |
| rd_opfamily     | 4 B   | 4 B       | 8 B         | 16 B             | 536 B     | 1           |
| rd_opcintype    | 4 B   | 4 B       | 8 B         | 16 B             | 520 B     | 1           |
| rd_support      | 4 B   | 4 B       | 8 B         | 16 B             | 504 B     | 1           |
| Block hdr       | 8 B   | 0 B       | n/a         | 8 B              | 1016 B    | 2           |
| rd_supportinfo  | 288 B | 0 B       | 512 B       | 520 B            | 496 B     | 2           |
| rd_indcollation | 4 B   | 4 B       | 8 B         | 16 B             | 224 B     | 1           |
| rd_indoption    | 2 B   | 6 B       | 8 B         | 16 B             | 206 B     | 1           |

Note that a new block is required to fit rd_supportinfo: it no longer
fits in the first block, because AllocSet buckets the 288 B allocation
into a larger (512 B) chunk.

If you check each backend's memory statistics for index info memory
contexts [0], you'll notice this too:

Master (with skip)
count (d73d4cfd) | total_bytes | combined_size
------------------+-------------+---------------
87 | | 215808
50 | 2048 | 102400
1 | 2240 | 2240
33 | 3072 | 101376
3 | 3264 | 9792

(commit before skip)
count (3ba2cdaa) | total_bytes | combined_size
------------------+-------------+---------------
87 | | 157696
35 | 1024 | 35840
37 | 2048 | 75776
15 | 3072 | 46080

This shows we're using 56KiB more than before.
I'm not quite sure yet where the page-fault overhead is introduced, but
I do think this is heavy smoke, and closer to the fire.

I've attached a patch that makes IndexAmRoutine a static const*,
removing it from rd_indexcxt, and returning some of the index ctx
memory usage to normal:

count (patch 1) | total_bytes | combined_size
-----------------+-------------+---------------
87 | | 171776
10 | 2048 | 20480
40 | 1024 | 40960
4 | 2240 | 8960
33 | 3072 | 101376

Another patch on top of that, which switches rd_indexcxt to a
GenerationContext (from AllocSet), sees the following improvement:

count (patch 2) | total_bytes | combined_size
------------------+-------------+---------------
87 | | 118832
22 | 1680 | 36960
11 | 1968 | 21648
50 | 1024 | 51200
4 | 2256 | 9024

Also tracked: total memctx-tracked memory usage on a fresh connection [1]:

3ba2cdaa: 2006024 / 1959 kB
Master: 2063112 / 2015 kB
Patch 1: 2040648 / 1993 kB
Patch 2: 1976440 / 1930 kB

There isn't a lot of space on master to allocate new memory before it
reaches a (standard linux configuration) 128kB boundary - only 33kB
(assuming no other memory tracking overhead). It's easy to allocate
that much, and go over, causing malloc to extend with sbrk by 128kB.
If we then get back under because all per-query memory was released,
the newly allocated memory won't have any data anymore, and will get
released again immediately (default: release with sbrk when the top
>=128kB is free), thus churning that memory area.
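
Here's a toy reproduction of that churn, outside of Postgres (my own
sketch; the exact behavior depends on heap layout, and on glibc's
default 128kB trim and mmap thresholds):

#include <stdlib.h>

/*
 * Each iteration grows the heap top by ~192kB via brk (both requests
 * are below glibc's 128kB mmap threshold), then frees it all back,
 * leaving >= 128kB free at the top -- so glibc trims the heap again.
 * Watch with "strace -e brk"; setting MALLOC_TOP_PAD_ in the
 * environment should make the churn disappear.
 */
int
main(void)
{
    for (int i = 0; i < 100000; i++)
    {
        void *a = malloc(96 * 1024);
        void *b = malloc(96 * 1024);

        free(a);
        free(b);
    }
    return 0;
}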

We may just have been lucky before, and your observation that
MALLOC_TOP_PAD_ >= 4MB fixes the issue reinforces that idea.

If patch 1 or patch 1+2 fixes this regression for you, then that's
another indication that we exceeded this threshold in a bad way.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. In ± 1 hour I'm leaving for pgconf.dev, so this will be my final
investigation update on the issue today CEST.

[0] select count(*), total_bytes, sum(total_bytes) as "combined size"
from pg_backend_memory_contexts WHERE name = 'index info' group by
rollup (2);
[1] select sum(total_bytes), pg_size_pretty(sum(total_bytes)) from
pg_backend_memory_contexts;

Attachment Content-Type Size
v1-0002-Use-Generation-contexts-for-rd_indexcxt.patch application/octet-stream 1.8 KB
v1-0001-Stop-heap-allocating-IndexAmRoutine-for-every-ind.patch application/octet-stream 43.8 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-10 14:59:01
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 5/10/25 13:14, Matthias van de Meent wrote:
> ...
>
> I've attached a patch that makes IndexAmRoutine a static const*,
> removing it from rd_indexcxt, and returning some of the index ctx
> memory usage to normal:
>
> count (patch 1) | total_bytes | combined_size
> -----------------+-------------+---------------
> 87 | | 171776
> 10 | 2048 | 20480
> 40 | 1024 | 40960
> 4 | 2240 | 8960
> 33 | 3072 | 101376
>
> Another patch on top of that, switching rd_indexcxt to
> GenerationContext (from AllocSet) sees the following improvement
>
> count (patch 2) | total_bytes | combined_size
> ------------------+-------------+---------------
> 87 | | 118832
> 22 | 1680 | 36960
> 11 | 1968 | 21648
> 50 | 1024 | 51200
> 4 | 2256 | 9024
>
> Also tracked: total memctx-tracked memory usage on a fresh connection [0]:
>
> 3ba2cdaa: 2006024 / 1959 kB
> Master: 2063112 / 2015 kB
> Patch 1: 2040648 / 1993 kB
> Patch 2: 1976440 / 1930 kB
>
> There isn't a lot of space on master to allocate new memory before it
> reaches a (standard linux configuration) 128kB boundary - only 33kB
> (assuming no other memory tracking overhead). It's easy to allocate
> that much, and go over, causing malloc to extend with sbrk by 128kB.
> If we then get back under because all per-query memory was released,
> the newly allocated memory won't have any data anymore, and will get
> released again immediately (default: release with sbrk when the top
> >=128kB is free), thus churning that memory area.
>
> We may just have been lucky before, and your observation that
> MALLOC_TOP_PAD_ >= 4MB fixes the issue reinforces that idea.
>
> If patch 1 or patch 1+2 fixes this regression for you, then that's
> another indication that we exceeded this threshold in a bad way.
>

Thanks! I think this explanation seems very plausible. I repeated the
tests and the results agree with it too. Here's what I got for the two
older commits before/after skip scan, and then 0001 and 0001+0002:

old head 0001 0001+0002
mode clients 3ba2cdaa454 99ddf8615c2 54c23341b31 9a6f6679e67
----------------------------------------------------------------------
prepared 1 10858 3534 11109 3324
4 25311 11307 25325 10928
32 38869 14194 39423 13626
----------------------------------------------------------------------
simple 1 2676 1865 2534 1883
4 8355 6140 8012 6160
32 11827 7216 12046 7322

This is the bid=0 case, the bid=1 is very similar, I'm leaving it out to
keep this simple (and because formatting those tables is tedious). A
nicer table is in the attached PDF.

Relative to 3ba2cdaa454 it looks like this:

head 0001 0001+0002
mode clients 99ddf8615c2 54c23341b31 9a6f6679e67
--------------------------------------------------------
prepared 1 33% 102% 31%
4 45% 100% 43%
32 37% 101% 35%
--------------------------------------------------------
simple 1 70% 95% 70%
4 73% 96% 74%
32 61% 102% 62%

So clearly, 0001 helps a lot, essentially eliminating the regression.
But 0002 makes it slow again, so the generation context is not a good
match here (perhaps the rd_indexcxt allocation pattern is different).

Based on this I tried a couple additional experiments:

a) switch rd_indexcxt to ALLOCSET_DEFAULT_SIZES, speculating that maybe
one larger malloc() is cheaper than multiple smaller ones

b) increasing the ALLOC_CHUNK_FRACTION from 1/4 to 1/2, so that fewer
chunks need to be allocated as a separate block

c) switch rd_indexcxt to ALLOCSET_MEDIUM_SIZES, which is the same as
SMALL_SIZES, but INITSIZE is 2kB, combined with the CHUNK_FRACTION
adjustment from (b)
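
For reference, (c) corresponds to roughly this macro --
ALLOCSET_MEDIUM_SIZES doesn't exist in core, so this is just the local
definition I used, modeled on ALLOCSET_SMALL_SIZES:

#define ALLOCSET_MEDIUM_MINSIZE  0
#define ALLOCSET_MEDIUM_INITSIZE (2 * 1024)   /* 2kB first block */
#define ALLOCSET_MEDIUM_MAXSIZE  (8 * 1024)   /* same cap as SMALL_SIZES */
#define ALLOCSET_MEDIUM_SIZES \
    ALLOCSET_MEDIUM_MINSIZE, ALLOCSET_MEDIUM_INITSIZE, ALLOCSET_MEDIUM_MAXSIZE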

The results are in the second table in the PDF. None of it helped very
much, unfortunately. Option (a) is even slower than master in some
cases, (b) helps in some cases but not as much as 0001, and even the
2kB blocks from (c) make it slow again.

So I guess something like 0001 might be the way to go ...

But doesn't it also highlight how fragile this memory allocation is?
The skip scan patch didn't do anything wrong - it just added a couple
of fields, using a little bit more memory. We understand that
allocating more memory may take more time, but we expect the effect to
be somewhat proportional, which doesn't seem to be the case here.

Many other patches add fields somewhere; it seems like bad luck that
skip scan happened to trigger this behavior. It's quite likely other
patches ran into the same issue, except that no one noticed. Maybe
skip scan did that in much hotter code, not sure.

Of course, this is not "our" issue - it seems to be glibc specific
(based on my experience with allocators in other libc libraries). Still,
it's a long-standing behavior, and I doubt it's likely to change. But
considering glibc is what most systems use, maybe we should add some
protections?

I recall there were proposals to add an optional mallopt() call to set
M_TOP_PAD when running on glibc. Maybe we should revive that. I also
had a patch to add a "memory pool", which fixed this as a side effect.
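
Something along these lines (a sketch only -- the 4MB value comes from
my earlier test, and where exactly this would be called is an open
question):

#ifdef __GLIBC__
#include <malloc.h>
#endif

/*
 * Sketch of the discussed protection: ask glibc to keep a cushion at
 * the top of the heap instead of trimming it after every query. The
 * 4MB value matches what made the regression disappear in my tests.
 */
static void
tune_glibc_malloc(void)
{
#ifdef __GLIBC__
    mallopt(M_TOP_PAD, 4 * 1024 * 1024);
    mallopt(M_TRIM_THRESHOLD, 4 * 1024 * 1024);
#endif
}

int
main(void)
{
    tune_glibc_malloc();
    /* ... startup would continue here ... */
    return 0;
}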

regards

--
Tomas Vondra

Attachment Content-Type Size
results.pdf application/pdf 32.2 KB

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-11 16:07:56
Message-ID: CAH2-WznTx=uj6z_YPp=j5EwhUvpVNxq3LUZKECadS9ZZPHqrYA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, May 10, 2025 at 10:59 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> But doesn't it also highlight how fragile this memory allocation is? The
> skip scan patch didn't do anything wrong - it just added a couple
> fields, using a little bit more memory. I think we understand allocating
> more memory may need more time, but we expect the effect to be somewhat
> proportional. Which doesn't seem to be the case here.
>
> Many other patches add fields somewhere, it seems like bad luck the skip
> scan happened to trigger this behavior. It's quite likely other patches
> ran into the same issue, except that no one noticed. Maybe the skip scan
> did that in much hotter code, not sure.

But what did the skip scan commit (specifically commit 92fe23d9,
without any of the follow-up commits) change about memory allocation,
that might be at issue with your test case? You said that that commit
"just added a couple fields". What specific fields are you talking
about, that were added by commit 92fe23d9?

I already speculated that the issue might be tied to the addition of a
new support routine (skip support), but the experiment we ran to try
to validate that theory disproved it. What else is there?

Again, commit 92fe23d9 didn't make BTScanOpaqueData any larger
(follow-up commit 8a510275 added a new BTScanOpaqueData.skipScan bool
field, but that didn't even affect sizeof BTScanOpaqueData, since the
field fit into what was previously just alignment padding space).
AFAICT nothing that seems like it might be relevant changed, apart
from the addition of the new support routine, which was ruled out
already.

--
Peter Geoghegan


From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-12 03:09:25
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 5/11/25 18:07, Peter Geoghegan wrote:
> On Sat, May 10, 2025 at 10:59 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> But doesn't it also highlight how fragile this memory allocation is? The
>> skip scan patch didn't do anything wrong - it just added a couple
>> fields, using a little bit more memory. I think we understand allocating
>> more memory may need more time, but we expect the effect to be somewhat
>> proportional. Which doesn't seem to be the case here.
>>
>> Many other patches add fields somewhere, it seems like bad luck the skip
>> scan happened to trigger this behavior. It's quite likely other patches
>> ran into the same issue, except that no one noticed. Maybe the skip scan
>> did that in much hotter code, not sure.
>
> But what did the skip scan commit (specifically commit 92fe23d9,
> without any of the follow-up commits) change about memory allocation,
> that might be at issue with your test case? You said that that commit
> "just added a couple fields". What specific fields are you talking
> about, that were added by commit 92fe23d9?
>
> I already speculated that the issue might be tied to the addition of a
> new support routine (skip support), but the experiment we ran to try
> to validate that theory disproved it. What else is there?
>

That's a good point. However, it seems I have done something wrong when
running the tests with the support routine removed :-( I just repeated
the tests, and I got this:

mode clients 3ba2cdaa454 master revert master revert
------------------------------------------------------------------
prepared 1 10860 3548 11048 33% 102%
4 25492 11299 25190 44% 99%
32 38399 14142 38493 37% 100%
------------------------------------------------------------------
simple 1 2595 1844 2604 71% 100%
4 8266 6090 8126 74% 98%
32 11765 7198 11449 61% 97%

I where "revert" is master with the removal patch. Sorry about the
confusion, I guess I was distracted and did some mistake.

So, this seems to be in line with the hypothesis ...

regards

--
Tomas Vondra


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-12 12:58:15
Message-ID: CAH2-WznJVh46LPDmkOo95zyanBdMteX5gDQpo6roZeYP4kq+bQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, May 11, 2025 at 11:09 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I where "revert" is master with the removal patch. Sorry about the
> confusion, I guess I was distracted and did some mistake.
>
> So, this seems to be in line with the hypothesis ...

That makes way more sense.

I wonder if we can fix this problem by getting rid of the old support
routine #5, "options". It currently doesn't do anything, and I always
thought it was strange that it was added "for uniformity" with other
index AMs.

OTOH, one could argue that it's only a matter of time until somebody
needs to add another support routine to nbtree; why delay dealing with
the problem that you've highlighted? Right now I don't really have an
opinion on how best to address the problem.

--
Peter Geoghegan


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-20 20:13:41
Message-ID: CAH2-Wzkk8k_7wj8UUhYb2=q_8D=-c1mtwuG4PCb7j+SNEtD3Ew@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, May 12, 2025 at 8:58 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I wonder if we can fix this problem by getting rid of the old support
> routine #5, "options". It currently doesn't do anything, and I always
> thought it was strange that it was added "for uniformity" with other
> index AMs.

Attached patch completely removes the nbtree "options" support
function, while changing the support function number of skip support:
it becomes support function #5 (the number previously used by
"options"). This patch should fix the regression that Tomas complained
about in an expedient way.

It's likely that somebody else will run into the same problem in the
future, the next time that a new support function is needed. But I
think that it makes sense to do this much now -- we need a short term
solution for Postgres 18. Usually I would never suggest breaking
compatibility like this, but, remarkably, we have never actually done
anything with our current support function 5. It's not possible to
break compatibility with code that can never be called in the first
place, so I see no compatibility to preserve.

Questions for Alexander about the "options" support function:

* Why did you invent the whole idea of an "options" support function,
given that it doesn't actually do anything? I get that it might be a
good idea to add these kinds of functions in the future, but why
didn't you wait until nbtree *actually had a use* for them?

* I've removed some of the tests that you added that (for whatever
reason) cover nbtree specifically -- the test from alter_generic.sql.
There might be some loss of test coverage. What do you think?

--
Peter Geoghegan

Attachment Content-Type Size
v1-0001-Remove-OPTIONS-support-proc-from-nbtree.patch application/octet-stream 20.7 KB

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-05-22 17:57:57
Message-ID: CAEze2WgpF7TUNCyXyb-wmUmwOui=DSdQnswxwCfvLnTm0yn6+Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 20 May 2025, 22:14 Peter Geoghegan, <pg(at)bowt(dot)ie> wrote:
>
> On Mon, May 12, 2025 at 8:58 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > I wonder if we can fix this problem by getting rid of the old support
> > routine #5, "options". It currently doesn't do anything, and I always
> > thought it was strange that it was added "for uniformity" with other
> > index AMs.
>
> Attached patch completely removes the nbtree "options" support
> function, while changing the support function number of skip support:
> it becomes support function #5 (the number previously used by
> "options"). This patch should fix the regression that Tomas complained
> about in an expedient way.

> It's likely that somebody else will run into the same problem in the
> future, the next time that a new support function is needed. But I
> think that it makes sense to do this much now -- we need a short term
> solution for Postgres 18.

Didn't we have different solutions already? E.g. dropping the
allocation requirement from IndexAM (and indeed, making it
non-required), as seen in my 0001? It's a bit of a hassle, but has
shown to have the same effect in PG18b1, doesn't break compatibility
with existing amproc registrations, and even improves the memory
situation for all other index types.

> Usually I would never suggest breaking
> compatibility like this, but, remarkably, we have never actually done
> anything with our current support function 5. It's not possible to
> break compatibility with code that can never be called in the first
> place, so I see no compatibility to preserve.

I disagree. An extension or user may well have registered FUNCTION 5
for their opclass or opfamily as a blanket, future-proofed
implementation, which will now break after pg_upgrade. We never
rejected this registration before, and while there won't be issues
with the proc signature across the versions (options' expected
signature is equivalent to that of skipsupport), it will probably
cause issues when it's called with unexpected inputs.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-06-11 22:09:55
Message-ID: CAH2-Wzm3bKcz3TbHGem3_+SinEyG=VZVPbApQghp7YiZj+MM3g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 2, 2025 at 3:04 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> The second patch is more complicated, and seems like something that
> I'll need to spend more time thinking about before proceeding with
> commit. It has subtle behavioral implications, in that it causes the
> pstate.forcenonrequired mechanism to influence when and how
> _bt_advance_array_keys schedules primitive index scans in a tiny
> minority of forward scan cases.

While I have no reason to believe that there were any problems in this
bugfix (which became commit 5f4d98d4), there were problems in the
follow-up, commit 54c6ea8c, "nbtree: Remove useless row compare arg".

I devised a new test case (which is too large to be easily posted to
this list) that shows a RowCompare query that fails with an assertion
failure (the assert added by 54c6ea8c). Here's the query:

select a, b, c, d
from fuzz_skip_scan
where b is not null and (c, d) < (60, 0)
order by a, b, c, d
limit 200 offset 80_000;

There is at least one page where _bt_set_startikey will want to apply
forcenonrequired mode while setting pstate.ikey = 1 (meaning that we
can start _bt_checkkeys calls with the "b" scan key, avoiding
maintenance of the "a" scan key). So clearly we do need support for
forcenonrequired=true + a RowCompare key. The row compare arg wasn't
so useless after all.

With commit 5f4d98d4 in place and commit 54c6ea8c reverted, everything
works here. The row compare on "(c, d)" shouldn't need to prevent
application of pstate.forcenonrequired mode here. The skip array on
"a" is bound to advance on the page anyway, so there's no risk that
we'll do the wrong thing with the RowCompare for the pstate.finaltup
call to _bt_checkkeys, which takes place after the scan's arrays have
been reset.

I'm going to revert my ill-advised commit 54c6ea8c now.

I should give this general area more thought. Some comment updates in
_bt_set_startikey seem in order, at least. But I see no reason to wait
to revert.

--
Peter Geoghegan