Collation & ctype method table, and extension hooks

Lists: pgsql-hackers
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Collation & ctype method table, and extension hooks
Date: 2024-09-26 22:30:09

The attached patch series refactors the collation and ctype behavior
into method tables, and provides a way to hook the creation of a
pg_locale_t so that an extension can create any kind of method table it
wants.

In practice, the main use is to replace, for example, ICU with a
different version of ICU. But it can also be used to control libc
behavior, or to use a different set of methods that have nothing to do
with ICU or libc.

It also isolates code to some new files: ICU code goes in
pg_locale_icu.c, and libc code goes in pg_locale_libc.c. And it reduces
a lot of code that branches on the provider. That's easier to reason
about, in my opinion.

With these patches, the collation provider becomes mainly a catalog
concept used to create the right pg_locale_t, rather than an
execution-time concept.
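
A minimal sketch of the method-table idea described above. All names here (demo_locale, demo_collate_methods, demo_strcoll) are hypothetical and are not the structs from the patch series; the point is only that the provider choice is made once, when the locale object is built, and call sites dispatch through a function pointer instead of branching on the provider.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; not the definitions used by the patches. */
typedef struct demo_locale demo_locale;

typedef struct demo_collate_methods
{
    /* Compare two NUL-terminated strings; result is <0, 0 or >0. */
    int (*strcoll) (const char *a, const char *b, const demo_locale *loc);
} demo_collate_methods;

struct demo_locale
{
    const demo_collate_methods *collate;  /* chosen once, at creation time */
    void *provider_data;                  /* opaque to the calling code */
};

/* "Provider" 1: plain bytewise comparison. */
static int
bytewise_strcoll(const char *a, const char *b, const demo_locale *loc)
{
    (void) loc;
    return strcmp(a, b);
}

/* "Provider" 2: ASCII case-insensitive comparison. */
static int
caseless_strcoll(const char *a, const char *b, const demo_locale *loc)
{
    (void) loc;
    while (*a && tolower((unsigned char) *a) == tolower((unsigned char) *b))
    {
        a++;
        b++;
    }
    return tolower((unsigned char) *a) - tolower((unsigned char) *b);
}

const demo_collate_methods bytewise_methods = { bytewise_strcoll };
const demo_collate_methods caseless_methods = { caseless_strcoll };

/* Call site: no switch on the provider, only an indirect call. */
int
demo_strcoll(const char *a, const char *b, const demo_locale *loc)
{
    return loc->collate->strcoll(a, b, loc);
}
```

A caller would build a demo_locale pointing at one of the two tables and then only ever call demo_strcoll(); swapping the table changes the behavior without touching the call site.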

We could take this further and make providers a concept in the catalog,
like "CREATE LOCALE PROVIDER", and it would just provide an arbitrary
handler function to create the pg_locale_t. If we decide how we'd like
to handle versioning, that could potentially allow a much smoother
upgrade process that preserves the provider versions.

Regards,
Jeff Davis

Attachment Content-Type Size
v5-0008-Introduce-hooks-for-creating-custom-pg_locale_t.patch text/x-patch 4.9 KB
v5-0007-Control-ctype-behavior-with-a-method-table.patch text/x-patch 26.7 KB
v5-0006-Control-case-mapping-behavior-with-a-method-table.patch text/x-patch 34.8 KB
v5-0005-Control-collation-behavior-with-a-method-table.patch text/x-patch 16.7 KB
v5-0004-Perform-provider-specific-initialization-code-in-.patch text/x-patch 18.0 KB
v5-0003-Refactor-the-code-to-create-a-pg_locale_t-into-ne.patch text/x-patch 11.3 KB
v5-0002-Move-libc-specific-code-from-pg_locale.c-into-pg_.patch text/x-patch 16.8 KB
v5-0001-Move-ICU-specific-code-from-pg_locale.c-into-pg_l.patch text/x-patch 40.0 KB

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-10-04 13:24:12

On 9/27/24 12:30 AM, Jeff Davis wrote:
> The attached patch series refactors the collation and ctype behavior
> into method tables, and provides a way to hook the creation of a
> pg_locale_t so that an extension can create any kind of method table it
> wants.

Great! I had been planning to do this myself so great to see that you
already did it before me. Will take a look at this work later.

Andreas


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-10-08 00:26:45

On Fri, 2024-10-04 at 15:24 +0200, Andreas Karlsson wrote:
> Great! I had been planning to do this myself so great to see that you
> already did it before me. Will take a look at this work later.

Great! We'll need to test whether there are any regressions in the
regex & pattern matching code due to the indirection.

What would be a good test for that? Just running it over long strings?

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-10-09 23:27:41

On Fri, 2024-10-04 at 15:24 +0200, Andreas Karlsson wrote:
> On 9/27/24 12:30 AM, Jeff Davis wrote:
> > The attached patch series refactors the collation and ctype
> > behavior
> > into method tables, and provides a way to hook the creation of a
> > pg_locale_t so that an extension can create any kind of method
> > table it
> > wants.
>
> Great! I had been planning to do this myself so great to see that you
> already did it before me. Will take a look at this work later.

Attached v6 with significant improvements, and should be easier to
review.

This removes all runtime branching for collation & ctype operations; I
even removed the "provider" field of pg_locale_t to be sure.

This series gets us to the point where it's possible (though not easy)
to completely replace the provider at runtime without missing any
capabilities.

There are many things that would be nice to improve further, such as:

* Have a CREATE LOCALE PROVIDER command and make "provider" an Oid
rather than a char ('b'/'i'/'c'). The v6 patches bring us close to
this point, but I'm not sure if we want to go this far in v18.

* Need an actual extension to prove that it works.

* Clean up the way versions are handled.

* Do we want to provide support for changing the provider at initdb
time?

* The catalog representation is not very clean or general. The libc
provider allows collation and ctype to be set separately, but they
control the environment variables, too. ICU has rules, which are
specific to ICU.

* I've tested the performance for collation and case mapping, and there
does not appear to be any overhead. I didn't observe any performance
overhead for ctype either, but I think I need a more strenuous test to
be sure.

Regards,
Jeff Davis

Attachment Content-Type Size
v6-0001-Move-ICU-specific-code-from-pg_locale.c-into-pg_l.patch text/x-patch 40.0 KB
v6-0002-Move-libc-specific-code-from-pg_locale.c-into-pg_.patch text/x-patch 26.1 KB
v6-0003-Refactor-the-code-to-create-a-pg_locale_t-into-ne.patch text/x-patch 11.3 KB
v6-0004-Perform-provider-specific-initialization-code-in-.patch text/x-patch 16.6 KB
v6-0005-Control-collation-behavior-with-a-method-table.patch text/x-patch 18.4 KB
v6-0006-Control-case-mapping-behavior-with-a-method-table.patch text/x-patch 35.7 KB
v6-0007-Control-ctype-behavior-with-a-method-table.patch text/x-patch 31.0 KB
v6-0008-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.7 KB
v6-0009-Make-provider-data-in-pg_locale_t-an-opaque-point.patch text/x-patch 21.6 KB
v6-0010-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 3.7 KB
v6-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch text/x-patch 6.6 KB

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-10-24 08:05:24

On 10/10/24 1:27 AM, Jeff Davis wrote:
> Attached v6 with significant improvements, and should be easier to
> review.
>
> This removes all runtime branching for collation & ctype operations; I
> even removed the "provider" field of pg_locale_t to be sure.

Nice! Some great changes. I did a quick review:

= General

Why is there no pg_locale_builtin.c? I feel the code would be easier to
understand for someone not familiar with it if each provider was defined
in its own file.

= v6-0003-Refactor-the-code-to-create-a-pg_locale_t-into-ne.patch

Looks good.

= v6-0004-Perform-provider-specific-initialization-code-in-.patch

I am not a fan of all the #ifdef USE_ICU in pg_locale_icu.c but I am
not sure if I have a cleaner solution.

= v6-0005-Control-collation-behavior-with-a-method-table.patch

strncoll_libc_win32_utf8 is used in one patch but then later defined in
the next patch. So it seems like you accidentally added that to the wrong
patch.

I think adding an assert to create_pg_locale() which enforces that there
is always a valid combination of collate_is_c and collate would be good.
Especially when we have the hook.
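
A minimal sketch of the kind of check suggested here. The exact invariant is an assumption (either the C fast path is flagged, or a complete method table is present), and demo_locale, demo_collate_methods and demo_collate_is_valid are hypothetical names rather than anything in the patches.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical structs, loosely modeled on the fields named in the review. */
typedef struct demo_collate_methods
{
    int (*strncoll) (const char *a, size_t alen,
                     const char *b, size_t blen);
} demo_collate_methods;

typedef struct demo_locale
{
    bool collate_is_c;                    /* callers may use memcmp() paths */
    const demo_collate_methods *collate;  /* NULL is allowed only for "C" */
} demo_locale;

/*
 * Assumed invariant: a locale either takes the C fast path, or it carries
 * a method table whose required method is filled in.
 */
bool
demo_collate_is_valid(const demo_locale *loc)
{
    if (loc->collate_is_c)
        return true;
    return loc->collate != NULL && loc->collate->strncoll != NULL;
}
```

In the backend this would presumably be wrapped in Assert() at the end of create_pg_locale(), so that a locale built by a hook with an incomplete table fails early in assert-enabled builds.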

= v6-0006-Control-case-mapping-behavior-with-a-method-table.patch

I think you forgot to remove #include <wctype.h> from formatting.c.

I need to look at it more in detail but I think this new version makes
us do extra work when ICU strings grow in length when calling upper/lower.

I think adding an assert to create_pg_locale() which enforces that there
is always a valid combination of ctype_is_c and casemap would be good,
similar to the collate field.

= v6-0007-Control-ctype-behavior-with-a-method-table.patch

Why are casemap and ctype_methods not the same struct? They seem very
closely related.

This commit makes me tempted to handle the ctype_is_c logic for
character classes also in callbacks and remove the if in functions like
pg_wc_ispunct(). But this is something that would need to be benchmarked.

I wonder if the bitmask idea isn't terrible for the branch predictor and
that we may want one function per character class, but this is yet again
something we need to benchmark.

= v6-0008-Remove-provider-field-from-pg_locale_t.patch

Looks good.

= v6-0009-Make-provider-data-in-pg_locale_t-an-opaque-point.patch

Is there a reason we allocate the icu_provider in create_pg_locale_icu
with MemoryContextAllocZero when we initialize everything anyway? And
similar for other providers.

= v6-0010-Don-t-include-ICU-headers-in-pg_locale.h.patch

Looks good.

= v6-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch

Looks good but seems like a quite painful API to use.

> * Have a CREATE LOCALE PROVIDER command and make "provider" an Oid
> rather than a char ('b'/'i'/'c'). The v6 patches bring us close to
> this point, but I'm not sure if we want to go this far in v18.

Probably necessary but I hate all the DDL commands the way the SQL
standard is written forces us to add.

> * Need an actual extension to prove that it works.
>
> * Clean up the way versions are handled.
>
> * Do we want to provide support for changing the provider at initdb
> time?

Not sure, need to think about this one.

> * The catalog representation is not very clean or general. The libc
> provider allows collation and ctype to be set separately, but they
> control the environment variables, too. ICU has rules, which are
> specific to ICU.

Yeah, would be really nice to clean this up but it might be work for a
different patch set.

Rebased patches are attached.

Andreas

Attachment Content-Type Size
v1-0001-Specialize-EEOP_-_TESTVAL-steps.patch text/x-patch 9.0 KB
v7-0001-Refactor-the-code-to-create-a-pg_locale_t-into-ne.patch text/x-patch 11.3 KB
v7-0002-Perform-provider-specific-initialization-code-in-.patch text/x-patch 16.6 KB
v7-0003-Control-collation-behavior-with-a-method-table.patch text/x-patch 18.4 KB
v7-0004-Control-case-mapping-behavior-with-a-method-table.patch text/x-patch 35.7 KB
v7-0005-Control-ctype-behavior-with-a-method-table.patch text/x-patch 31.0 KB
v7-0006-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.7 KB
v7-0007-Make-provider-data-in-pg_locale_t-an-opaque-point.patch text/x-patch 21.6 KB
v7-0008-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 3.7 KB
v7-0009-Introduce-hooks-for-creating-custom-pg_locale_t.patch text/x-patch 6.6 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-10-25 22:42:36

On Thu, 2024-10-24 at 10:05 +0200, Andreas Karlsson wrote:
> Why is there no pg_locale_builtin.c?

Just that it would be a fairly small file, but I'm fine with doing
that.

> I think adding an assert to create_pg_locale() which enforces that
> there is always a valid combination of ctype_is_c and casemap would be
> good,
> similar to the collate field.

Good idea.

> Why are casemap and ctype_methods not the same struct? They seem very
> closely related.

The code impact was in fairly different places, so it seemed like a
nice way to break it out. I could combine them, but it would be a
fairly large patch.

> This commit makes me tempted to handle the ctype_is_c logic for
> character classes also in callbacks and remove the if in functions
> like
> pg_wc_ispunct(). But this is something that would need to be
> benchmarked.

That's a good idea. The reason collate_is_c is important is because
there are quite a few caller-specific optimizations, but that doesn't
seem to be true of ctype_is_c.

> I wonder if the bitmask idea isn't terrible for the branch predictor
> and
> that we may want one function per character class, but this is yet
> again
> something we need to benchmark.

Agreed -- a lot of work has gone into optimizing the regex code, and we
don't want a perf regression there. But I'm also not sure exactly which
kinds of tests I should be running for that.

> Is there a reason we allocate the icu_provider in
> create_pg_locale_icu
> with MemoryContextAllocZero when we initialize everything anyway? And
> similar for other providers.

Allocating and zeroing is a good defense against new optional methods
and fields which can safely default to zero.
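
A small sketch of why zeroed allocation helps, using plain calloc() in place of MemoryContextAllocZero and hypothetical names (demo_provider, to_fold). It assumes, as holds on the platforms PostgreSQL supports, that an all-zero-bits pointer is a NULL pointer: a method added to the struct later is then NULL in every provider that has not been updated, and the caller can fall back safely.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical provider struct; not the one in the patches. */
typedef struct demo_provider
{
    int (*to_lower) (int c);  /* required method */
    int (*to_fold) (int c);   /* optional, added later; NULL means "unset" */
} demo_provider;

static int
ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* An "older" provider that predates to_fold and never mentions it. */
demo_provider *
demo_provider_create(void)
{
    /* calloc() zero-fills, so every field not set below starts out as 0. */
    demo_provider *p = calloc(1, sizeof(demo_provider));

    if (p == NULL)
        return NULL;
    p->to_lower = ascii_lower;
    return p;
}

/* Caller: use the optional method when present, else fall back. */
int
demo_fold(const demo_provider *p, int c)
{
    if (p->to_fold != NULL)
        return p->to_fold(c);
    return p->to_lower(c);
}
```

With a non-zeroing allocator, to_fold would hold garbage in the older provider and demo_fold() would jump through an uninitialized pointer.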

> = v6-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch
>
> Looks good but seems like a quite painful API to use.

How is it painful and can we make it better?

> > * Have a CREATE LOCALE PROVIDER command and make "provider" an Oid
> > rather than a char ('b'/'i'/'c'). The v6 patches bring us close to
> > this point, but I'm not sure if we want to go this far in v18.
>
> Probably necessary but I hate all the DDL commands the way the SQL
> standard is written forces us to add.

There is some precedent for a DDL-like thing without new grammar:
pg_replication_origin_create(). I don't have a strong opinion on
whether to do that or not.

Regards,
Jeff Davis


From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-11-01 13:08:07

On 10/26/24 12:42 AM, Jeff Davis wrote:
> On Thu, 2024-10-24 at 10:05 +0200, Andreas Karlsson wrote:
>> Why is there no pg_locale_builtin.c?
>
> Just that it would be a fairly small file, but I'm fine with doing
> that.

I think adding such a small file would make life easier for people new
to the collation part of the code base. It would be a nice symmetry
between collation providers and where code for them can be found.

>> Why are casemap and ctype_methods not the same struct? They seem very
>> closely related.
>
> The code impact was in fairly different places, so it seemed like a
> nice way to break it out. I could combine them, but it would be a
> fairly large patch.

For me combining them would make the intention of the code easier to
understand since aren't the casemap functions just a set of "ctype_methods"?

>> This commit makes me tempted to handle the ctype_is_c logic for
>> character classes also in callbacks and remove the if in functions
>> like
>> pg_wc_ispunct(). But this is something that would need to be
>> benchmarked.
>
> That's a good idea. The reason collate_is_c is important is because
> there are quite a few caller-specific optimizations, but that doesn't
> seem to be true of ctype_is_c.

Yeah, that was my thought too but I have not confirmed it.

>> I wonder if the bitmask idea isn't terrible for the branch predictor
>> and
>> that we may want one function per character class, but this is yet
>> again
>> something we need to benchmark.
>
> Agreed -- a lot of work has gone into optimizing the regex code, and we
> don't want a perf regression there. But I'm also not sure exactly which
> kinds of tests I should be running for that.

I think we should at least try to find the worst case to see how big the
performance hit for that is. And then after that try to figure out a
more typical case benchmark.

>> = v6-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch
>>
>> Looks good but seems like a quite painful API to use.
>
> How is it painful and can we make it better?

The painful part was mostly just a reference to the fact that, without a
catalog table where new providers can be added, we would need to add
collations for our new custom provider on some already existing provider
and then do, for example, some pattern matching on the name of the new
collation. Really ugly, but it works.
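
A rough sketch of the workaround described above, with entirely hypothetical names (demo_create_locale_hook, the "x-custom-" prefix); the hook signature in the actual patch may look quite different. The extension installs a hook and claims collations purely by matching on their name, declining everything else so the built-in path still runs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a locale object. */
typedef struct demo_locale
{
    const char *provider_name;
} demo_locale;

/* Hypothetical hook type: return NULL to decline the collation. */
typedef const demo_locale *(*demo_create_locale_hook_type) (const char *collname);

demo_create_locale_hook_type demo_create_locale_hook = NULL;

static const demo_locale builtin_locale = { "builtin" };
static const demo_locale custom_locale = { "custom" };

/* Core side: give the hook first refusal, then fall back. */
const demo_locale *
demo_create_locale(const char *collname)
{
    if (demo_create_locale_hook != NULL)
    {
        const demo_locale *loc = demo_create_locale_hook(collname);

        if (loc != NULL)
            return loc;
    }
    return &builtin_locale;
}

/* Extension side: claim collations whose name starts with "x-custom-". */
const demo_locale *
custom_hook(const char *collname)
{
    static const char prefix[] = "x-custom-";

    if (strncmp(collname, prefix, sizeof(prefix) - 1) == 0)
        return &custom_locale;
    return NULL;
}
```

The name matching is the ugly part: nothing in the catalog says which provider owns the collation, so the convention lives only in the extension's code.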

I am thinking of implementing ICU4x as an external extension to try out
the hook, but for the in-core contrib module we likely want to use
something which does not require an external dependency. Or what do you
think?

Andreas


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-11-01 18:17:09

On Fri, 2024-11-01 at 14:08 +0100, Andreas Karlsson wrote:
> > Agreed -- a lot of work has gone into optimizing the regex code,
> > and we
> > don't want a perf regression there. But I'm also not sure exactly
> > which
> > kinds of tests I should be running for that.
>
> I think we should at least try to find the worst case to see how big
> the
> performance hit for that is. And then after that try to figure out a
> more typical case benchmark.

What I had in mind was:

* a large table with a single ~100KiB text field
* a scan with a case insensitive regex that uses some character
classes

Does that sound like a worst case?

> The painful part was mostly just a reference to that without a
> catalog
> table where new providers can be added we would need to add
> collations
> for our new custom provider on some already existing provider and
> then
> do for example some pattern matching on the name of the new
> collation.
> Really ugly but works.

To add a catalog table for the locale providers, the main challenge is
around the database default collation and, relatedly, initdb. Do you
have some ideas around that?

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-11-19 21:32:47

On Fri, 2024-11-01 at 14:08 +0100, Andreas Karlsson wrote:
> I think adding such a small file would make life easier for people
> new
> to the collation part of the code base. It would be a nice symmetry
> between collation providers and where code for them can be found.

Done.

> >
> For me combining them would make the intention of the code easier to
> understand since aren't the casemap functions just a set of
> "ctype_methods"?

Done.

There is a bit of weirdness in libc because:

* Single byte encodings use the single-byte isupper(), toupper(), etc.
* UTF8 encoding uses wide character iswupper(), towupper(), etc.
* Non-UTF8 multibyte encodings use isupper() for pattern matching but
towupper() for case mapping

That weirdness existed before, but it's a bit more obvious what's
happening now.
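
The three libc cases listed above could be pictured as three method tables picked by encoding class. This is only an illustration with hypothetical names (demo_enc_class, demo_ctype_methods); the real code has more methods and more conditions than shown here.

```c
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

typedef enum demo_enc_class
{
    DEMO_ENC_SINGLE_BYTE,
    DEMO_ENC_UTF8,
    DEMO_ENC_OTHER_MULTIBYTE
} demo_enc_class;

/* Hypothetical table: one classification and one case-mapping method. */
typedef struct demo_ctype_methods
{
    int (*wc_isupper) (wint_t wc);
    wint_t (*wc_toupper) (wint_t wc);
} demo_ctype_methods;

/* Single-byte flavor: only meaningful for values that fit in a char. */
static int
sb_isupper(wint_t wc)
{
    return wc <= (wint_t) UCHAR_MAX && isupper((unsigned char) wc) != 0;
}

static wint_t
sb_toupper(wint_t wc)
{
    if (wc <= (wint_t) UCHAR_MAX)
        return (wint_t) toupper((unsigned char) wc);
    return wc;
}

/* Wide-character flavor. */
static int
wide_isupper(wint_t wc)
{
    return iswupper(wc) != 0;
}

static wint_t
wide_toupper(wint_t wc)
{
    return towupper(wc);
}

static const demo_ctype_methods methods_single_byte = { sb_isupper, sb_toupper };
static const demo_ctype_methods methods_utf8 = { wide_isupper, wide_toupper };
/* The odd one: single-byte classification, wide-character case mapping. */
static const demo_ctype_methods methods_other_mb = { sb_isupper, wide_toupper };

const demo_ctype_methods *
demo_ctype_methods_for(demo_enc_class enc)
{
    switch (enc)
    {
        case DEMO_ENC_SINGLE_BYTE:
            return &methods_single_byte;
        case DEMO_ENC_UTF8:
            return &methods_utf8;
        case DEMO_ENC_OTHER_MULTIBYTE:
            return &methods_other_mb;
    }
    return &methods_single_byte;  /* not reached for valid input */
}
```

Laid out this way, the mixed third table makes the pre-existing inconsistency visible in one place rather than scattered across call sites.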

> > > This commit makes me tempted to handle the ctype_is_c logic for
> > > character classes also in callbacks and remove the if in
> > > functions
> > > like
> > > pg_wc_ispunct(). But this is something that would need to be
> > > benchmarked.

I like this idea, but it can be a follow up.

Attached new patchset.

I also tried some performance tests again. I used smalltext (a table of
10M ~30-character strings) and bigtext (a table of 32768 rows, each
containing the 100KiB source of https://en.wikipedia.org/wiki/Diacritic
). And I then ran the following regex on each:

select count(*) from thetable
where t ~
'[[:digit:]][[:space:]][[:punct:]][[:alpha:]][[:lower:]][[:upper:]]';

for "C", "en_US", and "en-US-x-icu". The timings for smalltext were
indistinguishable between master and the patched version. The timings
for bigtext were pretty noisy so it's hard to tell if there was a
regression or not, but I saw some evidence in the profile that
char_properties has a cost (~1%). I'm not sure if that's a significant
concern or not.

Which API do you think is the right one? Individual functions testing
individual properties, or something like char_properties() that can
test several at once?
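
To make the two API shapes in this question concrete, here is a sketch using libc's classification functions and hypothetical names (demo_char_properties, DEMO_ISDIGIT and so on). The mask layout and signatures are assumptions, not the ones in the patch. Both styles expect c to be in the unsigned char range.

```c
#include <ctype.h>

#define DEMO_ISDIGIT (1 << 0)
#define DEMO_ISALPHA (1 << 1)
#define DEMO_ISLOWER (1 << 2)
#define DEMO_ISUPPER (1 << 3)

/*
 * Style A: one method tests several properties at once and returns the
 * subset of "mask" that holds for c. One indirect call per character,
 * but a data-dependent branch per requested property.
 */
int
demo_char_properties(int c, int mask)
{
    int result = 0;

    if ((mask & DEMO_ISDIGIT) && isdigit(c))
        result |= DEMO_ISDIGIT;
    if ((mask & DEMO_ISALPHA) && isalpha(c))
        result |= DEMO_ISALPHA;
    if ((mask & DEMO_ISLOWER) && islower(c))
        result |= DEMO_ISLOWER;
    if ((mask & DEMO_ISUPPER) && isupper(c))
        result |= DEMO_ISUPPER;
    return result;
}

/*
 * Style B: one method per property. More entries in the method table,
 * but each one is a thin wrapper with no mask to decode.
 */
int
demo_isdigit(int c)
{
    return isdigit(c) != 0;
}

int
demo_isalpha(int c)
{
    return isalpha(c) != 0;
}
```

Which style is faster under the regex code is exactly the open benchmarking question; the sketch only shows where the extra branches in the combined form come from.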

Regards,
Jeff Davis

Attachment Content-Type Size
v8-0001-Perform-provider-specific-initialization-code-in-.patch text/x-patch 18.5 KB
v8-0002-Control-collation-behavior-with-a-method-table.patch text/x-patch 18.9 KB
v8-0003-Control-ctype-behavior-internally-with-a-method-t.patch text/x-patch 63.2 KB
v8-0004-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.8 KB
v8-0005-Make-provider-data-in-pg_locale_t-an-opaque-point.patch text/x-patch 21.5 KB
v8-0006-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 3.7 KB
v8-0007-Introduce-hooks-for-creating-custom-pg_locale_t.patch text/x-patch 6.5 KB

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-12-02 15:39:11

Hi,

I have not looked at the later patches in the series yet as I got
sidetracked while reviewing and decided to clean up some related
collation things which I added to the patch set (feel free to ignore
them if you want). The goal of my added patches is to move provider
specific code into fewer places and not have provider specific logic all
over the codebase.

I feel your first patch in the series is something you can just commit.
It looks good and is simple, obvious refactoring. In theory we could
share the code which does the lookup in the catalog table but I do not
think it would be worth it. I fixed a small issue with it and the
function prototypes in pg_collation.c.

I will look at the rest of your patches later.

My patches:

= v9-0002-Move-check-for-ucol_strcollUTF8-to-pg_locale_icu..patch

Broken out from v9-0010-Don-t-include-ICU-headers-in-pg_locale.h.patch.

= v9-0003-Move-code-for-collation-version-into-provider-spe.patch

Moves some code from pg_collate.c into provider specific files.

= v9-0004-Move-ICU-database-encoding-check-into-validation-.patch

Makes the ICU code more similar to the built-in provider plus reduces
some code duplication. I feel we could go one step further and also only
normalize built-in when "if (!IsBinaryUpgrade && dblocale !=
src_locale)" but I leave that for another patch if that is something we
actually want to unify.

= v9-0005-Move-provider-specific-code-when-looking-up-local.patch

I did not like how namespace.c had knowledge of ICU.

Andreas

Attachment Content-Type Size
v9-0001-Perform-provider-specific-initialization-code-in-.patch text/x-patch 18.6 KB
v9-0002-Move-check-for-ucol_strcollUTF8-to-pg_locale_icu..patch text/x-patch 1.8 KB
v9-0003-Move-code-for-collation-version-into-provider-spe.patch text/x-patch 10.8 KB
v9-0004-Move-ICU-database-encoding-check-into-validation-.patch text/x-patch 4.6 KB
v9-0005-Move-provider-specific-code-when-looking-up-local.patch text/x-patch 2.6 KB
v9-0006-Control-collation-behavior-with-a-method-table.patch text/x-patch 19.1 KB
v9-0007-Control-ctype-behavior-internally-with-a-method-t.patch text/x-patch 63.4 KB
v9-0008-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.8 KB
v9-0009-Make-provider-data-in-pg_locale_t-an-opaque-point.patch text/x-patch 21.6 KB
v9-0010-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 2.8 KB
v9-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch text/x-patch 6.7 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-12-03 07:58:36

On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> I feel your first patch in the series is something you can just
> commit.

Done.

I combined your patches and mine into the attached v10 series.

I also split out the ctype methods patch into two, so that patch
v10-0005 moves all of the case mapping code into the appropriate
provider files. That should make the ctype methods patch (v10-0007)
easier to review.

Regards,
Jeff Davis

Attachment Content-Type Size
v10-0001-Move-check-for-ucol_strcollUTF8-to-pg_locale_icu.patch text/x-patch 1.8 KB
v10-0002-Move-code-for-collation-version-into-provider-sp.patch text/x-patch 10.8 KB
v10-0003-Move-ICU-database-encoding-check-into-validation.patch text/x-patch 4.6 KB
v10-0004-Move-provider-specific-code-when-looking-up-loca.patch text/x-patch 2.6 KB
v10-0005-Refactor-case-mapping-into-provider-specific-fil.patch text/x-patch 38.1 KB
v10-0006-Control-collation-behavior-with-a-method-table.patch text/x-patch 19.3 KB
v10-0007-Control-ctype-behavior-internally-with-a-method-.patch text/x-patch 40.7 KB
v10-0008-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.8 KB
v10-0009-Make-provider-data-in-pg_locale_t-an-opaque-poin.patch text/x-patch 21.7 KB
v10-0010-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 2.8 KB
v10-0011-Introduce-hooks-for-creating-custom-pg_locale_t.patch text/x-patch 6.4 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-12-05 00:21:34

On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> My patches:
>
> = v9-0002-Move-check-for-ucol_strcollUTF8-to-pg_locale_icu..patch

Committed.

> = v9-0003-Move-code-for-collation-version-into-provider-spe.patch
>
> Moves some code from pg_collate.c into provider specific files.

I agree with the general idea, but it seems we are accumulating a lot
of provider-specific functions. Should we define a provider struct with
its own methods?

That would be a good step toward making the provider catalog-driven.
Even if we don't support CREATE LOCALE PROVIDER, having space in the
catalog would be a good place to track the provider version.

> = v9-0004-Move-ICU-database-encoding-check-into-validation-.patch

This seems to be causing a test failure in 020_createdb.pl.

> = v9-0005-Move-provider-specific-code-when-looking-up-local.patch
>
> I did not like how namespace.c had knowledge of ICU.

See comments above about v9-0003.

Regards,
Jeff Davis


From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-12-20 05:48:54

On 12/5/24 1:21 AM, Jeff Davis wrote:
>> = v9-0003-Move-code-for-collation-version-into-provider-spe.patch
>>
>> Moves some code from pg_collate.c into provider specific files.
>
> I agree with the general idea, but it seems we are accumulating a lot
> of provider-specific functions. Should we define a provider struct with
> its own methods?
>
> That would be a good step toward making the provider catalog-driven.
> Even if we don't support CREATE LOCALE PROVIDER, having space in the
> catalog would be a good place to track the provider version.

Yeah, that was my idea too but I just have not gotten around to it yet.

>> = v9-0004-Move-ICU-database-encoding-check-into-validation-.patch
>
> This seems to be causing a test failure in 020_createdb.pl.

Thanks, I have attached a fixup commit for this.

Andreas

Attachment Content-Type Size
0001-fixup-Move-ICU-database-encoding-check-into-validati.patch text/x-patch 865 bytes

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-01-10 00:19:58

On Mon, 2024-12-02 at 23:58 -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> > I feel your first patch in the series is something you can just
> > commit.
>
> Done.
>
> I combined your patches and mine into the attached v10 series.

Here's v12 after committing a few of the earlier patches.

I changed the ctype method table to have separate methods for isdigit,
isalpha, etc., instead of the combined char_properties method. That's
more consistent with how things are currently done.

I may still be seeing a tiny perf regression using the same test as
[1], but I don't expect it to have a practical impact. Let me know if
you think that's a problem.

I committed your change to move the version reporting into the
provider-specific files.

Your other change to lookup_collation() in namespace.c should also
account for the code in DefineCollation() -- I don't think it makes
sense to refactor one without the other.

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/78a1b434ff40510dc5aaabe986299a09f4da90cf.camel%40j-davis.com

Attachment Content-Type Size
v12-0001-Control-ctype-behavior-internally-with-a-method-.patch text/x-patch 40.6 KB
v12-0002-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.8 KB
v12-0003-Make-provider-data-in-pg_locale_t-an-opaque-poin.patch text/x-patch 22.7 KB
v12-0004-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 2.8 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-01-15 20:42:46

On Thu, 2025-01-09 at 16:19 -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 23:58 -0800, Jeff Davis wrote:
> > On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> > > I feel your first patch in the series is something you can just
> > > commit.
> >
> > Done.
> >
> > I combined your patches and mine into the attached v10 series.
>
> Here's v12 after committing a few of the earlier patches.

I collected some performance numbers for a worst case on UTF8. This is
where each row is a million characters wide and each character is
greater than MAX_SIMPLE_CHAR (U+07FF):

create table wide (t text);
insert into wide
select repeat('カ', 1048576)
from generate_series(1,1000) g;

select 1 from wide where t ~ '([[:punct:]]|[[:lower:]])'
collate "the_collation";

results:
               master  patched
C                3736     3589
pg_c_utf8       19500    23404
en_US           10251    12396
en-US-x-icu     10264    11963

And a separate test for ILIKE on en_US.iso885915 where each character
is beyond the ASCII range and needs to be lowercased using the
optimization for single-byte encodings in Generic_Text_IC_like:

create table sb (t text);
insert into sb
select repeat('É', 1048576)
from generate_series(1, 3000) g;

select 1 from sb where t ilike '%á%';

results:

               master  patched
C                2900     2812
en_US            2203     3702
en-US-x-icu     17483    18123

The numbers from both tests show a slowdown. The worst one is probably
tolower() for libc in LATIN9, which appears to be heavily optimized,
and the extra indirection for a method call slows things down quite a
bit.
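
To make the indirection concrete, here is a minimal sketch of what a
ctype method table looks like in general. It is not the code from the
patches: every name (sketch_locale, sketch_ctype_methods, and so on) is
made up for illustration, and the "provider" only knows ASCII. The
point is the shape of the call site, which no longer branches on the
provider but pays one indirect call per byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical; this is not the struct from the patches. */
typedef struct sketch_locale sketch_locale;

typedef struct sketch_ctype_methods
{
	unsigned char (*char_tolower) (unsigned char ch, const sketch_locale *loc);
	bool		(*char_is_lower) (unsigned char ch, const sketch_locale *loc);
} sketch_ctype_methods;

struct sketch_locale
{
	const sketch_ctype_methods *ctype;	/* chosen once, at creation time */
};

/* One provider's implementation: ASCII-only rules, for illustration. */
static unsigned char
ascii_tolower(unsigned char ch, const sketch_locale *loc)
{
	(void) loc;
	return (ch >= 'A' && ch <= 'Z') ? (unsigned char) (ch - 'A' + 'a') : ch;
}

static bool
ascii_is_lower(unsigned char ch, const sketch_locale *loc)
{
	(void) loc;
	return ch >= 'a' && ch <= 'z';
}

static const sketch_ctype_methods ascii_ctype_methods = {
	.char_tolower = ascii_tolower,
	.char_is_lower = ascii_is_lower,
};

const sketch_locale sketch_ascii_locale = {&ascii_ctype_methods};

/* Call site: no branch on the provider, but one indirect call per byte. */
void
sketch_lower_buffer(unsigned char *buf, size_t len, const sketch_locale *loc)
{
	assert(loc != NULL && loc->ctype != NULL);
	for (size_t i = 0; i < len; i++)
		buf[i] = loc->ctype->char_tolower(buf[i], loc);
}
```

In a loop like sketch_lower_buffer(), a direct call to a heavily
optimized tolower() is swapped for a call through a function pointer on
every byte, which is plausibly where a regression of this kind would
show up.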

This is a bit unfortunate because the method table feels like the right
code organization. Having special cases at the call sites (aside from
ctype_is_c) is not great. Are the above numbers bad enough that we need
to give up on this method-ization approach? Or should we say that the
above cases don't represent reality, and a moderate regression there is
OK?

Or perhaps someone has an idea how to mitigate the regression? I could
imagine another cache of character properties, like an extensible
pg_char_properties. I'm not sure if the extra complexity is worth it,
though.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-01-18 06:42:15
Message-ID: [email protected]
Lists: pgsql-hackers

On Wed, 2025-01-15 at 12:42 -0800, Jeff Davis wrote:
> > Here's v12 after committing a few of the earlier patches.

And here's v14, just a rebase.

> I collected some performance numbers for a worst case on UTF8.

I'm still inclined to think the method table is a good thing to do:

(a) The performance cases I tried seem implausibly bad -- running
character classification patterns over large fields consisting only of
codepoints over U+07FF.

(b) The method tables seem like a better code organization that
separates the responsibilities of the provider from the calling code.
It's also a requirement (or nearly so) if we want to provide some
pluggability or support multiple library versions.

It would be good to hear from others on these points, though.

Regards,
Jeff Davis

Attachment Content-Type Size
v14-0001-Control-ctype-behavior-internally-with-a-method-.patch text/x-patch 41.0 KB
v14-0002-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.8 KB
v14-0003-Make-provider-data-in-pg_locale_t-an-opaque-poin.patch text/x-patch 25.3 KB
v14-0004-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 2.8 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-02-07 19:19:21
Message-ID: [email protected]
Lists: pgsql-hackers

> I'm still inclined to think the method table is a good thing to do:
>
> (a) The performance cases I tried seem implausibly bad -- running
> character classification patterns over large fields consisting only
> of codepoints over U+07FF.
>
> (b) The method tables seem like a better code organization that
> separates the responsibilities of the provider from the calling code.
> It's also a requirement (or nearly so) if we want to provide some
> pluggability or support multiple library versions.
>
> It would be good to hear from others on these points, though.

Attached v15. Just a rebase.

I'd still like some input here. We could either:

* commit this on the grounds that it's a desirable code improvement and
the worst-case regression isn't a major concern; or

* wait until v19 when we might have a more compelling use for the
method table (e.g. pluggable provider or multilib)

Regards,
Jeff Davis

Attachment Content-Type Size
v15-0001-Control-ctype-behavior-internally-with-a-method-.patch text/x-patch 43.0 KB
v15-0002-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.8 KB
v15-0003-Make-provider-data-in-pg_locale_t-an-opaque-poin.patch text/x-patch 25.6 KB
v15-0004-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 2.8 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-06-12 05:49:17
Message-ID: [email protected]
Lists: pgsql-hackers

On Fri, 2025-02-07 at 11:19 -0800, Jeff Davis wrote:
>
> Attached v15. Just a rebase.

Attached v16.

> * commit this on the grounds that it's a desirable code improvement
> and the worst-case regression isn't a major concern; or

I plan to commit this soon after branching. There's a general consensus
that enabling multi-lib provider support is a good idea, and turning
the provider behavior into method tables is a prerequisite for that. I
doubt the performance issue will be a serious concern and I don't see a
good way to avoid it.

Regards,
Jeff Davis

Attachment Content-Type Size
v16-0001-Control-ctype-behavior-internally-with-a-method-.patch text/x-patch 49.0 KB
v16-0002-Remove-provider-field-from-pg_locale_t.patch text/x-patch 4.8 KB
v16-0003-Make-provider-data-in-pg_locale_t-an-opaque-poin.patch text/x-patch 26.3 KB
v16-0004-Don-t-include-ICU-headers-in-pg_locale.h.patch text/x-patch 2.8 KB

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-06-29 10:43:41
Message-ID: [email protected]
Lists: pgsql-hackers

On 12.06.25 07:49, Jeff Davis wrote:
> On Fri, 2025-02-07 at 11:19 -0800, Jeff Davis wrote:
>>
>> Attached v15. Just a rebase.
>
> Attached v16.
>
>> * commit this on the grounds that it's a desirable code improvement
>> and the worst-case regression isn't a major concern; or
>
> I plan to commit this soon after branching. There's a general consensus
> that enabling multi-lib provider support is a good idea, and turning
> the provider behavior into method tables is a prerequisite for that. I
> doubt the performance issue will be a serious concern and I don't see a
> good way to avoid it.

Patch 0001 and 0002 seem okay to me.

I wish we could take this further and also run the "ctype is c" case
through the method table. Right now, there are still a bunch of
open-coded special cases all over the place, which could be unified. I
guess this isn't any worse than before, but maybe this could be a future
project?

Patch 0003 I don't understand. It replaces type safety by no type
safety, and it doesn't have any explanation or comments. I suppose you
have further plans in this direction, but until we have seen those and
have more clarification and explanation, I would hold this back.

Patch 0004 seems ok. But maybe you could explain this better in the
commit message, like remove includes from pg_locale.h but instead put
them in the .c files as needed, and explain why this is possible or
suitable now.


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-06-30 19:21:47
Message-ID: [email protected]
Lists: pgsql-hackers

On Sun, 2025-06-29 at 12:43 +0200, Peter Eisentraut wrote:
> I wish we could take this further and also run the "ctype is c" case
> through the method table. Right now, there are still a bunch of
> open-coded special cases all over the place, which could be unified.
> I guess this isn't any worse than before, but maybe this could be a
> future project?

+1. A few things need to be sorted out, but I don't see any major
problem with that.

> Patch 0003 I don't understand. It replaces type safety by no type
> safety, and it doesn't have any explanation or comments. I suppose
> you have further plans in this direction, but until we have seen
> those and have more clarification and explanation, I would hold this
> back.

Part of it is simply #include cleanliness, because we can't do v16-0004
if we have the provider-specific details in the union. I don't really
like the idea of including ICU headers (indirectly) in so many places.
Another part is that I'd like to abstract the providers more completely
-- I've alluded to that a few times but I haven't made an independent
proposal for that yet. Also, the union doesn't offer a lot of type
safety, so I don't see it as a big loss.
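
For readers following along, the opaque-pointer arrangement under
discussion looks roughly like the sketch below. This is a simplified
illustration with hypothetical names (sketch_locale,
sketch_bytewise_data), not the contents of v16-0003: the "public" struct
carries only a method and a void pointer, and the provider-specific
struct is known only to the provider's own code, so no provider headers
are needed where the public struct is declared.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Public part (header-like): no provider-specific types or headers. */
typedef struct sketch_locale
{
	int			(*compare) (const char *a, const char *b,
							const struct sketch_locale *loc);
	void	   *provider_data;	/* opaque outside the provider's code */
} sketch_locale;

/* Private part: would live in the provider's own .c file. */
typedef struct sketch_bytewise_data
{
	int			reverse;		/* hypothetical provider-specific setting */
} sketch_bytewise_data;

static int
sketch_bytewise_compare(const char *a, const char *b,
						const sketch_locale *loc)
{
	/* Unchecked conversion from void *: this is the type-safety cost. */
	const sketch_bytewise_data *data = loc->provider_data;
	int			r = strcmp(a, b);

	r = (r > 0) - (r < 0);		/* normalize to -1, 0, or 1 */
	return data->reverse ? -r : r;
}

sketch_locale *
sketch_create_bytewise_locale(int reverse)
{
	sketch_locale *loc = malloc(sizeof(*loc));
	sketch_bytewise_data *data = malloc(sizeof(*data));

	assert(loc != NULL && data != NULL);
	data->reverse = reverse;
	loc->compare = sketch_bytewise_compare;
	loc->provider_data = data;
	return loc;
}
```

The compiler cannot check that provider_data really points at a
sketch_bytewise_data, which is the concern raised above. A union has a
similar gap, since reading the wrong member also compiles cleanly, but
it does at least name the possible types in one place.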

But it's not critical right now either, so I won't push for it.

> Patch 0004 seems ok.  But maybe you could explain this better in the
> commit message, like remove includes from pg_locale.h but instead put
> them in the .c files as needed, and explain why this is possible or
> suitable now.

It goes with v16-0003, so I will hold this back for now as well.

Regards,
Jeff Davis