Lists: | pgsql-hackers |
---|
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Tsvector editing functions |
Date: | 2015-10-05 17:29:29 |
Message-ID: | [email protected] |
Hello.
Here is a patch that adds some editing routines for the tsvector type.
tsvector delete(tsvector, text)
removes an entry from a tsvector by lexeme name
set unnest(tsvector)
expands a tsvector to a set of rows. Each row has the following columns: lexeme, postings, weights.
text[] to_array(tsvector)
converts a tsvector to an array of lexemes
tsvector to_tsvector(text[])
converts an array of lexemes to a tsvector
Attachment | Content-Type | Size |
---|---|---|
tsvector_funcs.diff | application/octet-stream | 15.1 KB |
unknown_filename | text/plain | 96 bytes |
From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Tsvector editing functions |
Date: | 2015-10-06 22:21:33 |
Message-ID: | CA+TgmoYAQHp6vgUU-Ma4A-19rpLm4cTSUvfV24z8-0Aupea5gA@mail.gmail.com |
On Mon, Oct 5, 2015 at 1:29 PM, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
> Hello.
>
> There is patch that adds some editing routines for tsvector type.
>
> tsvector delete(tsvector, text)
> removes entry from tsvector by lexeme name
> set unnest(tsvector)
> expands a tsvector to a set of rows. Each row has following columns: lexeme, postings, weights.
> text[] to_array(tsvector)
> converts tsvector to array of lexemes
> tsvector to_tsvector(text[])
> converts array of lexemes to tsvector
When submitting a patch, it's a good idea to explain why someone would
want the feature you are adding. Maybe that's obvious to you, but it
isn't clear to me why we'd want this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Tsvector editing functions |
Date: | 2015-10-08 15:11:21 |
Message-ID: | [email protected] |
>> There is patch that adds some editing routines for tsvector type.
...
> When submitting a patch, it's a good idea to explain why someone would
> want the feature you are adding. Maybe that's obvious to you, but it
> isn't clear to me why we'd want this.
>
Some examples:
tsvector delete(tsvector, text)
removes a wrongly indexed word (maybe a stop word)
text[] to_array(tsvector)
In my practice, I needed it to work with the smlar module.
tsvector to_tsvector(text[])
Converts a list of tags to a tsvector, because search in a tsvector is more
flexible and faster than the array equivalents
set unnest(tsvector)
Computes some complicated statistics.
These functions are mostly needed in utility processing rather than in the regular workflow.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Tsvector editing functions |
Date: | 2015-11-27 11:38:26 |
Message-ID: | [email protected] |
1 Please make the patch compilable with current master.
cd ../../../src/include/catalog && '/usr/local/bin/perl' ./duplicate_oids
3315
3316
2 lexin = TextDatumGetCString(PG_GETARG_DATUM(1))
lexin_len = strlen(lexin)
Why do you use a C string instead of just text? I suppose this is much better:
t = PG_GETARG_TEXT_P(1)
lexin = VARDATA(t)
lexin_len = VARSIZE_ANY_EXHDR(t)
3 Why do you use linear search in the tsvector instead of binary search? It could
hurt performance.
4 Again, calling BuildTupleFromCStrings() is not optimal.
5 Printing weights as numbers is not consistent with other uses of weights in
FTS. Lexeme weights are referred to as one of A, B, C, D, and the default weight is D.
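As an illustration of point 3, here is a minimal sketch of a binary search over a sorted array of lexemes. It is not the patch's code and does not use the real tsvector layout: the Lexeme struct and the names lexeme_cmp and find_lexeme are hypothetical, and the sort order (bytes first, then length) is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a tsvector entry: a non-terminated byte string. */
typedef struct
{
    const char *data;
    size_t      len;
} Lexeme;

/* Order by bytes first, then by length (assumed sort order for this sketch). */
static int
lexeme_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t  minlen = (alen < blen) ? alen : blen;
    int     cmp = memcmp(a, b, minlen);

    if (cmp != 0)
        return cmp;
    if (alen == blen)
        return 0;
    return (alen < blen) ? -1 : 1;
}

/* Binary search; returns the index of the match, or -1 if the key is absent. */
static int
find_lexeme(const Lexeme *entries, int nentries, const char *key, size_t keylen)
{
    int lo = 0;
    int hi = nentries;          /* search the half-open range [lo, hi) */

    assert(nentries >= 0);
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        int cmp = lexeme_cmp(key, keylen, entries[mid].data, entries[mid].len);

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid;           /* key sorts before entries[mid] */
        else
            lo = mid + 1;       /* key sorts after entries[mid] */
    }
    return -1;
}
```

With n sorted entries this needs roughly log2(n) comparisons per lookup, where a linear scan may need up to n.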
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Subject: | Re: Tsvector editing functions |
Date: | 2015-11-27 12:13:44 |
Message-ID: | [email protected] |
Hmm, it seems it would be useful to add two functions:
tsvector filter(tsvector, array_of_weights) - returns a tsvector containing only the lexemes
with the given weights
tsvector setweight(tsvector, weight, array_of_lexemes) - sets the given weight for
the given lexemes
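To make the proposed filter() semantics concrete, here is a small sketch that works on the position list of a single lexeme. The WeightedPos struct and the name filter_positions are hypothetical, and dropping a lexeme whose positions are all filtered out is an assumption of this sketch, not something stated in the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical position entry: a word position plus its weight letter. */
typedef struct
{
    int     pos;
    char    weight;             /* one of 'A', 'B', 'C', 'D' */
} WeightedPos;

/*
 * Copy into out[] the positions whose weight letter appears in the
 * NUL-terminated string `weights` (for example "AB").
 * Returns the number of positions kept; 0 means the caller would
 * drop the lexeme altogether (assumed behaviour).
 */
static size_t
filter_positions(const WeightedPos *in, size_t n,
                 const char *weights, WeightedPos *out)
{
    size_t  i;
    size_t  kept = 0;

    assert(weights != NULL);
    for (i = 0; i < n; i++)
    {
        /* guard against '\0', which strchr() would match at the terminator */
        if (in[i].weight != '\0' && strchr(weights, in[i].weight) != NULL)
            out[kept++] = in[i];
    }
    return kept;
}
```

The output buffer must have room for n entries, since in the worst case every position is kept.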
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2015-12-07 14:05:31 |
Message-ID: | [email protected] |
Hello.
Done with the list of suggestions. I also heavily edited the delete function.
Attachment | Content-Type | Size |
---|---|---|
tsvector_ops.diff | application/octet-stream | 29.0 KB |
unknown_filename | text/plain | 1.2 KB |
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2015-12-15 03:07:59 |
Message-ID: | [email protected] |
Hi,
On 12/07/2015 03:05 PM, Stas Kelvich wrote:
> Hello.
>
> Done with the list of suggestions. Also heavily edit delete function.
>
I did a quick review of the updated patch. I'm not a tsvector-expert so
hopefully my comments won't be entirely bogus.
1) It's a bit difficult to judge the usefulness of the API, as I've
always been a mere user of full-text search, and I never had a need
(or courage) to mess with the tsvectors. OTOH I don't see a good
reason not to have such an API, when there's a need for it.
The API seems to be reasonably complete, with one exception - when
looking at the editing functions we have for 'hstore', we do have these
variants for delete()
delete(hstore,text)
delete(hstore,text[])
delete(hstore,hstore)
while this patch only adds delete(tsvector,text). Would it make
sense to add variants similar to hstore? It probably does not make
much sense to add delete(tsvector,tsvector), right? But being able
to delete a bunch of lexemes in one go seems like a good thing.
What do you think?
2) tsvector_op.c needs a bit of love, to eliminate the two warnings it
currently triggers:
tsvector_op.c:211:2: warning: ISO C90 forbids mixed ...
tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but ...
3) the patch also touches tsvector_setweight(), only to change:
elog(ERROR, "unrecognized weight: %d", cw);
to
elog(ERROR, "unrecognized weight: %c", cw);
That should probably get committed separately, as a bugfix.
4) I find it rather annoying that there are pretty much no comments in
the code. Granted, there are pretty much no comments in the
surrounding code, but I doubt that's a good reason for not having
any comments in new code. It makes reviews unnecessarily difficult.
5) tsvector_concat() is not mentioned in docs at all
6) Docs don't mention names of the new parameters in function
signatures, just data types. The functions with a single parameter
probably don't need to do that, but multi-parameter ones should.
7) Some of the functions use an indexterm that does not match the function
name. I see two such cases - to_tsvector and setweight. Is there a
reason for that?
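The difference between the two format specifiers in point 3 can be shown with plain snprintf(). elog() is not available outside the backend, so this standalone sketch only reproduces the message text; the helper name format_weight_error is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the error text with either %d or %c.
 * Returns the number of characters written, as snprintf() does.
 */
static int
format_weight_error(char *buf, size_t buflen, char cw, int as_char)
{
    assert(buf != NULL && buflen > 0);
    if (as_char)
        return snprintf(buf, buflen, "unrecognized weight: %c", cw);
    /* with %d the char is promoted to int and printed as its numeric code */
    return snprintf(buf, buflen, "unrecognized weight: %d", cw);
}
```

On an ASCII system, passing cw = 'x' gives "unrecognized weight: 120" with %d and "unrecognized weight: x" with %c, so only the %c form shows the user the character they actually typed.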
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Tsvector editing functions |
Date: | 2015-12-24 02:55:17 |
Message-ID: | CAB7nPqTvjQUo4VhK9Di4J2j_wrD=2eu=muE=yJt_Eho8NK+Bmw@mail.gmail.com |
I have marked this patch as returned with feedback based on the
presence of a review and a lack of replies from the author. Stas, if
you are still working on the patch, please feel free to move it to the
next commit fest.
--
Michael
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2015-12-30 17:49:34 |
Message-ID: | [email protected] |
Hi, Tomáš! Thanks for the comprehensive review.
> On 15 Dec 2015, at 06:07, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> 1) It's a bit difficult to judge the usefulness of the API, as I've
> always been a mere user of full-text search, and I never had a need
> (or courage) to mess with the tsvectors. OTOH I don't see a good
> reason not to have such an API, when there's a need for it.
>
> The API seems to be reasonably complete, with one exception - when
> looking at editing function we have for 'hstore', we do have these
> variants for delete()
>
> delete(hstore,text)
> delete(hstore,text[])
> delete(hstore,hstore)
>
> while this patch only adds delete(tsvector,text). Would it make
> sense to add variants similar to hstore? It probably does not make
> much sense to add delete(tsvector,tsvector), right? But being able
> to delete a bunch of lexemes in one go seems like a good thing.
>
> What do you think?
That’s a good idea, and actually deleting a tsvector from a tsvector makes perfect sense. In the delete function I used an exact string match between the string and the lexemes in the tsvector, but if somebody wants to delete, for example, “Cats” from a tsvector, then they have to downcase and singularize the word first. The easiest way to do that is to just use the to_tsvector() function. We can also use this function to delete specific positions, like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 fat:4'::tsvector.
So in the attached patch I’ve implemented the following:
delete(tsin tsvector, lexarr text[]) - removes any occurrence of the lexemes in lexarr from tsin
delete(tsin tsvector, tsv_filter tsvector) - deletes the lexemes and/or positions of tsv_filter from tsin. When a lexeme in tsv_filter has no positions, the function deletes any occurrence of the same lexeme in tsin. When a tsv_filter lexeme has positions, the function deletes them from the positions of the matching lexeme in tsin. If the resulting position set is empty after such a removal, the function deletes that lexeme from the resulting tsvector.
Also, if we want some level of completeness in the API, and taking into account that the concat() function shifts the positions of its second argument, I thought it could be useful to also add a function that shifts all positions by a specific value. This helps to undo a concatenation: delete one of the concatenated tsvectors and then shift the positions in the resulting tsvector. So I also wrote one more small function:
shift(tsin tsvector, offset int16) - shifts all positions in tsin by the given offset
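The position-level behaviour described above can be sketched on plain integer arrays. This is only an illustration, not the patch's code: delete_positions and shift_positions are hypothetical names, positions are assumed to be sorted in ascending order, and the rule that a filter lexeme without positions removes the whole lexeme is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Remove from pos[] (sorted ascending, npos entries) every value that also
 * appears in del[] (sorted ascending, ndel entries). Compacts pos[] in place
 * and returns the new count; 0 means the lexeme itself would be dropped.
 */
static size_t
delete_positions(int *pos, size_t npos, const int *del, size_t ndel)
{
    size_t  i;
    size_t  j = 0;
    size_t  kept = 0;

    assert(pos != NULL || npos == 0);
    for (i = 0; i < npos; i++)
    {
        /* advance the filter cursor past values smaller than pos[i] */
        while (j < ndel && del[j] < pos[i])
            j++;
        if (j < ndel && del[j] == pos[i])
            continue;           /* listed in the filter: drop this position */
        pos[kept++] = pos[i];
    }
    return kept;
}

/* Shift every position by a (possibly negative) offset. */
static void
shift_positions(int *pos, size_t npos, int offset)
{
    size_t  i;

    for (i = 0; i < npos; i++)
        pos[i] += offset;
}
```

Applied to the example from the message, the positions {2, 4} of 'fat' filtered by {2} leave {4}, which matches 'cat:3 fat:4'.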
>
>
> 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it
> currently triggers:
>
> tsvector_op.c:211:2: warning: ISO C90 forbids mixed ...
> tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but …
>
fixed
> 3) the patch also touches tsvector_setweight(), only to do change:
>
> elog(ERROR, "unrecognized weight: %d", cw);
>
> to
>
> elog(ERROR, "unrecognized weight: %c", cw);
>
> That should probably go get committed separately, as a bugfix.
>
Okay, I’ll submit that as a separate patch.
>
> 4) I find it rather annoying that there are pretty much no comments in
> the code. Granted, there are pretty much no comments in the
> surrounding code, but I doubt that's a good reason for not having
> any comments in new code. It makes reviews unnecessarily difficult.
>
Fixed, I think.
>
> 5) tsvector_concat() is not mentioned in docs at all
>
Concat is mentioned in the docs as the operator ||.
>
> 6) Docs don't mention names of the new parameters in function
> signatures, just data types. The functions with a single parameter
> probably don't need to do that, but multi-parameter ones should.
>
Fixed.
> 7) Some of the functions use an indexterm that does not match the function
> name. I see two such cases - to_tsvector and setweight. Is there a
> reason for that?
>
Because the SGML compiler wants unique indexterms. Both functions that you mentioned are overloaded on their arguments and so have non-unique names.
Attachment | Content-Type | Size |
---|---|---|
tsvector_ops.diff | application/octet-stream | 48.1 KB |
unknown_filename | text/plain | 95 bytes |
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-01-18 21:21:03 |
Message-ID: | [email protected] |
So, Tomas, Teodor, did you like this new version of the patch?
Stas Kelvich wrote:
> > 7) Some of the functions use an indexterm that does not match the function
> > name. I see two such cases - to_tsvector and setweight. Is there a
> > reason for that?
>
> Because sgml compiler wants unique indexterm. Both functions that you
> mentioned use overloading of arguments and have non-unique name.
This sounds wrong. I think what you should really do is use
<indexterm>
<primary>foo</primary>
<secondary>bar</secondary>
</indexterm>
to distinguish the two entries.
It's a bit funny that you reintroduce the "unrecognized weight: %d"
(instead of %c) in tsvector_setweight_by_filter.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-01-22 16:03:27 |
Message-ID: | [email protected] |
On 12/30/2015 06:49 PM, Stas Kelvich wrote:
> Hi, Tomáš! Thanks for comprehensive review.
>
>> On 15 Dec 2015, at 06:07, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> 1) It's a bit difficult to judge the usefulness of the API, as I've
>> always been a mere user of full-text search, and I never had a need
>> (or courage) to mess with the tsvectors. OTOH I don't see a good
>> reason not to have such an API, when there's a need for it.
>>
>> The API seems to be reasonably complete, with one exception - when
>> looking at editing function we have for 'hstore', we do have these
>> variants for delete()
>>
>> delete(hstore,text)
>> delete(hstore,text[])
>> delete(hstore,hstore)
>>
>> while this patch only adds delete(tsvector,text). Would it make
>> sense to add variants similar to hstore? It probably does not make
>> much sense to add delete(tsvector,tsvector), right? But being able
>> to delete a bunch of lexemes in one go seems like a good thing.
>>
>> What do you think?
>
> That’s a good idea and actually deleting tsvector from tsvector makes perfect sense. In delete function I used exact string match between string and lexemes in tsvector, but if somebody wants to delete for example “Cats” from tsvector, then he should downcase and singularize this word. Easiest way to do it is to just use to_tsvector() function. Also we can use this function to delete specific positions: like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 fat:4'::tsvector.
>
> So in attached patch I’ve implemented following:
>
> delete(tsin tsvector, lexarr text[]) - remove any occurrence of lexemes in lexarr from tsin
OK, although I do recommend using more sensible variable names, e.g. why
not use 'lexemes' instead of 'lexarr'? Similarly for the
other functions.
>
> delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions of tsv_filter from tsin. When lexeme in tsv_filter has no positions function will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme have positions function will delete them from positions of matching lexeme in tsin. If after such removal resulting positions set is empty then function will delete that lexeme from resulting tsvector.
>
I can't really imagine a situation in which I'd need this, but if you do
have a use case for it ... although in the initial paragraph you say
"... but if somebody wants to delete for example ..." which suggests you
may not have such a use case.
Based on bad experience with extending an API based on vague ideas, I
recommend only adding functions for which there is an existing need. It's easy to
add a function later, much more difficult to remove it or change the
signature.
> Also if we want some level of completeness of API and taking into account that concat() function shift positions on second argument I thought that it can be useful to also add function that can shift all positions of specific value. This helps to undo concatenation: delete one of concatenating tsvectors and then shift positions in resulting tsvector. So I also wrote one another small function:
>
> shift(tsin tsvector,offset int16) — Shift all positions in tsin by given offset
That seems rather too low-level. Shouldn't it be really built into
delete() directly somehow?
>
>>
>> 4) I find it rather annoying that there are pretty much no comments in
>> the code. Granted, there are pretty much no comments in the
>> surrounding code, but I doubt that's a good reason for not having
>> any comments in new code. It makes reviews unnecessarily difficult.
>>
>
> Fixed, I think.
Yep, much better now.
>
>>
>> 5) tsvector_concat() is not mentioned in docs at all
>>
>
> Concat mentioned in docs as an operator ||.
Ah, OK.
>
>>
>> 6) Docs don't mention names of the new parameters in function
>> signatures, just data types. The functions with a single parameter
>> probably don't need to do that, but multi-parameter ones should.
>>
>
> Fixed.
OK, but please let's use variable names clearly identifying the meaning.
So not 'w' but 'weight' and so on.
>
>> 7) Some of the functions use an indexterm that does not match the function
>> name. I see two such cases - to_tsvector and setweight. Is there a
>> reason for that?
>>
>
> Because sgml compiler wants unique indexterm. Both functions that
> you mentioned use overloading of arguments and have non-unique name.
As Alvaro pointed out, that should probably be handled by using
<primary> and <secondary> tags.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-01-27 16:39:29 |
Message-ID: | [email protected] |
Hi
> On 22 Jan 2016, at 19:03, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> OK, although I do recommend using more sensible variable names, e.g. why not use 'lexemes' instead of 'lexarr'? Similarly for the other functions.
Changed. With the old names I tried to follow the conventions in the surrounding code, but it is probably a good idea to switch to more meaningful names in new code.
>>
>>
>> delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions of tsv_filter from tsin. When lexeme in tsv_filter has no positions function will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme have positions function will delete them from positions of matching lexeme in tsin. If after such removal resulting positions set is empty then function will delete that lexeme from resulting tsvector.
>>
>
> I can't really imagine situation in which I'd need this, but if you do have a use case for it ... although in the initial paragraph you say "... but if somebody wants to delete for example ..." which suggests you may not have such use case.
>
> Based on bad experience with extending API based on vague ideas, I recommend only really adding functions with existing need. It's easy to add a function later, much more difficult to remove it or change the signature.
I tried to create a more or less self-contained API, e.g. to have the ability to negate the effect of concatenation. But I’ve also asked people around what they think about extending the API, and everybody is convinced that it is better to stick to a smaller API. So let’s drop it. At least those functions exist in the mailing list archive in case somebody googles for this kind of behaviour.
>>
>> Also if we want some level of completeness of API and taking into account that concat() function shift positions on second argument I thought that it can be useful to also add function that can shift all positions of specific value. This helps to undo concatenation: delete one of concatenating tsvectors and then shift positions in resulting tsvector. So I also wrote one another small function:
>>
>> shift(tsin tsvector,offset int16) — Shift all positions in tsin by given offset
>
> That seems rather too low-level. Shouldn't it be really built into delete() directly somehow?
I think that is an ambiguous task for delete. But if we are dropping support for delete(tsvector, tsvector), I don’t see the point in keeping those functions.
>>>
>>> 7) Some of the functions use an indexterm that does not match the function
>>> name. I see two such cases - to_tsvector and setweight. Is there a
>>> reason for that?
>>>
>>
>> Because sgml compiler wants unique indexterm. Both functions that
>> you mentioned use overloading of arguments and have non-unique name.
>
> As Alvaro pointed out, that should probably be handled by using <primary> and <secondary> tags.
Done.
> On 19 Jan 2016, at 00:21, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
>
>
> It's a bit funny that you reintroduce the "unrecognized weight: %d"
> (instead of %c) in tsvector_setweight_by_filter.
>
Ah, I was thinking about moving it to a separate diff and messed up. Fixed, and I am attaching a diff with the same fix for the old tsvector_setweight.
Attachment | Content-Type | Size |
---|---|---|
tsvector_ops-v2.1.diff | application/octet-stream | 36.3 KB |
tsvector_ops-v2.2.diff | application/octet-stream | 448 bytes |
unknown_filename | text/plain | 96 bytes |
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-02-02 17:10:54 |
Message-ID: | [email protected] |
Some notes:
1 tsin in the documentation doesn't look like a good name. Changed to vector, similar
to other places.
2 I did some editing around freeing memory, forgotten names, etc.
3 It seems to me that tsvector_unnest() could be seriously optimized for
large tsvectors: with the current coding it detoasts/decompresses the tsvector value on
each call. It is much better to do it once, in the
multi_call_memory_ctx context, at first-call init.
4 Returning an empty array for positions/weights when they are absent seems debatable:
=# select * from unnest('a:1 b'::tsvector);
 lexeme | positions | weights
--------+-----------+---------
 a      | {1}       | {D}
 b      | {}        | {}
I think it's better to return NULL in this case.
5
array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter
don't check for or pay attention to NULL elements in input arrays.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-02-02 17:17:00 |
Message-ID: | [email protected] |
> Some notices:
>
> 1 tsin in documentation doesn't look like a good name. Changed to vector similar
> to other places.
>
> 2 I did some editorization about freeing memory/forgotten names etc
Oops, forgot to attach the patch.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
Attachment | Content-Type | Size |
---|---|---|
tsvector_ops-v3.1.diff | text/x-patch | 36.5 KB |
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-02-17 15:59:29 |
Message-ID: | [email protected] |
>
> On 02 Feb 2016, at 20:10, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
>
> Some notices:
>
> 1 tsin in documentation doesn't look like a good name. Changed to vector similar to other places.
>
> 2 I did some editorization about freeing memory/forgotten names etc
Thanks.
>
> 3 It seems to me that tsvector_unnest() could be seriously optimized for
> large tsvectors: with current coding it detoasts/decompresses tsvector value on each call. Much better to do it once in
> multi_call_memory_ctx context at first call init
Done, moved detoasting to the first SRF call.
> 4 It seems debatable returning empty array for position/weight if they are absent:
> =# select * from unnest('a:1 b'::tsvector);
> lexeme | positions | weights
> --------+-----------+---------
> a | {1} | {D}
> b | {} | {}
> I think, it's better to return NULL in this case
>
Okay, done.
> 5 array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter doesn't check or pay attention to NULL elements in input arrays
>
Thanks! Fixed and added tests.
Attachment | Content-Type | Size |
---|---|---|
tsvector_ops-v4.diff | application/octet-stream | 37.9 KB |
unknown_filename | text/plain | 93 bytes |
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-03-10 17:29:21 |
Message-ID: | [email protected] |
> Thanks! Fixed and added tests.
Thank you!
I did some patch cleanup/fixes, but I have some doubts about the function names:
1 to_tsvector:
# \df to_tsvector
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+-------------+------------------+---------------------+--------
pg_catalog | to_tsvector | tsvector | regconfig, text | normal
pg_catalog | to_tsvector | tsvector | text | normal
pg_catalog | to_tsvector | tsvector | text[] | normal
The first two variants of to_tsvector do morphological processing; the last one doesn't.
2 to_array
# \df *to_array
                                  List of functions
   Schema   |         Name          | Result data type | Argument data types |  Type
------------+-----------------------+------------------+---------------------+--------
 pg_catalog | regexp_split_to_array | text[]           | text, text          | normal
 pg_catalog | regexp_split_to_array | text[]           | text, text, text    | normal
 pg_catalog | string_to_array       | text[]           | text, text          | normal
 pg_catalog | string_to_array       | text[]           | text, text, text    | normal
 pg_catalog | to_array              | text[]           | tsvector            | normal
It seems to_array is not the right name compared to the other *to_array functions.
I would like to suggest renaming both functions to array_to_tsvector and
tsvector_to_array to have consistent names. Later we could add
to_tsvector([regconfig, ], text[]) with morphological processing.
Thoughts?
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
Attachment | Content-Type | Size |
---|---|---|
tsvector_ops-v5.diff | text/x-patch | 38.0 KB |
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-03-11 13:13:14 |
Message-ID: | [email protected] |
> On 10 Mar 2016, at 20:29, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
>
> I would like to suggest rename both functions to array_to_tsvector and tsvector_to_array to have consistent name. Later we could add to_tsvector([regconfig, ], text[]) with morphological processing.
>
> Thoughts?
>
Seems reasonable, done.
Attachment | Content-Type | Size |
---|---|---|
tsvector_ops-v6.diff | application/octet-stream | 38.1 KB |
unknown_filename | text/plain | 96 bytes |
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-03-11 17:05:32 |
Message-ID: | [email protected] |
>
> On 11 Mar 2016, at 16:13, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> wrote:
>
>
>> On 10 Mar 2016, at 20:29, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
>>
>> I would like to suggest rename both functions to array_to_tsvector and tsvector_to_array to have consistent name. Later we could add to_tsvector([regconfig, ], text[]) with morphological processing.
>>
>> Thoughts?
>>
Hi, thanks for the commit.
I saw errors on Windows; here is the fix:
Attachment | Content-Type | Size |
---|---|---|
tsvector.fix.patch | application/octet-stream | 1.1 KB |
unknown_filename | text/plain | 94 bytes |
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Tsvector editing functions |
Date: | 2016-03-11 17:17:01 |
Message-ID: | [email protected] |
> I saw errors on windows, here is the fix:
Thank you, pushed
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/