Re: Fuzzy substring searching with the pg_trgm extension - Mailing list pgsql-hackers
From | Artur Zakirov |
---|---|
Subject | Re: Fuzzy substring searching with the pg_trgm extension |
Date | |
Msg-id | [email protected] Whole thread Raw |
In response to | Re: Fuzzy substring searching with the pg_trgm extension (Alvaro Herrera <[email protected]>) |
Responses |
Re: Fuzzy substring searching with the pg_trgm extension
|
List | pgsql-hackers |
I have did some changes to the patch: - changed version from 1.3 to 1.2 - added check_only parameter to the function calc_substring_similarity(). If check_only == true then a calc_substring_similarity() calculates similarity which greater than pg_trgm.substring_limit variable. This behavior is used in the substring_similarity_op() and it should increase performance of the operator. In the substring_similarity() function check_only == false. On 12.01.2016 02:31, Alvaro Herrera wrote: > I gave a quick look through the patch and noticed a few minor things > while trying to understand it. > > I think the test corpus isn't particularly interesting for how big it > is. I'd rather have (a) a small corpus (say 100 words) with which to do > detailed regression testing, and (b) some larger document for more > extensive testing. I'm not sure (b) is actually necessary. > > Overall I think the new functions could stand a lot more commentary. I added some comments to new functions iterate_substring_similarity() and calc_substring_similarity(). >> >> + #ifdef DIVUNION >> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) ((len1) + (len2) - (count))) >> + #else >> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) (((len1) > (len2)) ? (len1) : (len2))) >> + #endif > > This macro has a multiple evaluation gotcha. Maybe we don't really care > because of the limited applicability but I'd put a comment on top. I added a comment to explain various similarity formula. > > BTW you don't need a ternary op just to say "? true : false" -- just > make it > res = foo == bar; > which immediately assigns the correct boolean value. (This pattern > occurs in many places both in the original and in your submitted code. > I see no reason to perpetuate it.) > Fixed all ternary operators. > >> + Datum >> + set_substring_limit(PG_FUNCTION_ARGS) >> + { >> + float4 nlimit = PG_GETARG_FLOAT4(0); >> + >> + if (nlimit < 0 || nlimit > 1.0) >> + elog(ERROR, "wrong limit, should be between 0 and 1"); > > Should be an ereport(). Fixed. But an elog() function occurs in trgm_gin.c and trgm_gist.c also. I'm not sure that it should be replaced with an ereport(). > > >> ! static void >> ! protect_out_of_mem(int slen) >> ! { >> ! /* >> ! * Guard against possible overflow in the palloc requests below. (We >> ! * don't worry about the additive constants, since palloc can detect >> ! * requests that are a little above MaxAllocSize --- we just need to >> ! * prevent integer overflow in the multiplications.) >> ! */ >> ! if ((Size) (slen / 2) >= (MaxAllocSize / (sizeof(trgm) * 3)) || >> ! (Size) slen >= (MaxAllocSize / pg_database_encoding_max_length())) >> ! ereport(ERROR, >> ! (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), >> ! errmsg("out of memory"))); >> ! } > > I think the comment should now be above the function, not inside it. Fixed. >> >> + /* Trigram with position */ >> + typedef struct >> + { >> + trgm t; >> + int i; >> + } ptrgm; > > This struct definition could stand a better name, and better member > names. Also, our policy (not always enforced) is to have these near the > top of the file rather than in the middle. If the struct is used in > many places then IMO it's better that it go at the top; if it's used > only in one function I'd say it's okay to have it just after that > function. Renamed this structure to pos_trgm and move it near the top since this structure is used in a couple of functions. I renamed member names to trg and index accordingly. > >> + /* >> + * Make array of positional trigrams from two trigram arrays. The first array >> + * is required substing which positions don't matter and replaced with -1. >> + * The second array is haystack where we search and have to store its >> + * positions. >> + */ >> + static ptrgm * >> + make_positional_trgm(trgm *trg1, int len1, trgm *trg2, int len2) >> + { >> + ptrgm *result; >> + int i, len = len1 + len2; >> + >> + result = (ptrgm *) palloc(sizeof(ptrgm) * len); >> + >> + for (i = 0; i < len1; i++) >> + { >> + memcpy(&result[i].t, &trg1[i], sizeof(trgm)); >> + result[i].i = -1; >> + } >> + >> + for (i = 0; i < len2; i++) >> + { >> + memcpy(&result[i + len1].t, &trg2[i], sizeof(trgm)); >> + result[i + len1].i = i; >> + } >> + >> + return result; >> + } > > So the output trigram array is the concatenation of both input trigrams? > That's not quite evident from the comment above the function ... I think > some rephrasing there is in order. Rephrased the comment. >> >> ! /* if inexact then len2 is equal to count */ >> ! if (inexact) >> ! return CALCSML(count, len1, count); >> ! else >> ! return CALCSML(count, len1, len2); > > This is where a ternary op would make better code. > CALCSML(count, len1, inexact ? count : len2); > but a better comment would actually explain why we do this, rather than > just stating we do. Fixed a code and a comment. > > >> + <row> >> + <entry><function>show_substring_limit()</function><indexterm><primary>show_substring_limit</primary></indexterm></entry> >> + <entry><type>real</type></entry> >> + <entry> >> + Returns the current similarity threshold used by the <literal><%</> >> + operator. This sets the minimum substring similarity between >> + two phrases. >> + </entry> >> + </row> >> + <row> >> + <entry><function>set_substring_limit(real)</function><indexterm><primary>set_substring_limit</primary></indexterm></entry> >> + <entry><type>real</type></entry> >> + <entry> >> + Sets the current substring similarity threshold that is used by >> + the <literal><%</> operator. The threshold must be between >> + 0 and 1 (default is 0.6). Returns the same value passed in. >> + </entry> > > I don't quite understand why aren't we using a custom GUC variable here. > These already have SHOW and SET support ... > Added GUC variables: - pg_trgm.limit - pg_trgm.substring_limit I added this variables to the documentation. show_limit() and set_limit() functions work correctly and they are marked as deprecated. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
pgsql-hackers by date: