Re: unexpected result from to_tsvector - Mailing list pgsql-hackers
From | Artur Zakirov |
---|---|
Subject | Re: unexpected result from to_tsvector |
Date | |
Msg-id | [email protected] Whole thread Raw |
In response to | Re: unexpected result from to_tsvector (Artur Zakirov <[email protected]>) |
List | pgsql-hackers |
I found the discussion about allowing an underscore in emails http://www.postgresql.org/message-id/[email protected] That bug report is about recognizing an underscore in the local part of an email. And is not about recognizing an underscore in a domain name. But that patch allows an underscore in recognized host names also. I am not good in RFC, so I put excerpt from Wikipedia https://en.wikipedia.org/wiki/Email_address: > The local-part of the email address may use any of these ASCII characters: > > Uppercase and lowercase Latin letters (A–Z, a–z) (ASCII: 65–90, 97–122) > Digits 0 to 9 (ASCII: 48–57) > These special characters: !#$%&'*+-/=?^_`{|}~ (ASCII: 33, 35–39, 42, 43, 45, 47, 61, 63, 94–96, 123–126) > Character . (dot, period, full stop), ASCII 46, provided that it is not the first or last character, and provided alsothat it does not appear consecutively (e.g. [email protected] is not allowed). > Other special characters are allowed with restrictions (they are only allowed inside a quoted string, as described in theparagraph below, and in addition, a backslash or double-quote must be preceded by a backslash). These characters are: > Space and "(),:;<>@[\] (ASCII: 32, 34, 40, 41, 44, 58, 59, 60, 62, 64, 91–93) > Comments are allowed with parentheses at either end of the local part; e.g. john.smith(comment)@example.com and (comment)[email protected] both equivalent to [email protected]. and https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names > The Internet standards (Requests for Comments) for protocols mandate that component hostname labels may contain only theASCII letters 'a' through 'z' (in a case-insensitive manner),the digits '0' through '9', and the hyphen ('-'). The originalspecification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and mustnot end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits.No other symbols, punctuation characters, or white space are permitted. Hence the valid emails is (I might be wrong): [email protected] [email protected] [email protected] [email protected] The attached patch allow them to be recognized as a email. But this patch does not prohibit underscore in recognized host names. As a result this patch gives the following results with underscores: =# select * from ts_debug('simple', 'aaa@123_yyy.zzz'); alias | description | token | dictionaries | dictionary | lexemes -------+---------------+-----------------+--------------+------------+------------------- email | Email address | aaa@123_yyy.zzz | {simple} | simple | {aaa@123_yyy.zzz} (1 row) =# select * from ts_debug('simple', '123_yyy.zzz'); alias | description | token | dictionaries | dictionary | lexemes -------+-------------+-------------+--------------+------------+--------------- host | Host | 123_yyy.zzz | {simple} | simple | {123_yyy.zzz} (1 row) On 14.03.2016 17:45, Artur Zakirov wrote: > On 14.03.2016 16:22, Shulgin, Oleksandr wrote: >> >> Hm... now that doesn't look all that consistent to me (after applying >> the patch): >> >> =# select ts_debug('simple', '[email protected]'); >> ts_debug >> --------------------------------------------------------------------------- >> >> (email,"Email >> address",[email protected],{simple},simple,{[email protected]}) >> (1 row) >> >> But: >> >> =# select ts_debug('simple', 'aaa@123_yyy.zzz'); >> ts_debug >> --------------------------------------------------------- >> (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa}) >> (blank,"Space symbols",@,{},,) >> (uint,"Unsigned integer",123,{simple},simple,{123}) >> (blank,"Space symbols",_,{},,) >> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz}) >> (5 rows) >> >> One can also see that if we only keep the domain name, the result is >> similar: >> >> =# select ts_debug('simple', '123-yyy.zzz'); >> ts_debug >> ------------------------------------------------------- >> (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz}) >> (1 row) >> >> =# select ts_debug('simple', '123_yyy.zzz'); >> ts_debug >> ----------------------------------------------------- >> (uint,"Unsigned integer",123,{simple},simple,{123}) >> (blank,"Space symbols",_,{},,) >> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz}) >> (3 rows) >> >> But, this only has to do with 123 being recognized as a number, not with >> the underscore: >> >> =# select ts_debug('simple', 'abc_yyy.zzz'); >> ts_debug >> ------------------------------------------------------- >> (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz}) >> (1 row) >> >> =# select ts_debug('simple', '1abc_yyy.zzz'); >> ts_debug >> ------------------------------------------------------- >> (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz}) >> (1 row) >> >> In fact, the 123-yyy.zzz domain is not valid either according to the RFC >> (subdomain can't start with a digit), but since we already allow it, >> should we not allow 123_yyy.zzz to be recognized as a Host? Then why >> not recognize aaa@123_yyy.zzz as an email address? >> >> Another option is to prohibit underscore in recognized host names, but >> this has more breakage potential IMO. >> >> -- >> Alex >> > > It seems reasonable to me. I like more first option. But I am not > confident that we should allow 123_yyy.zzz to be recognized as a Host. > > By the way, in this question http://webmasters.stackexchange.com/a/775 > you can see examples of domain names with numbers (but not subdomains). > > If there are not objections from others, I will send a new patch today > later or tomorrow with 123_yyy.zzz recognizing. > -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
pgsql-hackers by date: