Commit a652558

Allow Unicode escapes in any server encoding, not only UTF-8.

SQL includes provisions for numeric Unicode escapes in string literals and identifiers. Previously we only accepted those if they represented ASCII characters or the server encoding was UTF-8, making the conversion to internal form trivial. This patch adjusts things so that we'll call the appropriate encoding conversion function in less-trivial cases, allowing the escape sequence to be accepted so long as it corresponds to some character available in the server encoding.

This also applies to processing of Unicode escapes in JSONB. However, the old restriction still applies to client-side JSON processing, since that hasn't got access to the server's encoding conversion infrastructure.

This patch includes some lexer infrastructure that simplifies throwing errors with error cursors pointing into the middle of a string (or other complex token). For the moment I only used it for errors relating to Unicode escapes, but we might later expand the usage to some other cases.

Patch by me, reviewed by John Naylor.

Discussion: https://postgr.es/m/[email protected]

1 parent fe30e7e commit a652558

20 files changed: +612 -226 lines changed

doc/src/sgml/json.sgml

+9 -10
@@ -61,8 +61,8 @@
 </para>

 <para>
-<productname>PostgreSQL</productname> allows only one character set
-encoding per database. It is therefore not possible for the JSON
+RFC 7159 specifies that JSON strings should be encoded in UTF8.
+It is therefore not possible for the JSON
 types to conform rigidly to the JSON specification unless the database
 encoding is UTF8. Attempts to directly include characters that
 cannot be represented in the database encoding will fail; conversely,
@@ -77,13 +77,13 @@
 regardless of the database encoding, and are checked only for syntactic
 correctness (that is, that four hex digits follow <literal>\u</literal>).
 However, the input function for <type>jsonb</type> is stricter: it disallows
-Unicode escapes for non-ASCII characters (those above <literal>U+007F</literal>)
-unless the database encoding is UTF8. The <type>jsonb</type> type also
+Unicode escapes for characters that cannot be represented in the database
+encoding. The <type>jsonb</type> type also
 rejects <literal>\u0000</literal> (because that cannot be represented in
 <productname>PostgreSQL</productname>'s <type>text</type> type), and it insists
 that any use of Unicode surrogate pairs to designate characters outside
 the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes
-are converted to the equivalent ASCII or UTF8 character for storage;
+are converted to the equivalent single character for storage;
 this includes folding surrogate pairs into a single character.
 </para>

@@ -96,9 +96,8 @@
 not <type>jsonb</type>. The fact that the <type>json</type> input function does
 not make these checks may be considered a historical artifact, although
 it does allow for simple storage (without processing) of JSON Unicode
-escapes in a non-UTF8 database encoding. In general, it is best to
-avoid mixing Unicode escapes in JSON with a non-UTF8 database encoding,
-if possible.
+escapes in a database encoding that does not support the represented
+characters.
 </para>
 </note>

@@ -144,8 +143,8 @@
 <row>
 <entry><type>string</type></entry>
 <entry><type>text</type></entry>
-<entry><literal>\u0000</literal> is disallowed, as are non-ASCII Unicode
-escapes if database encoding is not UTF8</entry>
+<entry><literal>\u0000</literal> is disallowed, as are Unicode escapes
+representing characters not available in the database encoding</entry>
 </row>
 <row>
 <entry><type>number</type></entry>

doc/src/sgml/syntax.sgml

+46 -54
@@ -189,6 +189,23 @@ UPDATE "my_table" SET "a" = 5;
 ampersands. The length limitation still applies.
 </para>

+<para>
+Quoting an identifier also makes it case-sensitive, whereas
+unquoted names are always folded to lower case. For example, the
+identifiers <literal>FOO</literal>, <literal>foo</literal>, and
+<literal>"foo"</literal> are considered the same by
+<productname>PostgreSQL</productname>, but
+<literal>"Foo"</literal> and <literal>"FOO"</literal> are
+different from these three and each other. (The folding of
+unquoted names to lower case in <productname>PostgreSQL</productname> is
+incompatible with the SQL standard, which says that unquoted names
+should be folded to upper case. Thus, <literal>foo</literal>
+should be equivalent to <literal>"FOO"</literal> not
+<literal>"foo"</literal> according to the standard. If you want
+to write portable applications you are advised to always quote a
+particular name or never quote it.)
+</para>
+
 <indexterm>
 <primary>Unicode escape</primary>
 <secondary>in identifiers</secondary>
@@ -230,7 +247,8 @@ U&amp;"d!0061t!+000061" UESCAPE '!'
 The escape character can be any single character other than a
 hexadecimal digit, the plus sign, a single quote, a double quote,
 or a whitespace character. Note that the escape character is
-written in single quotes, not double quotes.
+written in single quotes, not double quotes,
+after <literal>UESCAPE</literal>.
 </para>

 <para>
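
The rule for the UESCAPE character in the hunk above can be sketched in C. This is only an illustration of the documented rule, not the parser's own code: the name `is_valid_uescape_char` is hypothetical, and the standard `isspace()` is assumed to be a close enough stand-in for the scanner's notion of whitespace.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if c could serve as the UESCAPE
 * character, following the documented rule (not a hex digit, '+',
 * single quote, double quote, or whitespace).
 */
static bool
is_valid_uescape_char(unsigned char c)
{
    if (isxdigit(c) || c == '+' || c == '\'' || c == '"' || isspace(c))
        return false;
    return true;
}
```

Under this rule `'!'` and the default backslash are acceptable, while `'a'` is rejected because it is a hexadecimal digit.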
@@ -239,32 +257,18 @@ U&amp;"d!0061t!+000061" UESCAPE '!'
 </para>

 <para>
-The Unicode escape syntax works only when the server encoding is
-<literal>UTF8</literal>. When other server encodings are used, only code
-points in the ASCII range (up to <literal>\007F</literal>) can be
-specified. Both the 4-digit and the 6-digit form can be used to
+Either the 4-digit or the 6-digit escape form can be used to
 specify UTF-16 surrogate pairs to compose characters with code
 points larger than U+FFFF, although the availability of the
 6-digit form technically makes this unnecessary. (Surrogate
-pairs are not stored directly, but combined into a single
-code point that is then encoded in UTF-8.)
+pairs are not stored directly, but are combined into a single
+code point.)
 </para>

 <para>
-Quoting an identifier also makes it case-sensitive, whereas
-unquoted names are always folded to lower case. For example, the
-identifiers <literal>FOO</literal>, <literal>foo</literal>, and
-<literal>"foo"</literal> are considered the same by
-<productname>PostgreSQL</productname>, but
-<literal>"Foo"</literal> and <literal>"FOO"</literal> are
-different from these three and each other. (The folding of
-unquoted names to lower case in <productname>PostgreSQL</productname> is
-incompatible with the SQL standard, which says that unquoted names
-should be folded to upper case. Thus, <literal>foo</literal>
-should be equivalent to <literal>"FOO"</literal> not
-<literal>"foo"</literal> according to the standard. If you want
-to write portable applications you are advised to always quote a
-particular name or never quote it.)
+If the server encoding is not UTF-8, the Unicode code point identified
+by one of these escape sequences is converted to the actual server
+encoding; an error is reported if that's not possible.
 </para>
 </sect2>

@@ -427,25 +431,11 @@ SELECT 'foo' 'bar';
 <para>
 It is your responsibility that the byte sequences you create,
 especially when using the octal or hexadecimal escapes, compose
-valid characters in the server character set encoding. When the
-server encoding is UTF-8, then the Unicode escapes or the
+valid characters in the server character set encoding.
+A useful alternative is to use Unicode escapes or the
 alternative Unicode escape syntax, explained
-in <xref linkend="sql-syntax-strings-uescape"/>, should be used
-instead. (The alternative would be doing the UTF-8 encoding by
-hand and writing out the bytes, which would be very cumbersome.)
-</para>
-
-<para>
-The Unicode escape syntax works fully only when the server
-encoding is <literal>UTF8</literal>. When other server encodings are
-used, only code points in the ASCII range (up
-to <literal>\u007F</literal>) can be specified. Both the 4-digit and
-the 8-digit form can be used to specify UTF-16 surrogate pairs to
-compose characters with code points larger than U+FFFF, although
-the availability of the 8-digit form technically makes this
-unnecessary. (When surrogate pairs are used when the server
-encoding is <literal>UTF8</literal>, they are first combined into a
-single code point that is then encoded in UTF-8.)
+in <xref linkend="sql-syntax-strings-uescape"/>; then the server
+will check that the character conversion is possible.
 </para>

 <caution>
@@ -524,16 +514,23 @@ U&amp;'d!0061t!+000061' UESCAPE '!'
 </para>

 <para>
-The Unicode escape syntax works only when the server encoding is
-<literal>UTF8</literal>. When other server encodings are used, only
-code points in the ASCII range (up to <literal>\007F</literal>)
-can be specified. Both the 4-digit and the 6-digit form can be
-used to specify UTF-16 surrogate pairs to compose characters with
-code points larger than U+FFFF, although the availability of the
-6-digit form technically makes this unnecessary. (When surrogate
-pairs are used when the server encoding is <literal>UTF8</literal>, they
-are first combined into a single code point that is then encoded
-in UTF-8.)
+To include the escape character in the string literally, write
+it twice.
+</para>
+
+<para>
+Either the 4-digit or the 6-digit escape form can be used to
+specify UTF-16 surrogate pairs to compose characters with code
+points larger than U+FFFF, although the availability of the
+6-digit form technically makes this unnecessary. (Surrogate
+pairs are not stored directly, but are combined into a single
+code point.)
+</para>
+
+<para>
+If the server encoding is not UTF-8, the Unicode code point identified
+by one of these escape sequences is converted to the actual server
+encoding; an error is reported if that's not possible.
 </para>

 <para>
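
To make the new "converted to the actual server encoding; an error is reported if that's not possible" wording concrete, here is a minimal sketch for a single encoding. It assumes a LATIN1 (ISO 8859-1) server encoding, where code points U+0001 through U+00FF map to the identical byte value. The function `codepoint_to_latin1` is hypothetical and much simpler than the server's real conversion machinery, which handles every supported encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a server-encoding conversion, LATIN1 only.
 * Writes a NUL-terminated one-byte string into out (at least 2 bytes)
 * and returns true, or returns false when the code point has no
 * LATIN1 equivalent (the caller would report an error in that case).
 */
static bool
codepoint_to_latin1(uint32_t cp, unsigned char *out)
{
    /* U+0000 is never accepted; anything above U+00FF is not in LATIN1 */
    if (cp == 0 || cp > 0xFF)
        return false;

    out[0] = (unsigned char) cp;
    out[1] = '\0';
    return true;
}
```

With this sketch, U+00E9 (e with acute accent) converts to the byte 0xE9, while U+20AC (the euro sign) is rejected because LATIN1 has no such character.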
@@ -546,11 +543,6 @@ U&amp;'d!0061t!+000061' UESCAPE '!'
 parameter is set to off, this syntax will be rejected with an
 error message.
 </para>
-
-<para>
-To include the escape character in the string literally, write it
-twice.
-</para>
 </sect3>

 <sect3 id="sql-syntax-dollar-quoting">

src/backend/parser/parser.c

+40 -34
@@ -292,22 +292,14 @@ hexval(unsigned char c)
     return 0; /* not reached */
 }

-/* is Unicode code point acceptable in database's encoding? */
+/* is Unicode code point acceptable? */
 static void
-check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
+check_unicode_value(pg_wchar c)
 {
-    /* See also addunicode() in scan.l */
-    if (c == 0 || c > 0x10FFFF)
+    if (!is_valid_unicode_codepoint(c))
         ereport(ERROR,
                 (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg("invalid Unicode escape value"),
-                 scanner_errposition(pos, yyscanner)));
-
-    if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
-        ereport(ERROR,
-                (errcode(ERRCODE_SYNTAX_ERROR),
-                 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
-                 scanner_errposition(pos, yyscanner)));
+                 errmsg("invalid Unicode escape value")));
 }

 /* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
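
The body of `is_valid_unicode_codepoint()` is not part of this hunk. Assuming it keeps the range test from the removed lines (`c == 0 || c > 0x10FFFF` meaning invalid), an equivalent standalone check might look like the sketch below. The name `codepoint_in_range` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical equivalent of the range test in the removed lines:
 * a code point is acceptable if it is nonzero and no larger than
 * U+10FFFF, the top of the Unicode code space.
 */
static bool
codepoint_in_range(uint32_t c)
{
    return c > 0 && c <= 0x10FFFF;
}
```

Whether the character exists in the server encoding is no longer decided here; after this change that question is left to the conversion step.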
@@ -338,20 +330,39 @@ str_udeescape(const char *str, char escape,
     const char *in;
     char       *new,
                *out;
+    size_t      new_len;
     pg_wchar    pair_first = 0;
+    ScannerCallbackState scbstate;

     /*
-     * This relies on the subtle assumption that a UTF-8 expansion cannot be
-     * longer than its escaped representation.
+     * Guesstimate that result will be no longer than input, but allow enough
+     * padding for Unicode conversion.
      */
-    new = palloc(strlen(str) + 1);
+    new_len = strlen(str) + MAX_UNICODE_EQUIVALENT_STRING + 1;
+    new = palloc(new_len);

     in = str;
     out = new;
     while (*in)
     {
+        /* Enlarge string if needed */
+        size_t      out_dist = out - new;
+
+        if (out_dist > new_len - (MAX_UNICODE_EQUIVALENT_STRING + 1))
+        {
+            new_len *= 2;
+            new = repalloc(new, new_len);
+            out = new + out_dist;
+        }
+
         if (in[0] == escape)
         {
+            /*
+             * Any errors reported while processing this escape sequence will
+             * have an error cursor pointing at the escape.
+             */
+            setup_scanner_errposition_callback(&scbstate, yyscanner,
+                                               in - str + position + 3); /* 3 for U&" */
             if (in[1] == escape)
             {
                 if (pair_first)
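
The buffer handling added in this hunk (start with the input length plus padding, then double whenever fewer than one maximum-size character plus a terminator remain) can be tried outside the backend with a small sketch. It uses `malloc`/`realloc` in place of `palloc`/`repalloc`; `PAD` is a hypothetical stand-in for `MAX_UNICODE_EQUIVALENT_STRING`, with the value 16 assumed purely for illustration; and `expand_stars` is an invented toy that replaces each `*` with an expansion string instead of decoding escapes.

```c
#include <stdlib.h>
#include <string.h>

#define PAD 16  /* hypothetical stand-in for MAX_UNICODE_EQUIVALENT_STRING */

/*
 * Toy illustration: copy str, replacing every '*' with expansion
 * (which must be at most PAD bytes long).  Returns a malloc'd string,
 * or NULL on allocation failure or an over-long expansion.
 */
static char *
expand_stars(const char *str, const char *expansion)
{
    size_t      exp_len = strlen(expansion);
    size_t      buf_len = strlen(str) + PAD + 1;
    char       *buf;
    char       *out;

    if (exp_len > PAD)
        return NULL;
    buf = malloc(buf_len);
    if (buf == NULL)
        return NULL;
    out = buf;

    for (const char *in = str; *in; in++)
    {
        /* Enlarge if fewer than PAD + 1 bytes remain */
        size_t      out_dist = (size_t) (out - buf);

        if (out_dist > buf_len - (PAD + 1))
        {
            char       *tmp;

            buf_len *= 2;
            tmp = realloc(buf, buf_len);
            if (tmp == NULL)
            {
                free(buf);
                return NULL;
            }
            buf = tmp;
            out = buf + out_dist;   /* re-derive pointer after the move */
        }

        if (*in == '*')
        {
            memcpy(out, expansion, exp_len);
            out += exp_len;
        }
        else
            *out++ = *in;
    }
    *out = '\0';
    return buf;
}
```

Saving `out_dist` before reallocating matters because the block may move. The check at the top of each iteration guarantees room for one worst-case character plus the terminating NUL.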
@@ -370,9 +381,7 @@ str_udeescape(const char *str, char escape,
                     (hexval(in[2]) << 8) +
                     (hexval(in[3]) << 4) +
                     hexval(in[4]);
-                check_unicode_value(unicode,
-                                    in - str + position + 3, /* 3 for U&" */
-                                    yyscanner);
+                check_unicode_value(unicode);
                 if (pair_first)
                 {
                     if (is_utf16_surrogate_second(unicode))
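
The context lines above assemble a code point from hexadecimal digits with shifts. A compact, loop-based sketch of the same idea follows. The names `hex_digit_value` and `parse_hex_escape` are hypothetical, and unlike the parser code this version validates the digits itself rather than relying on earlier `isxdigit()` checks.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: value of one hex digit (caller ensures validity). */
static uint32_t
hex_digit_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return (uint32_t) (c - '0');
    if (c >= 'a' && c <= 'f')
        return (uint32_t) (c - 'a' + 10);
    return (uint32_t) (c - 'A' + 10);
}

/*
 * Hypothetical helper: read exactly ndigits hex digits from s (4 for the
 * \XXXX form, 6 for the \+XXXXXX form) into *cp.  Returns false if any
 * of them is not a hex digit, including a premature end of string.
 */
static bool
parse_hex_escape(const char *s, int ndigits, uint32_t *cp)
{
    uint32_t    value = 0;

    for (int i = 0; i < ndigits; i++)
    {
        unsigned char c = (unsigned char) s[i];

        if (!isxdigit(c))
            return false;
        value = (value << 4) + hex_digit_value(c);
    }
    *cp = value;
    return true;
}
```

For the doc example `d!0061t!+000061`, both `0061` and `000061` decode to U+0061, the letter `a`.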
@@ -390,8 +399,8 @@ str_udeescape(const char *str, char escape,
                     pair_first = unicode;
                 else
                 {
-                    unicode_to_utf8(unicode, (unsigned char *) out);
-                    out += pg_mblen(out);
+                    pg_unicode_to_server(unicode, (unsigned char *) out);
+                    out += strlen(out);
                 }
                 in += 5;
             }
@@ -411,9 +420,7 @@ str_udeescape(const char *str, char escape,
                     (hexval(in[5]) << 8) +
                     (hexval(in[6]) << 4) +
                     hexval(in[7]);
-                check_unicode_value(unicode,
-                                    in - str + position + 3, /* 3 for U&" */
-                                    yyscanner);
+                check_unicode_value(unicode);
                 if (pair_first)
                 {
                     if (is_utf16_surrogate_second(unicode))
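
The `pair_first` logic here combines a UTF-16 surrogate pair into one code point. The arithmetic is fixed by the UTF-16 definition and can be sketched independently of the parser. The helper names below are hypothetical and are not the ones used in the PostgreSQL source.

```c
#include <stdbool.h>
#include <stdint.h>

/* High (first) surrogates occupy U+D800..U+DBFF. */
static bool
is_high_surrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Low (second) surrogates occupy U+DC00..U+DFFF. */
static bool
is_low_surrogate(uint32_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/*
 * Combine a valid high/low pair: each half carries 10 bits, and the
 * result is offset by 0x10000, the first code point beyond the BMP.
 */
static uint32_t
combine_surrogates(uint32_t high, uint32_t low)
{
    return ((high & 0x3FF) << 10) + 0x10000 + (low & 0x3FF);
}
```

For example, the pair D83D followed by DE00 combines to U+1F600, which is the same character the 6-digit form `\+01F600` names directly.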
@@ -431,17 +438,18 @@ str_udeescape(const char *str, char escape,
                     pair_first = unicode;
                 else
                 {
-                    unicode_to_utf8(unicode, (unsigned char *) out);
-                    out += pg_mblen(out);
+                    pg_unicode_to_server(unicode, (unsigned char *) out);
+                    out += strlen(out);
                 }
                 in += 8;
             }
             else
                 ereport(ERROR,
                         (errcode(ERRCODE_SYNTAX_ERROR),
-                         errmsg("invalid Unicode escape value"),
-                         scanner_errposition(in - str + position + 3, /* 3 for U&" */
-                                             yyscanner)));
+                         errmsg("invalid Unicode escape"),
+                         errhint("Unicode escapes must be \\XXXX or \\+XXXXXX.")));
+
+            cancel_scanner_errposition_callback(&scbstate);
         }
         else
         {
@@ -457,15 +465,13 @@ str_udeescape(const char *str, char escape,
         goto invalid_pair;

     *out = '\0';
+    return new;

     /*
-     * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-     * codes; but it's probably not worth the trouble, since this isn't likely
-     * to be a performance-critical path.
+     * We might get here with the error callback active, or not. Call
+     * scanner_errposition to make sure an error cursor appears; if the
+     * callback is active, this is duplicative but harmless.
      */
-    pg_verifymbstr(new, out - new, false);
-    return new;
-
 invalid_pair:
     ereport(ERROR,
             (errcode(ERRCODE_SYNTAX_ERROR),
