Preface: This document merely records some decisions about minor and not-so-minor details of the JSON datatype. It establishes the direction I would like to pursue with this implementation, but should not be interpreted as inflexible.
I believe the things that make a JSON datatype actually useful are:
* Ability to encode/decode values
* Consistency through JSON validation
* Ease of looking at JSON (json_stringify)
* Compression advantage (even if it's just space removal)
The core traits of this particular implementation are/will be as follows:
* JSON is a TEXT-like datatype. Value wrapping and extraction require explicit use of the functions `from_json` and `to_json`.
* The JSON datatype allows top-level scalar values (number, string, true, false, null). `"hello"` is technically not a JSON document according to RFC 4627. However, allowing scalar toplevels tends to be more useful than not allowing them (e.g. `select json_path('[1,2,3]', '$[*]');` ).
* The datatype's on-disk format is JSON-formatted text, in the server encoding. Although a binary encoding could theoretically be more size-efficient, I believe an optimized text-based representation is pretty darn size-efficient too. It's also easier to implement :-)
* Binary send/recv will not be implemented. There is no standard binary representation for JSON at this time (BSON is *not* one-to-one with JSON; "Binary JSON" is a misnomer).
* The text representation will be optimized, not preserved verbatim (as it currently is). For example, `[ "\u006A" ]` will become `["j"]`. I believe that users generally care a lot more about the content than the formatting of JSON, and that any users interested in preserving the formatting can just use TEXT.
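This normalization is essentially what a round trip through a parser and a compact serializer produces. A sketch using Python's json module (illustrative only; the actual implementation is C, and `normalize_json` is a made-up name):

```python
import json

def normalize_json(text):
    """Parse and re-serialize JSON text, discarding formatting.

    Unnecessary escapes are condensed (e.g. \\u006A -> j) and
    insignificant whitespace is dropped.
    """
    value = json.loads(text)
    # ensure_ascii=False keeps non-ASCII characters unescaped,
    # mirroring the behavior when the server encoding is UTF-8.
    return json.dumps(value, ensure_ascii=False, separators=(",", ":"))

print(normalize_json('[ "\\u006A" ]'))  # ["j"]
```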
In a nutshell, character set handling follows two principles:
* Escapes are converted to characters to save space when it is possible and efficient to do so.
* Characters are escaped as necessary to prevent encoding conversion errors.
The JSON datatype behaves ideally with respect to encodings when both the client and server encodings are UTF-8. When the client encoding is not UTF-8, not SQL_ASCII, and not the same as the server encoding, a performance penalty is incurred to prevent encoding conversion errors.
More specifically:
* On input:
- All ASCII escapes are unescaped, except for `"`, `\`, and control characters.
- If and only if the server encoding is UTF-8, escapes above the ASCII range are unescaped. For example, `"\u266b"` is condensed to `"♫"`.
- For other server encodings, no escapes above the ASCII range are unescaped. Finding out which codepoints are representable in the server encoding (and thus safe to unescape) would be really expensive, as PostgreSQL doesn't seem to have a fast path for it. Also, note that client characters not encodable on the server will cause an error during transcoding, meaning the datatype doesn't have to worry about them.
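The input-side policy above can be sketched in Python (illustrative only; the real code is C inside the type's input function, and `server_is_utf8` stands in for a check of the actual server encoding; surrogate pairs and a preceding escaped backslash are ignored for brevity):

```python
import re

def condense_escapes(json_text, server_is_utf8):
    def replace(m):
        cp = int(m.group(1), 16)
        ch = chr(cp)
        if ch in '"\\' or cp < 0x20:
            return m.group(0)   # ", \, and control characters stay escaped
        if cp < 0x80:
            return ch           # other ASCII escapes are always unescaped
        # Escapes above ASCII are condensed only under a UTF-8 server encoding.
        return ch if server_is_utf8 else m.group(0)
    return re.sub(r'\\u([0-9a-fA-F]{4})', replace, json_text)

print(condense_escapes(r'"\u266b \u006A"', server_is_utf8=True))   # "♫ j"
print(condense_escapes(r'"\u266b \u006A"', server_is_utf8=False))  # "\u266b j"
```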
* On output:
- If any of the following hold, no escaping is done:
- The client encoding is UTF-8. Escaping is not necessary because the client can encode all Unicode codepoints.
- The client encoding and the server encoding are the same. Escaping is not necessary because the client can encode all codepoints the server can encode.
- The server encoding is SQL_ASCII. This encoding tells PostgreSQL to shirk transcoding in favor of speed. Escapes above ASCII were not unescaped on input, so there is nothing to escape on output.
- The client encoding is SQL_ASCII. This encoding tells PostgreSQL to not perform encoding conversion.
- Otherwise, all non-ASCII characters are escaped (no matter how expensive it is).
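That fallback path amounts to the following, sketched in Python (the actual implementation is C; `escape_non_ascii` is an illustrative name, and codepoints above U+FFFF, which need surrogate pairs, are omitted for brevity):

```python
def escape_non_ascii(json_text):
    # Escape every non-ASCII character as \uXXXX so the result
    # survives conversion to any client encoding.
    out = []
    for ch in json_text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.append('\\u%04x' % ord(ch))
    return ''.join(out)

print(escape_non_ascii('["♫"]'))  # ["\u266b"]
```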
* When encoding a string to JSON on the server, as in:
SELECT to_json(E'\u266b');
Only `"`, `\`, and control characters are escaped.
* When decoding a JSON string on the server, and a non-ASCII escape occurs, as in:
SELECT from_json($$ "\u266b" $$);
It has to be unescaped, of course. If the server encoding lacks the codepoint (including if the server encoding is SQL_ASCII), an error is thrown. As far as I know, PostgreSQL does not provide a fast path for converting individual codepoints to/from non-UTF-8 encodings, so string extraction will be slower if the server encoding is not UTF-8 or SQL_ASCII.
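Python's codec machinery, standing in for PostgreSQL's encoding conversion, illustrates the failure mode (a stand-alone sketch, not the datatype's code):

```python
# "\u266b" unescapes to ♫ (U+266B). A UTF-8 server encoding can
# represent it, but e.g. LATIN1 cannot, which corresponds to the
# error the datatype raises during extraction.
musical_note = '\u266b'
print(musical_note.encode('utf-8'))   # representable under UTF-8
try:
    musical_note.encode('latin-1')
except UnicodeEncodeError as exc:
    print('not representable in LATIN1:', exc)
```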