author | Joey Adams | 2011-03-18 15:53:16 +0000 |
---|---|---|
committer | Joey Adams | 2011-03-18 15:53:16 +0000 |
commit | 59b46ebff2133b363ae3b174714726cf1c8640a4 (patch) | |
tree | c71a46597ca521e8e0db6d84a8f7eebee540252e | |
parent | 5228d8477bbcb74bba287b94ac043a04a7c9b77b (diff) |
roadmap.markdown: Summarized Unicode handling, and made minor corrections.
-rw-r--r-- | roadmap.markdown | 12 |
1 files changed, 10 insertions, 2 deletions
diff --git a/roadmap.markdown b/roadmap.markdown
index a416533..aea302a 100644
--- a/roadmap.markdown
+++ b/roadmap.markdown
@@ -15,7 +15,14 @@ The core traits of this particular implementation is/will be as follows:
  * Binary send/recv will not be implemented. There is no standard binary representation for JSON at this time (BSON is *not* one-to-one with JSON—Binary JSON is a misnomer).
  * The text representation will be optimized, not preserved verbatim (as it currently is). For example, `[ "\u006A" ]` will become `["j"]`. I believe that users generally care a lot more about the content than the formatting of JSON, and that any users interested in preserving the formatting can just use TEXT.
 
-Unicode will be handled as follows:
+In a nutshell, character set handling follows two principles:
+
+ * Escapes are converted to characters to save space when it is possible and efficient to do so.
+ * Characters are escaped as necessary to prevent encoding conversion errors.
+
+The JSON datatype behaves ideally with respect to encodings when both the client and server encodings are UTF-8. When the client encoding is not UTF-8, SQL_ASCII, nor the same as the server encoding, a performance penalty is incurred to prevent encoding conversion errors.
+
+More specifically:
 
  * On input:
 
@@ -28,6 +35,7 @@ Unicode will be handled as follows:
     - The client encoding is UTF-8. Escaping is not necessary because the client can encode all Unicode codepoints.
     - The client encoding and the server encoding are the same. Escaping is not necessary because the client can encode all codepoints the server can encode.
     - The server encoding is SQL_ASCII. This encoding tells PostgreSQL to shirk transcoding in favor of speed. It wasn't unescaped on input, so don't worry about escaping on output.
+    - The client encoding is SQL_ASCII. This encoding tells PostgreSQL to not perform encoding conversion.
     - Otherwise, (no matter how expensive it is) all non-ASCII characters are escaped.
 
  * When encoding a string to JSON on the server, as in:
@@ -40,4 +48,4 @@ Unicode will be handled as follows:
 
         SELECT from_json($$ "\u266b" $$);
 
-    It has to be unescaped, of course. If the server encoding lacks the codepoint (including if the server encoding is SQL_ASCII), an error is thrown. As far as I know, PostgreSQL does not provide a fast path for converting individual codepoints to/from non-UTF-8 encodings, so string extraction will be a little slower if the server encoding is not UTF-8 or SQL_ASCII.
+    It has to be unescaped, of course. If the server encoding lacks the codepoint (including if the server encoding is SQL_ASCII), an error is thrown. As far as I know, PostgreSQL does not provide a fast path for converting individual codepoints to/from non-UTF-8 encodings, so string extraction will be slower if the server encoding is not UTF-8 or SQL_ASCII.
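
The output-side rule described in the second hunk can be sketched roughly as below. This is only an illustration in Python, not the extension's actual C code: the function names `needs_output_escaping` and `escape_non_ascii` are hypothetical, encoding names are compared as plain strings (treating `UTF8` and `UTF-8` as the same), and the input is assumed to be already-valid JSON text, where non-ASCII characters can only occur inside string literals.

```python
def _norm(encoding: str) -> str:
    """Normalize an encoding name for comparison (assumption: names are
    case-insensitive and 'UTF-8' / 'UTF8' mean the same thing)."""
    return encoding.upper().replace("-", "")


def needs_output_escaping(client_encoding: str, server_encoding: str) -> bool:
    """Return True when non-ASCII characters should be escaped on output.

    Mirrors the four exceptions listed in the roadmap; anything else
    falls through to 'escape everything non-ASCII'.
    """
    client = _norm(client_encoding)
    server = _norm(server_encoding)
    if client == "UTF8":        # client can encode every Unicode codepoint
        return False
    if client == server:        # client can encode whatever the server can
        return False
    if server == "SQL_ASCII":   # nothing was unescaped on input
        return False
    if client == "SQL_ASCII":   # no encoding conversion is performed
        return False
    return True


def escape_non_ascii(json_text: str) -> str:
    """Replace every non-ASCII character with a JSON \\uXXXX escape,
    using a UTF-16 surrogate pair for codepoints above U+FFFF."""
    out = []
    for ch in json_text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp < 0x10000:
            out.append("\\u%04X" % cp)
        else:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04X\\u%04X" % (high, low))
    return "".join(out)


def render_for_client(json_text: str, client_encoding: str,
                      server_encoding: str) -> str:
    """Apply escaping only when the encoding pair requires it."""
    if needs_output_escaping(client_encoding, server_encoding):
        return escape_non_ascii(json_text)
    return json_text
```

For instance, under these assumptions a `LATIN1` client talking to a `UTF8` server would receive the escaped form, while a `UTF8` client would receive the text unchanged.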