author Joey Adams 2011-03-18 15:53:16 +0000
committer Joey Adams 2011-03-18 15:53:16 +0000
commit 59b46ebff2133b363ae3b174714726cf1c8640a4 (patch)
tree c71a46597ca521e8e0db6d84a8f7eebee540252e
parent 5228d8477bbcb74bba287b94ac043a04a7c9b77b (diff)
roadmap.markdown: Summarized Unicode handling, and made minor corrections.
-rw-r--r-- roadmap.markdown 12
1 files changed, 10 insertions, 2 deletions
diff --git a/roadmap.markdown b/roadmap.markdown
index a416533..aea302a 100644
--- a/roadmap.markdown
+++ b/roadmap.markdown
@@ -15,7 +15,14 @@ The core traits of this particular implementation is/will be as follows:
* Binary send/recv will not be implemented. There is no standard binary representation for JSON at this time (BSON is *not* one-to-one with JSON—Binary JSON is a misnomer).
* The text representation will be optimized, not preserved verbatim (as it currently is). For example, `[ "\u006A" ]` will become `["j"]`. I believe that users generally care a lot more about the content than the formatting of JSON, and that any users interested in preserving the formatting can just use TEXT.
-Unicode will be handled as follows:
+In a nutshell, character set handling follows two principles:
+
+ * Escapes are converted to characters to save space when it is possible and efficient to do so.
+ * Characters are escaped as necessary to prevent encoding conversion errors.
+
+The JSON datatype behaves ideally with respect to encodings when both the client and server encodings are UTF-8. When the client encoding is not UTF-8, SQL_ASCII, nor the same as the server encoding, a performance penalty is incurred to prevent encoding conversion errors.
+
+More specifically:
* On input:
@@ -28,6 +35,7 @@ Unicode will be handled as follows:
- The client encoding is UTF-8. Escaping is not necessary because the client can encode all Unicode codepoints.
- The client encoding and the server encoding are the same. Escaping is not necessary because the client can encode all codepoints the server can encode.
- The server encoding is SQL_ASCII. This encoding tells PostgreSQL to shirk transcoding in favor of speed. It wasn't unescaped on input, so don't worry about escaping on output.
+ - The client encoding is SQL_ASCII. This encoding tells PostgreSQL to not perform encoding conversion.
- Otherwise, (no matter how expensive it is) all non-ASCII characters are escaped.
* When encoding a string to JSON on the server, as in:
@@ -40,4 +48,4 @@ Unicode will be handled as follows:
SELECT from_json($$ "\u266b" $$);
- It has to be unescaped, of course. If the server encoding lacks the codepoint (including if the server encoding is SQL_ASCII), an error is thrown. As far as I know, PostgreSQL does not provide a fast path for converting individual codepoints to/from non-UTF-8 encodings, so string extraction will be a little slower if the server encoding is not UTF-8 or SQL_ASCII.
+ It has to be unescaped, of course. If the server encoding lacks the codepoint (including if the server encoding is SQL_ASCII), an error is thrown. As far as I know, PostgreSQL does not provide a fast path for converting individual codepoints to/from non-UTF-8 encodings, so string extraction will be slower if the server encoding is not UTF-8 or SQL_ASCII.
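
The output-side rule in this patch (escape non-ASCII characters unless one of the listed encoding conditions holds) can be sketched roughly as follows. This is an illustration in Python, not the extension's actual code: the names `needs_escaping` and `render` are hypothetical, the encoding names are PostgreSQL's (`UTF8`, `LATIN1`, `SQL_ASCII`), and the round trip through Python's `json` module merely stands in for the datatype's text normalization. In particular, it always unescapes on input, which the roadmap says does not happen when the server encoding is SQL_ASCII.

```python
import json


def needs_escaping(client_encoding: str, server_encoding: str) -> bool:
    """Decide whether non-ASCII characters must be escaped on output.

    Mirrors the four "no escaping needed" cases listed in the roadmap;
    anything else falls through to escaping every non-ASCII character.
    """
    client = client_encoding.upper()
    server = server_encoding.upper()
    if client == "UTF8":
        return False  # client can encode every Unicode codepoint
    if client == server:
        return False  # client can encode whatever the server can
    if server == "SQL_ASCII":
        return False  # nothing was unescaped on input
    if client == "SQL_ASCII":
        return False  # PostgreSQL performs no encoding conversion
    return True


def render(json_text: str, client_encoding: str, server_encoding: str) -> str:
    """Normalize JSON text and apply the escaping rule (simplified stand-in)."""
    value = json.loads(json_text)
    return json.dumps(
        value,
        ensure_ascii=needs_escaping(client_encoding, server_encoding),
        separators=(",", ":"),  # compact form, e.g. ["j"]
    )


if __name__ == "__main__":
    print(render('[ "\\u006A" ]', "UTF8", "UTF8"))
    print(render('["\\u266b"]', "LATIN1", "UTF8"))
```

With both encodings set to `UTF8`, the first call reproduces the roadmap's example of `[ "\u006A" ]` becoming `["j"]`; with a `LATIN1` client and a `UTF8` server, the second call keeps U+266B as an escape because the client cannot encode it.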