author    | Joey Adams | 2011-03-18 02:42:13 +0000
committer | Joey Adams | 2011-03-18 02:42:13 +0000
commit    | 5228d8477bbcb74bba287b94ac043a04a7c9b77b (patch)
tree      | 25ecc1bbb21edab4564e9c315f832f46c762b4c6
parent    | e6ba3b7e0c044aa088e204b0cb89b86ae00d5989 (diff)
roadmap.markdown: Set four primary goals, and improved unicode handling discussion.
-rw-r--r-- | roadmap.markdown | 35
1 files changed, 29 insertions, 6 deletions
diff --git a/roadmap.markdown b/roadmap.markdown
index d3c7b54..a416533 100644
--- a/roadmap.markdown
+++ b/roadmap.markdown
@@ -1,5 +1,12 @@
 Preface: This document merely contains some decisions about minor and not-so-minor details about the JSON datatype. It establishes the direction I would like to pursue with this implementation, but should not be interpreted as inflexible.
 
+I believe the things that make a JSON datatype actually useful are:
+
+ * Ability to encode/decode values
+ * Consistency through JSON validation
+ * Ease of looking at JSON (json_stringify)
+ * Compression advantage (even if it's just space removal)
+
 The core traits of this particular implementation is/will be as follows:
 
 * JSON is a TEXT-like datatype. Value wrapping and extraction require explicit use of the functions `from_json` and `to_json`.
@@ -10,11 +17,27 @@ The core traits of this particular implementation is/will be as follows:
 
 Unicode will be handled as follows:
 
- * On input, if and only if the server encoding is UTF-8, Unicode escapes above the ASCII range will be unescaped. For example, `"\u266b"` will be condensed to `"♫"` if the server encoding is UTF-8, `"\u266b"` if it is not.
- * On output, if the client encoding is neither UTF-8 nor equivalent to the server encoding, and if the server encoding is not SQL_ASCII, then all non-ASCII characters will be escaped, no matter how expensive it is.
- * If the server encoding is SQL_ASCII, Unicode escapes above ASCII will never be created nor unescaped.
- * When extracting a string, and a non-ASCII escape(s) occurs, as in:
+ * On input:
+
+    - All ASCII escapes are unescaped, except for `"`, `\`, and control characters.
+    - If and only if the server encoding is UTF-8, escapes above the ASCII range are unescaped. For example, `"\u266b"` is condensed to `"♫"`.
+    - For other server encodings, no escapes above the ASCII range are unescaped. Finding out which codepoints are escapable and which aren't would be really expensive, as PostgreSQL doesn't seem to have a fast path for that. Also, note that client characters not encodable on the server will cause an error during transcoding, meaning the datatype doesn't have to worry about them.
+
+ * On output:
+    - If any of the following hold, no escaping is done:
+        - The client encoding is UTF-8. Escaping is not necessary because the client can encode all Unicode codepoints.
+        - The client encoding and the server encoding are the same. Escaping is not necessary because the client can encode all codepoints the server can encode.
+        - The server encoding is SQL_ASCII. This encoding tells PostgreSQL to shirk transcoding in favor of speed. It wasn't unescaped on input, so don't worry about escaping on output.
+    - Otherwise, (no matter how expensive it is) all non-ASCII characters are escaped.
+
+ * When encoding a string to JSON on the server, as in:
+
+       SELECT to_json(E'\u266b');
+
+   Only `"`, `\`, and control characters are escaped.
+
+ * When decoding a JSON string on the server, and a non-ASCII escape occurs, as in:
 
-       select from_json($$ "\u266b" $$);
+       SELECT from_json($$ "\u266b" $$);
 
-   It has to be unescaped, of course. If the server encoding lacks the codepoint (including if the server encoding is SQL_ASCII), an error will be thrown. As far as I know, PostgreSQL does not provide a fast path for converting individual codepoints to/from non-UTF-8 encodings, so string extraction will be a little slower if the server encoding is not UTF-8.
+   It has to be unescaped, of course. If the server encoding lacks the codepoint (including if the server encoding is SQL_ASCII), an error is thrown. As far as I know, PostgreSQL does not provide a fast path for converting individual codepoints to/from non-UTF-8 encodings, so string extraction will be a little slower if the server encoding is not UTF-8 or SQL_ASCII.
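The input/output rules introduced by this commit can be sketched as a small Python model. This is only an illustration of the decision logic described above, not the extension's actual C implementation; the function names and the `server_encoding`/`client_encoding` parameters are hypothetical stand-ins for PostgreSQL's settings, and surrogate pairs are ignored for simplicity.

```python
import re

def unescape_on_input(json_text: str, server_encoding: str) -> str:
    """Condense \\uXXXX escapes above ASCII iff the server encoding is UTF-8.

    Escapes for '"', '\\', and control characters are always kept escaped.
    """
    if server_encoding != "UTF8":
        # For other server encodings, leave e.g. \u266b as-is.
        return json_text

    def repl(m: re.Match) -> str:
        cp = int(m.group(1), 16)
        if cp < 0x20 or chr(cp) in '"\\':
            return m.group(0)  # keep control chars, quote, backslash escaped
        return chr(cp)         # e.g. \u266b -> ♫

    return re.sub(r'\\u([0-9a-fA-F]{4})', repl, json_text)

def must_escape_non_ascii_on_output(server_encoding: str,
                                    client_encoding: str) -> bool:
    """Escape non-ASCII characters only when no exemption applies."""
    if client_encoding == "UTF8":
        return False  # client can encode all Unicode codepoints
    if client_encoding == server_encoding:
        return False  # client can encode whatever the server stored
    if server_encoding == "SQL_ASCII":
        return False  # transcoding is skipped; nothing was unescaped on input
    return True       # otherwise escape everything non-ASCII, however costly
```

For instance, under this model `unescape_on_input(r'"\u266b"', "UTF8")` condenses the escape to the literal character, while the same call with `"LATIN1"` leaves the text untouched.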