
XML Japanese Profile

W3C Note 14 April 2000

This version:
http://www.w3.org/TR/2000/NOTE-japanese-xml-20000414
Latest version:
http://www.w3.org/TR/japanese-xml
Editor:
MURATA Makoto (Fuji Xerox Information Systems Co., Ltd.)
INSTAC XML S-WG:
MURATA Makoto (Fuji Xerox Information Systems)
KOMACHI Yushi (Panasonic)
KAWAMATA Akira (Piedey)
HIYAMA Masayuki (Hiyama Office)
UCHIYAMA Mitsukazu (Toshiba)
KAMIMURA Keisuke (GLOCOM)
OKUI Yasuhiro (Unitec Corporation)
IMAGO Satosi (RICOH)
Contributors:
HANADA Takako [Translator]
Rick JELLIFFE (Academia Sinica)
François YERGEAU (Alis Technologies)

Abstract

XML Japanese Profile addresses the issues of using Japanese characters in XML documents. In particular, ambiguities in converting existing Japanese charsets to Unicode are clearly pointed out.

Status of this document

This document is a submission to the World Wide Web Consortium from Xerox, Panasonic, Toshiba, GLOCOM, Academia Sinica, Alis Technologies, and Sun Microsystems (see Submission Request, W3C Staff Comment).

This document is a NOTE made available by W3C for discussion only. Publication of this Note by W3C indicates no endorsement by W3C, the W3C Team, or any W3C Members. W3C has had no editorial control over the preparation of this Note. The acknowledgment of a Submission request does not imply that any action will be taken by W3C. It merely records publicly that the Submission request has been made by the submitting Member. This document may not be referred to as "work in progress" of the W3C. No W3C resources were, are, or will be allocated to the issues addressed by the NOTE.

A list of current W3C technical documents can be found at the Technical Reports page.

Comments on this document should be sent to muraw3c@attglobal.net.

XML Japanese Profile [JIS TR X 0015] was originally published by the Japanese Standards Association (JSA) in the Japanese language. It is not a standard but rather a technical report, which is intended to encourage public discussion, promote consensus among relevant parties, and eventually become a Japanese Industrial Standard (JIS), if appropriate. [JIS TR X 0015] was developed by the XML special working group (XML SWG) of the Information Technology Research and Standardization Center (INSTAC), JSA.

This specification was created by first translating [JIS TR X 0015] and then revising it on the basis of comments from some I18N experts. The original specification, [JIS TR X 0015], will be accordingly revised and republished by JSA in the near future. The XML SWG intends to keep this document and [JIS TR X 0015] in sync.

Table of Contents

1. Scope
2. Normative References
3. Definitions
3.1 Japanese Characters
3.2 Coded Character Sets
3.3 Character Encoding Schemes
3.4 Charsets
3.5 XML Constituents
4. Coded Character Set
4.1 JIS X 0208:1978
4.2 Compatibility Characters
5. Character Encoding Schemes
5.1 UTF-16
5.2 UTF-8
5.3 Shift-JIS
5.4 Japanese EUC (Compressed)
5.5 ISO-2022-JP
6. Charset Names
6.1 Charsets for an XML document containing Japanese characters
6.2 Code Conversion during Transmission
6.3 Storing transmitted XML constituents to files for information interchange
7. XML Constituents in Files for Information Interchange
8. Delivering XML Constituents by HTTP 1.1
9. Delivering XML Constituents via EMAIL or NEWS
10. Avoiding Conversion Ambiguities
11. The xml:lang Attribute

Appendices

A Name Characters (Non-Normative)
B Needs for Japanese XML profile (Non-Normative)
C Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative)
D Conversion tables for Shift-JIS and Japanese EUC (Non-Normative)
E Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)
F Charset parameters in HTTP1.1 (Non-Normative)
G Non-normative references (Non-Normative)

1. Scope

This technical report addresses the issue of Japanese characters in XML documents, thereby complementing [XML].

NOTE: In this technical report, the phrase "Japanese characters" is used to refer to the graphic characters specified in [JIS X 0208] and [JIS X 0212] as well as Halfwidth Katakana, the yen sign, and the overline specified in [JIS X 0201].

For high interoperability of XML documents containing Japanese characters, this technical report studies character encoding schemes (CESs) for Japanese characters. CESs for files for information interchange and CESs for each protocol (SMTP, HTTP/1.1 and others) are specified separately. Furthermore, this technical report lists conversion tables for converting these CESs to [JIS X 0221]/[Unicode 3.0], and provides names for use in encoding declarations and MIME headers.

NOTE: Refer to [IETF RFC 2130] for the source definition of the terms "Coded Character Set" and "Character Encoding Scheme" in this technical report.
NOTE: This technical report may be ignored if full interoperability for Japanese XML documents is not required.

Most parts of this technical report are mere adaptations of IETF RFCs and W3C recommendations for Japanese characters.

NOTE: For example, [XML] discourages the use of compatibility characters, defined in clause 6.8 of [Unicode 3.0]. Accordingly, this technical report discourages the use of Halfwidth Katakana characters defined in [JIS X 0201], which are compatibility characters.

When some part of this technical report is adopted from [XML] or from other IETF RFCs or W3C recommendations, the source reference is indicated in a subsequent note or appendix.

2. Normative References

Each JIS listed below refers to its latest version. If an RFC or I-D is superseded by another RFC, the new RFC is referenced.

Character Mapping Tables
The Unicode Consortium. Character Mapping Tables 1.0, Proposed Draft Unicode Technical Report 22. http://www.unicode.org/unicode/reports/tr22/tr22-1 , 1999.
IANA
IANA (Internet Assigned Numbers Authority). Official Names for Character Sets, ed. Keld Simonsen et al. ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
IETF RFC 1468
IETF (Internet Engineering Task Force). RFC 1468: Japanese character encoding for Internet messages, ed. J. Murai, M. Crispin, and E. van der Poel. 1993.
IETF RFC 1766
IETF (Internet Engineering Task Force). RFC 1766: Tags for the Identification of Languages, ed. H. Alvestrand. 1995.
IETF RFC 2045
IETF (Internet Engineering Task Force). RFC 2045: Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, ed. N. Freed and N. Borenstein. 1996.
IETF RFC 2046
IETF (Internet Engineering Task Force). RFC 2046: Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types, ed. N. Freed and N. Borenstein. 1996.
IETF RFC 2130
IETF (Internet Engineering Task Force). RFC 2130: The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996, C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin, P. Svanberg. 1997.
IETF RFC 2277
IETF (Internet Engineering Task Force). RFC 2277: IETF policy on character sets and languages, ed. H. Alvestrand. 1998.
IETF RFC 2278
IETF (Internet Engineering Task Force). RFC 2278: IANA Charset Registration Procedures, ed. N. Freed and J. Postel. 1998.
IETF RFC 2279
IETF (Internet Engineering Task Force). RFC 2279: UTF-8, a transformation format of ISO 10646, ed. F. Yergeau. 1998.
IETF RFC 2376
IETF (Internet Engineering Task Force). RFC 2376: XML media types, ed. E. Whitehead and M. Murata. 1998.
IETF RFC 2396
IETF (Internet Engineering Task Force). RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, ed. T. Berners-Lee, R. Fielding, and L. Masinter. 1998.
IETF RFC 2616
IETF (Internet Engineering Task Force). RFC 2616: Hypertext Transfer Protocol --- HTTP/1.1, ed. R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee. 1999.
IETF RFC 2781
IETF (Internet Engineering Task Force). RFC 2781: UTF-16, an encoding of ISO 10646, ed. P. Hoffman and F. Yergeau. 2000.
ISO/IEC 646
International Organization for Standardization. Information technology -- ISO 7-bit coded character set for information interchange, ISO/IEC 646:1991, International Organization for Standardization, 1991.
ISO/IEC10646
International Organization for Standardization. Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, ISO/IEC 10646:1993 (E), International Organization for Standardization, 1993.
JIS TR X 0015
Japanese Industrial Standards Committee. XML Japanese Profile, JIS TR X 0015:1999, Japanese Standards Association, May 1999.
JIS X 0201
Japanese Industrial Standards Committee. 7-bit and 8-bit coded character sets for information interchange, JIS X 0201:1997, Japanese Standards Association, 1997.
JIS X 0208
Japanese Industrial Standards Committee. 7-bit and 8-bit double byte coded KANJI sets for information interchange, JIS X 0208:1997, Japanese Standards Association, 1997.
JIS X 0212
Japanese Industrial Standards Committee. Code of the supplementary Japanese graphic character set for information interchange, JIS X 0212:1990, Japanese Standards Association, 1990.
JIS X 0221
Japanese Industrial Standards Committee. Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, JIS X 0221:1995, Japanese Standards Association, 1995.
Unicode 3.0
The Unicode Consortium. The Unicode Standard, Version 3.0. Reading, MA, Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5.
US-ASCII
American National Standards Institute, Coded character set -- 7-bit American National Standard Code for Information Interchange, ANSI X3.4-1986.
XML
W3C (World Wide Web Consortium). XML (Extensible Markup Language) Recommendation, http://www.w3.org/TR/REC-xml , 1998.
NOTE: [JIS X 0221] corresponds to [ISO/IEC10646].

3. Definitions

3.1 Japanese Characters

[Definition: ] Japanese characters are the graphic characters defined by [JIS X 0208] or [JIS X 0212] as well as Halfwidth Katakana, the yen sign, and the overline defined by [JIS X 0201].

3.2 Coded Character Sets

[Definition: ] A Coded Character Set (CCS) is a mapping from a set of characters to a set of integers.

NOTE: The term "Coded Character Set" is adopted from [IETF RFC 2277] and [IETF RFC 2130].
NOTE: [JIS X 0208] defines Coded Character Sets as correspondences between characters and bit combinations, which differs from this definition.

3.3 Character Encoding Schemes

[Definition: ] A Character Encoding Scheme (CES) is a mapping from a CCS or several CCSs to a set of octets.

NOTE: A Character Encoding Scheme is identical to that described in [IETF RFC 2277] and [IETF RFC 2130].
NOTE: [XML] uses the terms "Encoding" and "Encoding Method" identically to CESs of [IETF RFC 2277] and [IETF RFC 2130].

3.4 Charsets

[Definition: ] A charset is a set of rules for mapping from a sequence of octets to a sequence of characters.

NOTE: This definition is identical to those in [IETF RFC 2130], [IETF RFC 2277], and [IETF RFC 2278].
NOTE: A list of the charsets registered for Internet messages is available at [IANA].

3.5 XML Constituents

[Definition: ] In this technical report, the phrase XML constituents is used to mean document entities, external parsed entities, external DTD subsets, and external parameter entities.

NOTE: In the XML terminology, the word "entity" refers to document entities, internal parsed entities, external parsed entities, external unparsed entities, external DTD subsets, internal parameter entities, and external parameter entities. Among them, internal parsed entities, internal parameter entities, and external unparsed entities need not specify CESs.
NOTE: Each XML constituent is usually stored in a file or distributed as a MIME entity via some protocol.

4. Coded Character Set

4.1 JIS X 0208:1978

The use of [JIS X 0208]:1978 (the first version of JIS X 0208) is an error; results are undefined.

NOTE: [JIS X 0208]:1978 is not a source standard of [Unicode 3.0], which [XML] employs as the coded character set, and therefore the relationship between [JIS X 0208]:1978 and [Unicode 3.0] is not clear.

4.2 Compatibility Characters

Those digits, Latin characters, and special characters of [JIS X 0208] which are also specified by [JIS X 0201] are deprecated. Likewise, Halfwidth Katakana of [JIS X 0201] are deprecated.

NOTE: [XML] discourages the use of compatibility characters described in [Unicode 3.0].

5. Character Encoding Schemes

5.1 UTF-16

This technical report recommends the use of UTF-16 (or UTF-8, as described in 5.2 UTF-8).

NOTE: UTF-16 is a CES that represents all characters contained in the first 17 planes specified by [Unicode 3.0] or [ISO/IEC10646]. UTF-16 is specified by Amendment 1 of ISO/IEC 10646 and by [Unicode 3.0]. Also refer to [IETF RFC 2781] for an explanation of UTF-16.
NOTE: [XML] requires any XML processor to read entities in UTF-16 (or UTF-8, as described in 5.2 UTF-8).

The charset name for UTF-16 is "utf-16".

NOTE: In [IETF RFC 2781], "utf-16be" and "utf-16le" represent Big Endian UTF-16 and Little Endian UTF-16, respectively.
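
As a rough illustration of how these three names relate, the following Python sketch encodes a short Japanese string. It assumes Python's built-in codec spellings utf-16, utf-16-be, and utf-16-le, which are the standard library's names and not names defined by this technical report.

```python
import codecs

text = "日本語"

# Output of the "utf-16" codec begins with a BOM that announces the byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# The explicit big-endian and little-endian forms carry no BOM.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

# Each of these BMP characters occupies two octets in either byte order.
octets_per_char = len(big) // len(text)
```

A file written with the first form can be recognized as UTF-16 from its first two octets alone. The other two forms rely on an external label.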

5.2 UTF-8

This technical report recommends the use of UTF-8 (or UTF-16, as described in 5.1 UTF-16).

NOTE: UTF-8 is a CES which covers all of the characters specified by [ISO/IEC10646] and [Unicode 3.0]. UTF-8 is specified by [Unicode 3.0] and Amendment 2 of [ISO/IEC10646]. See also [IETF RFC 2279] for another explanation of UTF-8.
NOTE: [XML] requires any XML processor to read entities in UTF-8.

The charset name for UTF-8 is "utf-8".

NOTE: Charset name "utf-8" is registered at [IANA].

5.3 Shift-JIS

This technical report and [XML] treat Shift-JIS, a charset in common use in Japan, as a CES that represents Japanese characters and [US-ASCII] characters in [ISO/IEC10646] or [Unicode 3.0]. For full interoperability on the Internet, migration from Shift-JIS to UTF-8/UTF-16 is highly recommended.

NOTE: Japanese characters here include Halfwidth Katakana, the yen sign, and the overline described by [JIS X 0201].

There are four major conversion tables from Shift-JIS to [ISO/IEC10646] or [Unicode 3.0]. This technical report names them x-sjis-unicode-0.9, x-sjis-jisx0221-1995, x-sjis-cp932, and x-sjis-jdk1.1.7, respectively. These conversion tables are not identical to each other.

NOTE: Other conversion tables are also in use.

X-sjis-unicode-0.9 is published by Unicode Consortium as the conversion table for Shift-JIS (version 0.9). X-sjis-jisx0221-1995 is a conversion table derived from the conversion table in [JIS X 0221] Appendix 3 for the shift encoding which is specified in [JIS X 0208] Appendix 1. X-sjis-cp932 is published by Unicode Consortium as the conversion table for Microsoft CP932. X-sjis-jdk1.1.7 is the conversion table used for the encoding named SJIS in JDK 1.1.7.

NOTE: Conversion tables from the Unicode Consortium are available at ftp://ftp.unicode.org/Public/MAPPINGS/.
NOTE: Use of Shift-JIS cannot provide interoperability in information interchange, since any of the above-mentioned conversion tables or some other conversion tables might be used.
NOTE: It is generally assumed that Shift-JIS uses [JIS X 0201] rather than [US-ASCII]. This assumption applies to all of these conversion tables except for x-sjis-jdk1.1.7. [IETF RFC 2046] deprecates the use of any national or application-oriented version of [ISO/IEC 646] in Internet mail, except when it is completely identical to US-ASCII.
NOTE: Although the definition of Windows Standard Character Set is based on [JIS X 0201], the conversion table maps 0x5C to U+005C (REVERSE SOLIDUS).
NOTE: Appendix C, Ambiguities in conversion from Shift-JIS to Unicode, provides further information on the yen sign.
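
The practical effect of having several conversion tables can be observed with two codecs that ship with Python. This is only a sketch: it assumes that Python's shift_jis and cp932 codecs can stand in for two different tables, and neither is claimed to be identical to any x-sjis-* table named in this technical report. The helper name compare_tables is hypothetical.

```python
# Shift-JIS octet sequences for double-byte symbols (wave dash, double
# vertical line, minus sign, cent sign, pound sign, not sign).
SAMPLES = [b"\x81\x60", b"\x81\x61", b"\x81\x7c",
           b"\x81\x91", b"\x81\x92", b"\x81\xca"]


def compare_tables(samples, codec_a="shift_jis", codec_b="cp932"):
    """Return {octets: (code point under codec_a, code point under codec_b)}
    for every sample that the two codecs decode differently."""
    differences = {}
    for octets in samples:
        a = octets.decode(codec_a)
        b = octets.decode(codec_b)
        if a != b:
            differences[octets] = (f"U+{ord(a):04X}", f"U+{ord(b):04X}")
    return differences


differences = compare_tables(SAMPLES)
for octets, (a, b) in differences.items():
    print(octets.hex().upper(), a, b)
```

On CPython the octets 0x8160 decode to U+301C under shift_jis and to U+FF5E under cp932. A receiver cannot tell which result the sender intended from the label "Shift_JIS" alone.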

5.4 Japanese EUC (Compressed)

This technical report and [XML] treat Japanese EUC (Compressed) [UNIX International], a charset in common use in Japan, as a CES that represents Japanese characters and [US-ASCII] characters in [ISO/IEC10646] or [Unicode 3.0]. For full interoperability on the Internet, migration from Japanese EUC (Compressed) to UTF-8/UTF-16 is highly recommended.

There are five major conversion tables from Japanese EUC (Compressed) to [ISO/IEC10646] or [Unicode 3.0]. This technical report names them x-eucjp-unicode-0.9, x-eucjp-jisx0221-1995, x-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201 and x-eucjp-open-19970715-ascii, respectively. These conversion tables are not identical to each other.

X-eucjp-unicode-0.9 is derived from the conversion table published by the Unicode Consortium for conversion from [JIS X 0208] to [ISO/IEC10646] or [Unicode 3.0]. X-eucjp-jisx0221-1995 is derived from the one in Appendix 3 of [JIS X 0221]. X-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201 and x-eucjp-open-19970715-ascii are conversion tables defined by the OSF Japanese Vendors Council (OSF/JVC); they are named eucJP-ms, eucJP-0201 and eucJP-ascii in the Appendix of [TOG/JVC CDE/Motif Technical WG].

NOTE: Use of Japanese EUC cannot provide interoperability in information interchange, since any of these conversion tables or some other conversion table might be used.
NOTE: X-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201 and x-eucjp-open-19970715-ascii contain NEC special characters, NEC-selected IBM extended characters, IBM extension characters, and user-defined characters.

5.5 ISO-2022-JP

This technical report and [XML] treat ISO-2022-JP [IETF RFC 1468], a charset in common use in Japan, as a CES that represents Japanese characters and [US-ASCII] characters in [ISO/IEC10646] or [Unicode 3.0]. For full interoperability on the Internet, migration from ISO-2022-JP to UTF-8/UTF-16 is highly recommended.

This technical report defines conversion from ISO-2022-JP to [ISO/IEC10646] or [Unicode 3.0] via Shift-JIS or Japanese EUC; that is, ISO-2022-JP is first converted to Shift-JIS or Japanese EUC and then converted by one of the tables shown in 5.3 Shift-JIS or 5.4 Japanese EUC (Compressed). Therefore, for each conversion table from Japanese EUC or Shift-JIS, one conversion table from ISO-2022-JP is constructed.

NOTE: ISO-2022-JP is a charset designed for message transmission such as EMAIL. One can thus safely assume that information in ISO-2022-JP was temporarily converted from Shift-JIS or Japanese EUC for message transmission. Therefore, it is reasonable to convert ISO-2022-JP to [ISO/IEC10646] or [Unicode 3.0] via Shift-JIS or Japanese EUC.
NOTE: Conversion via Shift-JIS or Japanese EUC provides only the characters allowed by [IETF RFC 1468]. For example, x-iso2022jp-cp932 cannot represent those characters in x-sjis-cp932 that are not contained in [US-ASCII], [JIS X 0201] or [JIS X 0208].

After omitting identical conversion tables, five conversion tables are obtained. This technical report names them x-iso2022jp-unicode-0.9, x-iso2022jp-jisx0221-1995, x-iso2022jp-cp932, x-iso2022jp-jdk1.1.7, and x-iso2022jp-19970715-ascii. These conversion tables are not identical to each other.

Correspondences between ISO-2022-JP and Shift-JIS or Japanese EUC

              Conversion from Shift-JIS or Japanese EUC   Conversion from ISO-2022-JP
Shift-JIS     x-sjis-jdk1.1.7                             x-iso2022jp-jdk1.1.7
              x-sjis-unicode-0.9                          x-iso2022jp-unicode-0.9
              x-sjis-jisx0221-1995                        x-iso2022jp-jisx0221-1995
              x-sjis-cp932                                x-iso2022jp-cp932
Japanese EUC  x-eucjp-open-19970715-ms                    x-iso2022jp-cp932
              x-eucjp-open-19970715-0201                  x-iso2022jp-jisx0221-1995
              x-eucjp-open-19970715-ascii                 x-iso2022jp-19970715-ascii

An escape sequence of [IETF RFC 1468] (1B 24 40, 1B 24 42, 1B 28 42, or 1B 28 4A) is an error if it occurs before the end of an encoding declaration in an ISO-2022-JP XML constituent. Results are undefined.

NOTE: [XML] specifies that an XML processor must report an error and stop normal processing when it is unable to process the employed CES. Most XML processors cannot handle occurrences of bit combination 1B before the end of an encoding declaration. Such an occurrence is defined as an error (whose results are undefined) rather than a fatal error (which requires suspension) so as not to invalidate XML processors based on an existing code conversion library. If it were defined as a fatal error, such XML processors might become non-conformant.
NOTE: Use of ISO-2022-JP cannot provide interoperability in information interchange, since any of these conversion tables or some other conversion tables might be used.
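
The rule on escape sequences can be checked mechanically before a constituent is handed to an XML processor. The following Python sketch is one possible pre-check, not a procedure defined by this technical report. The function name is hypothetical, and it simplifies by looking for any octet 1B before the first "?>" instead of matching the individual escape sequences.

```python
ESC = b"\x1b"


def has_escape_before_declaration_end(data: bytes) -> bool:
    """Return True if octet 1B occurs before the end of the leading declaration."""
    if not data.startswith(b"<?xml"):
        return False  # no declaration, so the rule does not apply
    end = data.find(b"?>")
    if end == -1:
        return False  # unterminated declaration; left to the XML processor
    return ESC in data[:end]


# A constituent whose escape sequences all follow the declaration.
ok = '<?xml version="1.0" encoding="iso-2022-jp"?><doc>日本語</doc>'.encode("iso2022_jp")

# A constituent with a redundant "ESC ( B" inside the declaration.
bad = b'<?xml version="1.0" \x1b(Bencoding="iso-2022-jp"?><doc/>'
```

Here `ok` passes the check even though it contains escape sequences, because they occur only after the declaration, while `bad` is flagged.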

6. Charset Names

6.1 Charsets for an XML document containing Japanese characters

The conversion tables shown in 5. Character Encoding Schemes are assumed to be independent and non-identical charsets, as listed below:

Unicode                    utf-16
                           utf-8
Shift-JIS                  x-sjis-unicode-0.9
                           x-sjis-jisx0221-1995
                           x-sjis-cp932
                           x-sjis-jdk1.1.7
Japanese EUC (Compressed)  x-eucjp-unicode-0.9
                           x-eucjp-jisx0221-1995
                           x-eucjp-open-19970715-ms
                           x-eucjp-open-19970715-0201
                           x-eucjp-open-19970715-ascii
ISO-2022-JP                x-iso2022jp-unicode-0.9
                           x-iso2022jp-jisx0221-1995
                           x-iso2022jp-cp932
                           x-iso2022jp-jdk1.1.7
                           x-iso2022jp-19970715-ascii
NOTE: Among these charset names, this technical report recommends only utf-16 and utf-8.

Charset names can be used to specify charsets by the charset parameter of the media types "text/xml" and "application/xml" or the encoding declaration.

NOTE: The charset parameter is a parameter whose value is a charset name. In the following example, the charset name utf-8 is specified by the charset parameter of the media type "text/xml".
Content-Type: text/xml; charset=utf-8
NOTE: [IETF RFC 2376] STRONGLY RECOMMENDS appropriate charset names for the charset parameter of the media type "text/xml" or "application/xml". [XML] requires them in encoding declarations.

Charset names are case-insensitive.

NOTE: [IANA] specifies that charset names are case-insensitive.
NOTE: It remains unclear which of the conversion tables in this technical report will be used for charsets Shift_JIS, EUC-JP, and ISO-2022-JP, which are registered at [IANA]. However, users may use these charsets provided that they do not require full interoperability.
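
Because charset names are case-insensitive, software that compares them should normalize case first. The sketch below uses Python's standard email.message module to read the charset parameter of a Content-Type value and to test it against the two names this technical report recommends. The helper names charset_of and is_recommended are hypothetical.

```python
from email.message import Message

# The only charset names recommended by this technical report.
RECOMMENDED = {"utf-16", "utf-8"}


def charset_of(content_type: str):
    """Return the charset parameter in lower case, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def is_recommended(content_type: str) -> bool:
    """Return True if the charset parameter names utf-16 or utf-8."""
    return charset_of(content_type) in RECOMMENDED
```

For example, `charset_of("text/xml; charset=UTF-8")` yields the lower-case name `utf-8`, so `UTF-8` and `utf-8` are treated alike. A value such as `text/xml; charset=Shift_JIS` is parsed successfully but is not in the recommended set.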

6.2 Code Conversion during Transmission

If the CES is altered during transmission, the charset indicating the new CES must be specified by the charset parameter of "text/xml" or "application/xml". XML processors then correctly recognize the CES even if encoding declarations are not rewritten.

NOTE: [IETF RFC 2376] specifies that the charset parameter is authoritative and takes precedence over the BOM or encoding declaration.

6.3 Storing transmitted XML constituents to files for information interchange

When WWW clients, EMAIL readers, or NEWS readers save the transmitted XML constituents as files, charset names should be specified in encoding declarations or the BOM must be embedded in the beginning of the file, unless the XML constituents are encoded in UTF-8. Code conversion can be applied to the XML constituents before they are stored to files.

NOTE: Refer to [IETF RFC 2376] for the above specification.

7. XML Constituents in Files for Information Interchange

To store XML constituents in files for information interchange, they should be encoded in either UTF-16 or UTF-8.

NOTE: The initial plan was to allow Shift-JIS and Japanese EUC (Compressed) as well, but they are not allowed at present. In the future, they may be allowed after authorizing one and only one conversion table for the conversion to [ISO/IEC10646] or [Unicode 3.0]. Alternatively, all conversion tables for Shift-JIS and Japanese EUC in this technical report may be registered (with different charset names) and allowed.
NOTE: The conversion tables from Shift-JIS and Japanese EUC (Compressed) to [Unicode 3.0] or [ISO/IEC10646] are still ambiguous, but users may use these charsets provided that they do not require full interoperability.

In storing XML constituents to files, a BOM or encoding declaration must announce the CES, unless they are encoded in the default CES, i.e., UTF-8.

NOTE: 4.3.3 of [XML] requires a BOM or an encoding declaration.

When encoded in UTF-16, no charset names need to be explicitly specified since the BOM announces UTF-16 as the CES.

Without any external information available for the CES, an XML processor determines the CES from the BOM or encoding declaration. This auto-detection of CESs is described in Appendix F of [XML].

When octet streams representing XML constituents serve as input/output of application programs, such streams should also follow the rules for information interchange files.
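
The file-based detection described in this section can be sketched in Python as follows. This is a deliberate simplification of the procedure in Appendix F of [XML]: it handles only a UTF-16 BOM, an optional UTF-8 BOM, and an encoding declaration written in ASCII-compatible octets. It does not cover UTF-16 without a BOM or other cases that Appendix F addresses. The function name detect_ces is hypothetical.

```python
import codecs
import re

# Matches an encoding declaration inside a leading XML or text declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def detect_ces(data: bytes) -> str:
    """Guess the CES of a stored XML constituent from its first octets."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"  # the BOM announces UTF-16
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # the default CES when nothing else is announced
```

The returned name is lower-cased because charset names are case-insensitive.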

8. Delivering XML Constituents by HTTP 1.1

XML servers should use UTF-16 or UTF-8 in delivering XML constituents.

NOTE: The initial plan was to allow Shift-JIS and Japanese EUC (Compressed) as well, but they are not allowed at present. In the future, they may be allowed after authorizing one and only one conversion table for the conversion to [ISO/IEC10646] or [Unicode 3.0]. Alternatively, all conversion tables for Shift-JIS and Japanese EUC in this technical report may be registered (with different charset names) and allowed.
NOTE: The conversion tables from Shift-JIS and Japanese EUC (Compressed) to [Unicode 3.0] or [ISO/IEC10646] are still ambiguous, but users may use these charsets provided that they do not require full interoperability.

XML documents delivered via HTTP 1.1 ([IETF RFC 2616]) with the media type "text/xml" or "application/xml" should have the correct charset parameter. In the case of "application/xml", the charset parameter may be omitted if the CES is indicated by the BOM or encoding declaration.

NOTE: [IETF RFC 2376] STRONGLY RECOMMENDS the use of the charset parameter. The charset parameter takes precedence over the charset specified by the BOM or encoding declaration.
NOTE: If an XML document is encoded in US-ASCII and labelled as "text/xml", the charset parameter may be omitted. Such an XML document, however, is outside the scope of this technical report, since it cannot contain any Japanese characters.
NOTE: In configuring WWW servers, the correct charset should be attached to each file containing an XML constituent of the media type "text/xml" or "application/xml".

The client determines the CES from the charset parameter. The default of the charset parameter for "text/xml" is US-ASCII. When the charset parameter of the media type "application/xml" is omitted, the auto-detection procedure for information interchange files is applied.

NOTE: [IETF RFC 2376] is the source standard of the above specification.
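
The client-side rule above can be sketched as a small decision function. The Python code below is an illustration under stated assumptions, not a normative algorithm: the function name ces_from_http and the sentinel string "autodetect" are hypothetical, and only the media types "text/xml" and "application/xml" are considered.

```python
from email.message import Message


def ces_from_http(content_type: str) -> str:
    """Decide how a client determines the CES from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset is not None:
        return charset  # the charset parameter is authoritative
    if msg.get_content_type() == "text/xml":
        return "us-ascii"  # default when text/xml carries no charset parameter
    # application/xml without a charset parameter: fall back to the BOM or
    # encoding declaration, as for information interchange files.
    return "autodetect"
```

Note that a "text/xml" response without a charset parameter is read as US-ASCII regardless of any encoding declaration inside the document, which is why the charset parameter matters for documents containing Japanese characters.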

9. Delivering XML Constituents via EMAIL or NEWS

To deliver an XML constituent as the EMAIL body, it should be encoded in UTF-16 or UTF-8.

NOTE: The initial plan was to allow ISO-2022-JP as well, but it is not allowed at present. In the future, it may be allowed after authorizing one and only one conversion table for the conversion to [ISO/IEC10646] or [Unicode 3.0]. Alternatively, all conversion tables for ISO-2022-JP in this technical report may be registered (with different charset names) and allowed.
NOTE: The conversion table from ISO-2022-JP to [Unicode 3.0] or [ISO/IEC10646] is still ambiguous, but users may use this charset provided that they do not require full interoperability.
NOTE: There was no plan to allow Japanese EUC (Compressed), because Japanese EUC has been rarely used for EMAIL or NEWS.

If the charset is UTF-16, use "application/xml" as the media type; otherwise, use "text/xml". In both cases, the charset parameter is STRONGLY RECOMMENDED.

Apply a content transfer encoding (Base64 etc.), if required.

NOTE: Details of content transfer encodings are specified by [IETF RFC 2045].
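
As an illustration of these recommendations, the following Python sketch builds a MIME entity of type "text/xml" in UTF-8 using the standard email.mime.text module. The subject line and the sample document are placeholders. The choice of Base64 is the library's default for UTF-8 bodies, not a requirement of this technical report.

```python
from email.mime.text import MIMEText

xml = '<?xml version="1.0" encoding="utf-8"?>\n<doc>日本語</doc>\n'

# Subtype "xml" gives the media type text/xml; the third argument sets the
# charset parameter and causes the body to be transfer-encoded.
msg = MIMEText(xml, "xml", "utf-8")
msg["Subject"] = "XML constituent"

content_type = msg.get_content_type()
charset = msg.get_content_charset()
transfer_encoding = msg["Content-Transfer-Encoding"]

# Decoding the payload recovers the original octets.
round_trip = msg.get_payload(decode=True).decode("utf-8")
```

A UTF-16 constituent would instead be labelled "application/xml", which this sketch does not cover.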

10. Avoiding Conversion Ambiguities

As an example of ambiguous conversion, consider a "Shift_JIS" XML document as below:

<?xml version="1.0" encoding="shift_jis"?>
<doc>{1}{2}{3}{4}{5}
{6}{7}{8}{9}{10}</doc>

where {1} through {10} are used to represent octets as below:

  1. 0x5C(YEN SIGN),
  2. 0x7E(OVERLINE),
  3. 0x815C(FULLWIDTH EM DASH),
  4. 0x815F(REVERSE SOLIDUS),
  5. 0x8160(WAVE DASH),
  6. 0x8161(DOUBLE VERTICAL LINE),
  7. 0x817C(MINUS SIGN),
  8. 0x8191(CENT SIGN),
  9. 0x8192(POUND SIGN), and
  10. 0x81CA(NOT SIGN).

Different conversion tables interpret this document differently.

Case 1: x-sjis-jdk1.1.7

0x5C(YEN SIGN), 0x7E(OVERLINE), 0x815C(FULLWIDTH EM DASH), 0x815F(REVERSE SOLIDUS), 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN), and 0x81CA(NOT SIGN).

Case 2: x-sjis-unicode-0.9 (the third character is HORIZONTAL BAR)

0x5C(YEN SIGN), 0x7E(OVERLINE), 0x815C(FULLWIDTH EM DASH), 0x815F(REVERSE SOLIDUS), 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN), and 0x81CA(NOT SIGN).

Case 3: x-sjis-jisx0221-1995 (the third character is EM DASH)

0x5C(YEN SIGN), 0x7E(OVERLINE), 0x815C(FULLWIDTH EM DASH), 0x815F(REVERSE SOLIDUS), 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN), and 0x81CA(NOT SIGN).

Case 4: x-sjis-cp932

0x5C(YEN SIGN), 0x7E(OVERLINE), 0x815C(FULLWIDTH EM DASH), 0x815F(REVERSE SOLIDUS), 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN), and 0x81CA(NOT SIGN).

We can avoid this ambiguity by using character references. For example, the document after normalization in case 3 can be explicitly specified as below:

<?xml version="1.0" encoding="shift_jis"?>
<doc>&#x005C;&#x007E;&#x2015;&#x005C;&#x301C;
&#x2016;&#x2212;&#x00A2;&#x00A3;&#x00AC;</doc>

Alternatively, one can declare and use parsed entities for these character references. The following document captures all Unicode characters ambiguously represented by Shift_JIS.

<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY revsol  "&#x5C;">
  <!ENTITY tilde   "&#x7E;">
  <!ENTITY cent    "&#xA2;">
  <!ENTITY pound   "&#xA3;">
  <!ENTITY yen     "&#xA5;">
  <!ENTITY not     "&#xAC;">
  <!ENTITY mdash   "&#x2014;">
  <!ENTITY horbar  "&#x2015;">
  <!ENTITY dvline  "&#x2016;">
  <!ENTITY fparato "&#x2225;">
  <!ENTITY wdash   "&#x301C;">
  <!ENTITY fminus  "&#xFF0D;">
  <!ENTITY fbsol   "&#xFF3C;">
  <!ENTITY ftilde  "&#xFF5E;">
  <!ENTITY fcent   "&#xFFE0;">
  <!ENTITY fpoun   "&#xFFE1;">
  <!ENTITY fnot    "&#xFFE2;">
  <!ENTITY fmacron "&#xFFE3;">
  <!ENTITY fbrvbar "&#xFFE4;">
  <!ENTITY fyen    "&#xFFE5;">
]>
<doc>&revsol;
&tilde;
&cent;
&pound;
&yen;
&not;
&mdash;
&horbar;
&dvline;
&fparato;
&wdash;
&fminus;
&fbsol;
&ftilde;
&fcent;
&fpoun;
&fnot;
&fmacron;
&fbrvbar;
&fyen;</doc>
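
The character-reference approach can also be applied by a program before a document is written out in a Japanese CES. The Python sketch below replaces each of the twenty code points declared as entities in the document above with a numeric character reference. The function name is hypothetical, and the sketch assumes it is applied to character data only, not to markup.

```python
# The code points declared as entities in the document above.
AMBIGUOUS = frozenset(
    "\u005C\u007E\u00A2\u00A3\u00A5\u00AC\u2014\u2015\u2016\u2225"
    "\u301C\uFF0D\uFF3C\uFF5E\uFFE0\uFFE1\uFFE2\uFFE3\uFFE4\uFFE5"
)


def to_character_references(text: str) -> str:
    """Replace ambiguously converted characters with &#x...; references."""
    return "".join(
        f"&#x{ord(ch):04X};" if ch in AMBIGUOUS else ch
        for ch in text
    )
```

Since the references consist of ASCII characters only, the resulting text is read as the same code points regardless of which conversion table the receiver applies.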

11. The xml:lang Attribute

Use the xml:lang attribute to indicate the use of the Japanese language for an XML document in whole or in part.

NOTE: The xml:lang attribute is specified in [XML].

Specify a language code for Japanese ("ja", for example) as the value of this attribute. Values of this attribute are case-insensitive.

NOTE: Details of language codes are specified in [IETF RFC 1766].

An XML document containing a Japanese paragraph and an English paragraph is shown below:

<?xml version="1.0" encoding="utf-8"?>
<document>
  <para xml:lang="ja">これは段落です。</para>
  <para xml:lang="en">This is a paragraph.</para>
</document>
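
An application can use the attribute to select the Japanese parts of such a document. The following Python sketch uses the standard xml.etree.ElementTree module, in which the xml:lang attribute appears under the XML namespace URI. It compares values case-insensitively and, as a simplification, matches only the exact code "ja" and not longer tags that begin with it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

source = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<document>\n"
    '  <para xml:lang="ja">これは段落です。</para>\n'
    '  <para xml:lang="en">This is a paragraph.</para>\n'
    "</document>\n"
).encode("utf-8")

root = ET.fromstring(source)

# Collect the text of every paragraph marked as Japanese.
japanese = [
    para.text
    for para in root.iter("para")
    if (para.get(XML_LANG) or "").lower() == "ja"
]
```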

Appendices

A Name Characters (Non-Normative)

All graphic characters contained in 10th area and later of [JIS X 0208] and 5th area and later of [JIS X 0212] are name start characters.

Table 1 lists name start characters in 1st through 9th areas of [JIS X 0208]. Tables 2 and 3 list other name characters in these areas and name characters in 1st through 4th areas, respectively. All name characters in [JIS X 0212] are name start characters.

NOTE: In [XML], name characters are characters which can be used as part of element and attribute names, and name start characters are those name characters with which names may begin.
NOTE: This chapter merely adapts [XML] name characters to [JIS X 0208] and [JIS X 0212].
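
The KUTEN codes used in the tables of this appendix can be related to UCS code points programmatically. The Python sketch below forms the two Japanese EUC octets for a KUTEN pair and decodes them with Python's euc_jp codec. It assumes that codec as the conversion table, which may differ from the tables named in this technical report for the ambiguous characters discussed in section 10. The function name kuten_to_ucs is hypothetical.

```python
def kuten_to_ucs(ku: int, ten: int) -> str:
    """Map a JIS X 0208 KUTEN pair to a 'U+xxxx' string via the euc_jp codec."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    # In Japanese EUC, each octet of a JIS X 0208 character is 0xA0 plus
    # the row (ku) or cell (ten) number.
    octets = bytes((0xA0 + ku, 0xA0 + ten))
    char = octets.decode("euc_jp")  # raises UnicodeDecodeError if unassigned
    return f"U+{ord(char):04x}"
```

For instance, `kuten_to_ucs(4, 2)` corresponds to the row "04-02" of the table that follows.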
Name start characters: 1st through 9th areas of [JIS X 0208]
KUTEN code UCS code points
01-24 U+4edd
01-27 U+3007
02-82 U+212b
04-01 U+3041
04-02 U+3042
04-03 U+3043
04-04 U+3044
04-05 U+3045
04-06 U+3046
04-07 U+3047
04-08 U+3048
04-09 U+3049
04-10 U+304a
04-11 U+304b
04-12 U+304c
04-13 U+304d
04-14 U+304e
04-15 U+304f
04-16 U+3050
04-17 U+3051
04-18 U+3052
04-19 U+3053
04-20 U+3054
04-21 U+3055
04-22 U+3056
04-23 U+3057
04-24 U+3058
04-25 U+3059
04-26 U+305a
04-27 U+305b
04-28 U+305c
04-29 U+305d
04-30 U+305e
04-31 U+305f
04-32 U+3060
04-33 U+3061
04-34 U+3062
04-35 U+3063
04-36 U+3064
04-37 U+3065
04-38 U+3066
04-39 U+3067
04-40 U+3068
04-41 U+3069
04-42 U+306a
04-43 U+306b
04-44 U+306c
04-45 U+306d
04-46 U+306e
04-47 U+306f
04-48 U+3070
04-49 U+3071
04-50 U+3072
04-51 U+3073
04-52 U+3074
04-53 U+3075
04-54 U+3076
04-55 U+3077
04-56 U+3078
04-57 U+3079
04-58 U+307a
04-59 U+307b
04-60 U+307c
04-61 U+307d
04-62 U+307e
04-63 U+307f
04-64 U+3080
04-65 U+3081
04-66 U+3082
04-67 U+3083
04-68 U+3084
04-69 U+3085
04-70 U+3086
04-71 U+3087
04-72 U+3088
04-73 U+3089
04-74 U+308a
04-75 U+308b
04-76 U+308c
04-77 U+308d
04-78 U+308e
04-79 U+308f
04-80 U+3090
04-81 U+3091
04-82 U+3092
04-83 U+3093
05-01 U+30a1
05-02 U+30a2
05-03 U+30a3
05-04 U+30a4
05-05 U+30a5
05-06 U+30a6
05-07 U+30a7
05-08 U+30a8
05-09 U+30a9
05-10 U+30aa
05-11 U+30ab
05-12 U+30ac
05-13 U+30ad
05-14 U+30ae
05-15 U+30af
05-16 U+30b0
05-17 U+30b1
05-18 U+30b2
05-19 U+30b3
05-20 U+30b4
05-21 U+30b5
05-22 U+30b6
05-23 U+30b7
05-24 U+30b8
05-25 U+30b9
05-26 U+30ba
05-27 U+30bb
05-28 U+30bc
05-29 U+30bd
05-30 U+30be
05-31 U+30bf
05-32 U+30c0
05-33 U+30c1
05-34 U+30c2
05-35 U+30c3
05-36 U+30c4
05-37 U+30c5
05-38 U+30c6
05-39 U+30c7
05-40 U+30c8
05-41 U+30c9
05-42 U+30ca
05-43 U+30cb
05-44 U+30cc
05-45 U+30cd
05-46 U+30ce
05-47 U+30cf
05-48 U+30d0
05-49 U+30d1
05-50 U+30d2
05-51 U+30d3
05-52 U+30d4
05-53 U+30d5
05-54 U+30d6
05-55 U+30d7
05-56 U+30d8
05-57 U+30d9
05-58 U+30da
05-59 U+30db
05-60 U+30dc
05-61 U+30dd
05-62 U+30de
05-63 U+30df
05-64 U+30e0
05-65 U+30e1
05-66 U+30e2
05-67 U+30e3
05-68 U+30e4
05-69 U+30e5
05-70 U+30e6
05-71 U+30e7
05-72 U+30e8
05-73 U+30e9
05-74 U+30ea
05-75 U+30eb
05-76 U+30ec
05-77 U+30ed
05-78 U+30ee
05-79 U+30ef
05-80 U+30f0
05-81 U+30f1
05-82 U+30f2
05-83 U+30f3
05-84 U+30f4
05-85 U+30f5
05-86 U+30f6
Α 06-01 U+0391
Β 06-02 U+0392
Γ 06-03 U+0393
Δ 06-04 U+0394
Ε 06-05 U+0395
Ζ 06-06 U+0396
Η 06-07 U+0397
Θ 06-08 U+0398
Ι 06-09 U+0399
Κ 06-10 U+039a
Λ 06-11 U+039b
Μ 06-12 U+039c
Ν 06-13 U+039d
Ξ 06-14 U+039e
Ο 06-15 U+039f
Π 06-16 U+03a0
Ρ 06-17 U+03a1
Σ 06-18 U+03a3
Τ 06-19 U+03a4
Υ 06-20 U+03a5
Φ 06-21 U+03a6
Χ 06-22 U+03a7
Ψ 06-23 U+03a8
Ω 06-24 U+03a9
α 06-33 U+03b1
β 06-34 U+03b2
γ 06-35 U+03b3
δ 06-36 U+03b4
ε 06-37 U+03b5
ζ 06-38 U+03b6
η 06-39 U+03b7
θ 06-40 U+03b8
ι 06-41 U+03b9
κ 06-42 U+03ba
λ 06-43 U+03bb
μ 06-44 U+03bc
ν 06-45 U+03bd
ξ 06-46 U+03be
ο 06-47 U+03bf
π 06-48 U+03c0
ρ 06-49 U+03c1
σ 06-50 U+03c3
τ 06-51 U+03c4
υ 06-52 U+03c5
φ 06-53 U+03c6
χ 06-54 U+03c7
ψ 06-55 U+03c8
ω 06-56 U+03c9
А 07-01 U+0410
Б 07-02 U+0411
В 07-03 U+0412
Г 07-04 U+0413
Д 07-05 U+0414
Е 07-06 U+0415
Ё 07-07 U+0401
Ж 07-08 U+0416
З 07-09 U+0417
И 07-10 U+0418
Й 07-11 U+0419
К 07-12 U+041a
Л 07-13 U+041b
М 07-14 U+041c
Н 07-15 U+041d
О 07-16 U+041e
П 07-17 U+041f
Р 07-18 U+0420
С 07-19 U+0421
Т 07-20 U+0422
У 07-21 U+0423
Ф 07-22 U+0424
Х 07-23 U+0425
Ц 07-24 U+0426
Ч 07-25 U+0427
Ш 07-26 U+0428
Щ 07-27 U+0429
Ъ 07-28 U+042a
Ы 07-29 U+042b
Ь 07-30 U+042c
Э 07-31 U+042d
Ю 07-32 U+042e
Я 07-33 U+042f
а 07-49 U+0430
б 07-50 U+0431
в 07-51 U+0432
г 07-52 U+0433
д 07-53 U+0434
е 07-54 U+0435
ё 07-55 U+0451
ж 07-56 U+0436
з 07-57 U+0437
и 07-58 U+0438
й 07-59 U+0439
к 07-60 U+043a
л 07-61 U+043b
м 07-62 U+043c
н 07-63 U+043d
о 07-64 U+043e
п 07-65 U+043f
р 07-66 U+0440
с 07-67 U+0441
т 07-68 U+0442
у 07-69 U+0443
ф 07-70 U+0444
х 07-71 U+0445
ц 07-72 U+0446
ч 07-73 U+0447
ш 07-74 U+0448
щ 07-75 U+0449
ъ 07-76 U+044a
ы 07-77 U+044b
ь 07-78 U+044c
э 07-79 U+044d
ю 07-80 U+044e
я 07-81 U+044f
Name characters other than name start characters: 1st through 9th areas of [JIS X 0208]
KUTEN code UCS code points
01-19 U+30fd
01-20 U+30fe
01-21 U+309d
01-22 U+309e
01-25 U+3005
01-28 U+30fc
Name start characters: 1st through 4th areas of [JIS X 0212]
KUTEN code UCS code points
Ά 06-65 U+0386
Έ 06-66 U+0388
Ή 06-67 U+0389
Ί 06-68 U+038a
Ϊ 06-69 U+03aa
Ό 06-71 U+038c
Ύ 06-73 U+038e
Ϋ 06-74 U+03ab
Ώ 06-76 U+038f
ά 06-81 U+03ac
έ 06-82 U+03ad
ή 06-83 U+03ae
ί 06-84 U+03af
ϊ 06-85 U+03ca
ΐ 06-86 U+0390
ό 06-87 U+03cc
ς 06-88 U+03c2
ύ 06-89 U+03cd
ϋ 06-90 U+03cb
ΰ 06-91 U+03b0
ώ 06-92 U+03ce
Ђ 07-34 U+0402
Ѓ 07-35 U+0403
Є 07-36 U+0404
Ѕ 07-37 U+0405
І 07-38 U+0406
Ї 07-39 U+0407
Ј 07-40 U+0408
Љ 07-41 U+0409
Њ 07-42 U+040a
Ћ 07-43 U+040b
Ќ 07-44 U+040c
Ў 07-45 U+040e
Џ 07-46 U+040f
ђ 07-82 U+0452
ѓ 07-83 U+0453
є 07-84 U+0454
ѕ 07-85 U+0455
і 07-86 U+0456
ї 07-87 U+0457
ј 07-88 U+0458
љ 07-89 U+0459
њ 07-90 U+045a
ћ 07-91 U+045b
ќ 07-92 U+045c
ў 07-93 U+045e
џ 07-94 U+045f
Æ 09-01 U+00c6
Đ 09-02 U+0110
Ħ 09-04 U+0126
Ł 09-08 U+0141
Ŋ 09-11 U+014a
Ø 09-12 U+00d8
Œ 09-13 U+0152
Ŧ 09-15 U+0166
Þ 09-16 U+00de
æ 09-33 U+00e6
đ 09-34 U+0111
ð 09-35 U+00f0
ħ 09-36 U+0127
ı 09-37 U+0131
ĸ 09-39 U+0138
ł 09-40 U+0142
ŋ 09-43 U+014b
ø 09-44 U+00f8
œ 09-45 U+0153
ß 09-46 U+00df
ŧ 09-47 U+0167
þ 09-48 U+00fe
Á 10-01 U+00c1
À 10-02 U+00c0
Ä 10-03 U+00c4
Â 10-04 U+00c2
Ă 10-05 U+0102
Ǎ 10-06 U+01cd
Ā 10-07 U+0100
Ą 10-08 U+0104
Å 10-09 U+00c5
à 10-10 U+00c3
Ć 10-11 U+0106
Ĉ 10-12 U+0108
Č 10-13 U+010c
Ç 10-14 U+00c7
Ċ 10-15 U+010a
Ď 10-16 U+010e
É 10-17 U+00c9
È 10-18 U+00c8
Ë 10-19 U+00cb
Ê 10-20 U+00ca
Ě 10-21 U+011a
Ė 10-22 U+0116
Ē 10-23 U+0112
Ę 10-24 U+0118
Ĝ 10-26 U+011c
Ğ 10-27 U+011e
Ģ 10-28 U+0122
Ġ 10-29 U+0120
Ĥ 10-30 U+0124
Í 10-31 U+00cd
Ì 10-32 U+00cc
Ï 10-33 U+00cf
Î 10-34 U+00ce
Ǐ 10-35 U+01cf
İ 10-36 U+0130
Ī 10-37 U+012a
Į 10-38 U+012e
Ĩ 10-39 U+0128
Ĵ 10-40 U+0134
Ķ 10-41 U+0136
Ĺ 10-42 U+0139
Ľ 10-43 U+013d
Ļ 10-44 U+013b
Ń 10-45 U+0143
Ň 10-46 U+0147
Ņ 10-47 U+0145
Ñ 10-48 U+00d1
Ó 10-49 U+00d3
Ò 10-50 U+00d2
Ö 10-51 U+00d6
Ô 10-52 U+00d4
Ǒ 10-53 U+01d1
Ő 10-54 U+0150
Ō 10-55 U+014c
Õ 10-56 U+00d5
Ŕ 10-57 U+0154
Ř 10-58 U+0158
Ŗ 10-59 U+0156
Ś 10-60 U+015a
Ŝ 10-61 U+015c
Š 10-62 U+0160
Ş 10-63 U+015e
Ť 10-64 U+0164
Ţ 10-65 U+0162
Ú 10-66 U+00da
Ù 10-67 U+00d9
Ü 10-68 U+00dc
Û 10-69 U+00db
Ŭ 10-70 U+016c
Ǔ 10-71 U+01d3
Ű 10-72 U+0170
Ū 10-73 U+016a
Ų 10-74 U+0172
Ů 10-75 U+016e
Ũ 10-76 U+0168
Ǘ 10-77 U+01d7
Ǜ 10-78 U+01db
Ǚ 10-79 U+01d9
Ǖ 10-80 U+01d5
Ŵ 10-81 U+0174
Ý 10-82 U+00dd
Ÿ 10-83 U+0178
Ŷ 10-84 U+0176
Ź 10-85 U+0179
Ž 10-86 U+017d
Ż 10-87 U+017b
á 11-01 U+00e1
à 11-02 U+00e0
ä 11-03 U+00e4
â 11-04 U+00e2
ă 11-05 U+0103
ǎ 11-06 U+01ce
ā 11-07 U+0101
ą 11-08 U+0105
å 11-09 U+00e5
ã 11-10 U+00e3
ć 11-11 U+0107
ĉ 11-12 U+0109
č 11-13 U+010d
ç 11-14 U+00e7
ċ 11-15 U+010b
ď 11-16 U+010f
é 11-17 U+00e9
è 11-18 U+00e8
ë 11-19 U+00eb
ê 11-20 U+00ea
ě 11-21 U+011b
ė 11-22 U+0117
ē 11-23 U+0113
ę 11-24 U+0119
ǵ 11-25 U+01f5
ĝ 11-26 U+011d
ğ 11-27 U+011f
ġ 11-29 U+0121
ĥ 11-30 U+0125
í 11-31 U+00ed
ì 11-32 U+00ec
ï 11-33 U+00ef
î 11-34 U+00ee
ǐ 11-35 U+01d0
ī 11-37 U+012b
į 11-38 U+012f
ĩ 11-39 U+0129
ĵ 11-40 U+0135
ķ 11-41 U+0137
ĺ 11-42 U+013a
ľ 11-43 U+013e
ļ 11-44 U+013c
ń 11-45 U+0144
ň 11-46 U+0148
ņ 11-47 U+0146
ñ 11-48 U+00f1
ó 11-49 U+00f3
ò 11-50 U+00f2
ö 11-51 U+00f6
ô 11-52 U+00f4
ǒ 11-53 U+01d2
ő 11-54 U+0151
ō 11-55 U+014d
õ 11-56 U+00f5
ŕ 11-57 U+0155
ř 11-58 U+0159
ŗ 11-59 U+0157
ś 11-60 U+015b
ŝ 11-61 U+015d
š 11-62 U+0161
ş 11-63 U+015f
ť 11-64 U+0165
ţ 11-65 U+0163
ú 11-66 U+00fa
ù 11-67 U+00f9
ü 11-68 U+00fc
û 11-69 U+00fb
ŭ 11-70 U+016d
ǔ 11-71 U+01d4
ű 11-72 U+0171
ū 11-73 U+016b
ų 11-74 U+0173
ů 11-75 U+016f
ũ 11-76 U+0169
ǘ 11-77 U+01d8
ǜ 11-78 U+01dc
ǚ 11-79 U+01da
ǖ 11-80 U+01d6
ŵ 11-81 U+0175
ý 11-82 U+00fd
ÿ 11-83 U+00ff
ŷ 11-84 U+0177
ź 11-85 U+017a
ž 11-86 U+017e
ż 11-87 U+017c
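
The relation between the KUTEN codes and the UCS code points in the tables above can be spot-checked mechanically. The sketch below is an illustration only. It assumes that Python's built-in euc_jp codec is an acceptable stand-in for a conversion table (it is not one of the charsets named in this technical report), and the function name kuten_to_ucs is hypothetical. A KUTEN code is turned into Japanese EUC octets by adding 0xA0 to the row and to the cell; [JIS X 0212] codes are additionally prefixed with the octet 0x8F.

```python
def kuten_to_ucs(row, cell, jisx0212=False):
    """Map a KUTEN code to a UCS code point through Python's euc_jp codec.

    By default the code is taken from [JIS X 0208]; pass jisx0212=True
    for [JIS X 0212], which Japanese EUC prefixes with the octet 0x8F.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    octets = bytes([0xA0 + row, 0xA0 + cell])
    if jisx0212:
        octets = b"\x8f" + octets
    # An unassigned position raises UnicodeDecodeError.
    return ord(octets.decode("euc_jp"))


if __name__ == "__main__":
    for row, cell, is_0212 in [(4, 2, False), (5, 86, False), (6, 65, True)]:
        ucs = kuten_to_ucs(row, cell, is_0212)
        print(f"{row:02d}-{cell:02d} U+{ucs:04x}")
```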

B Needs for Japanese XML profile (Non-Normative)

[XML] adopts [ISO/IEC10646] or [Unicode 3.0] as the CCS, which contains all Japanese characters. UTF-8 and UTF-16 are the recommended CESs, and implementations are required to support them. Other existing CESs are optionally allowed, as long as they represent characters in [Unicode 3.0] only.

[XML], however, provides little information on existing CESs already in use for the interchange of Japanese characters. Such CESs are allowed as mere options among many others. Furthermore, [XML] says nothing about the appropriate CESs for each protocol (e.g. SMTP or HTTP) and those for information exchange files.

The mapping between such existing CESs and [ISO/IEC10646]/[Unicode 3.0] is not specified either. Some mutually different conversions are in use, and thus different XML processors may emit different outputs.

This technical report addresses existing CESs and clarifies open issues. Although problems with the use of such CESs are not solved, the nature of these problems has become clear.

C Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative)

There are four main ambiguities in conversion from Shift-JIS to [ISO/IEC10646] and [Unicode 3.0].

First, 0x5C and 0x7E are respectively converted to the yen sign and the overline by x-sjis-unicode-0.9 and x-sjis-jisx0221-1995, but respectively converted to backslash and tilde by x-sjis-cp932 and x-sjis-jdk1.1.7.

Second, x-sjis-cp932 is the only conversion table which provides a peculiar mapping of 0x8160(WAVE DASH), 0x8161(DOUBLE VERTICAL LINE), 0x817C(MINUS SIGN), 0x8191(CENT SIGN), 0x8192(POUND SIGN) and 0x81CA(NOT SIGN).

Third, x-sjis-jisx0221-1995 is the only conversion table which maps 0x815C(FULLWIDTH EM DASH) to U+2014(EM DASH); the other conversion tables map it to U+2015(HORIZONTAL BAR).

Fourth, x-sjis-cp932 is the only conversion table which contains NEC special characters, NEC-selected IBM extended characters, IBM extended characters, and user-defined characters.

Ambiguities in conversion from Shift-JIS to [Unicode 3.0]/[ISO/IEC10646]
Octets in Shift-JIS x-sjis-jdk1.1.7 x-sjis-unicode-0.9 x-sjis-jisx0221-1995 x-sjis-cp932
0x5C(YEN SIGN) U+005C(REVERSE SOLIDUS) U+00A5(YEN SIGN) U+00A5(YEN SIGN) U+005C(REVERSE SOLIDUS)
0x7E(OVERLINE) U+007E(TILDE) U+203E(OVERLINE) U+203E(OVERLINE) U+007E(TILDE)
0x815C(FULLWIDTH EM DASH) U+2015(HORIZONTAL BAR) U+2015(HORIZONTAL BAR) U+2014(EM DASH) U+2015(HORIZONTAL BAR)
0x815F(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0x8160(WAVE DASH) U+301C(WAVE DASH) U+301C(WAVE DASH) U+301C(WAVE DASH) U+FF5E(FULLWIDTH TILDE)
0x8161(DOUBLE VERTICAL LINE) U+2016(DOUBLE VERTICAL LINE) U+2016(DOUBLE VERTICAL LINE) U+2016(DOUBLE VERTICAL LINE) U+2225(PARALLEL TO)
0x817C(MINUS SIGN) U+2212(MINUS SIGN) U+2212(MINUS SIGN) U+2212(MINUS SIGN) U+FF0D(FULLWIDTH HYPHEN-MINUS)
0x8191(CENT SIGN) U+00A2(CENT SIGN) U+00A2(CENT SIGN) U+00A2(CENT SIGN) U+FFE0(FULLWIDTH CENT SIGN)
0x8192(POUND SIGN) U+00A3(POUND SIGN) U+00A3(POUND SIGN) U+00A3(POUND SIGN) U+FFE1(FULLWIDTH POUND SIGN)
0x81CA(NOT SIGN) U+00AC(NOT SIGN) U+00AC(NOT SIGN) U+00AC(NOT SIGN) U+FFE2(FULLWIDTH NOT SIGN)
Extended characters in 13th, 89th-92nd, and 115th-119th rows None None None Included
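
The second ambiguity can be observed with two converters that ship with Python. The sketch below is an illustration under stated assumptions: Python's shift_jis and cp932 codecs are used as stand-ins and are not the x-sjis-* conversion tables of this technical report, and the names AMBIGUOUS_OCTETS, decode_with and compare are hypothetical. Only the double-byte sequences of the second ambiguity are compared; the single octets 0x5C and 0x7E are left out.

```python
# Double-byte Shift-JIS sequences whose mapping depends on the converter.
AMBIGUOUS_OCTETS = [0x8160, 0x8161, 0x817C, 0x8191, 0x8192, 0x81CA]


def decode_with(codec, octets):
    """Decode one double-byte sequence and return its UCS code point."""
    return ord(octets.to_bytes(2, "big").decode(codec))


def compare(octets_list=AMBIGUOUS_OCTETS):
    """Return {octets: (code point via shift_jis, code point via cp932)}."""
    return {
        o: (decode_with("shift_jis", o), decode_with("cp932", o))
        for o in octets_list
    }


if __name__ == "__main__":
    for octets, (sjis, cp932) in compare().items():
        print(f"0x{octets:04X}  shift_jis U+{sjis:04X}  cp932 U+{cp932:04X}")
```

Text that is decoded with one converter and encoded again with the other may therefore fail to round-trip for these characters.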

D Conversion tables for Shift-JIS and Japanese EUC (Non-Normative)

Conversion tables represented in the format of [Character Mapping Tables] are referenced from this appendix.

E Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)

Charsets x-eucjp-unicode-0.9 and x-eucjp-jisx0221-1995 cover only the characters defined by [US-ASCII], [JIS X 0208], and [JIS X 0212], and the Halfwidth Katakana defined by [JIS X 0201]. In addition to these characters, the conversion tables defined by the OSF Japanese Vendors Council (OSF/JVC) [TOG/JVC CDE/Motif TOG/JVC CDE/Motif Technical WG], namely x-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201, and x-eucjp-open-19970715-ascii, cover IBM extension characters (0x8FF3F3-0x8FF4FE), NEC special characters (0xADA1-0xADFC), and user-defined characters (0xF5A1-0xFEFE and 0x8FF5A1-0x8FFEFE).

E.1 Range 0x20-0x7E ([US-ASCII] or [JIS X 0201])

As defined in the original specification of the Japanese EUC, the range 0x20-0x7E of x-eucjp-unicode-0.9, x-eucjp-jisx0221-1995, x-eucjp-open-19970715-ms, and x-eucjp-open-19970715-ascii is assumed to be [US-ASCII]. Charset x-eucjp-open-19970715-0201 is an exception, since this range is assumed to be [JIS X 0201] and converted as below:

Conversion in x-eucjp-open-19970715-0201
Octets in EUC UCS code point
0x5C(REVERSE SOLIDUS) U+00A5(YEN SIGN)
0x7E(TILDE) U+203E(OVERLINE)

E.2 Range 0x8EA1-0x8EDF(Halfwidth Katakana)

Halfwidth Katakana in [JIS X 0201] are allocated to the range 0x8EA1-0x8EDF. All of the conversion tables convert these code points to the range U+FF61-U+FF9F in the compatibility area of [ISO/IEC10646] and [Unicode 3.0].
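
Because both ranges are contiguous and of equal length, this conversion reduces to a fixed offset. The following sketch shows the arithmetic. The function name halfwidth_katakana_ucs is hypothetical, and the cross-check at the end relies on Python's euc_jp codec, which is an assumption of this example rather than one of the conversion tables discussed here.

```python
def halfwidth_katakana_ucs(second_octet):
    """UCS code point for the Japanese EUC sequence 0x8E <second_octet>."""
    if not 0xA1 <= second_octet <= 0xDF:
        raise ValueError("second octet must be in 0xA1..0xDF")
    # 0x8EA1 maps to U+FF61; later octets follow in order.
    return 0xFF61 + (second_octet - 0xA1)


if __name__ == "__main__":
    # Cross-check the offset rule against the euc_jp codec.
    for b in range(0xA1, 0xE0):
        assert halfwidth_katakana_ucs(b) == ord(bytes([0x8E, b]).decode("euc_jp"))
    print(f"U+{halfwidth_katakana_ucs(0xDF):04X}")
```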

E.3 Ambiguities in encoding the characters of [JIS X 0208] and [JIS X 0212]

As for those characters specified by [JIS X 0208] and [JIS X 0212], charsets x-eucjp-jisx0221-1995 and x-eucjp-open-19970715-0201 provide an identical mapping.

Charset x-eucjp-unicode-0.9 differs from these two charsets only in the mapping of 0xA1BD(EM DASH/DASH(FULLWIDTH)).

Charset x-eucjp-open-19970715-ascii differs in the mapping of four code points, namely 0xA1B1(OVERLINE/NEGATION SIGN), 0xA1C0(REVERSE SOLIDUS), 0xA1EF(YEN SIGN), and 0x8FA2B7(TILDE).

Charset x-eucjp-open-19970715-ms is different in the mapping of 0xA1BD(FULLWIDTH EM DASH), 0xA1C0(REVERSE SOLIDUS), 0xA1C1(WAVE DASH), 0xA1C2(DOUBLE VERTICAL LINE), 0xA1DD(MINUS SIGN), 0xA1F1(CENT SIGN), 0xA1F2(POUND SIGN), 0xA2CC(NOT SIGN), 0x8FA2B7(TILDE), and 0x8FA2C3(BROKEN BAR).

E.4 Ambiguities in IBM Extended Characters, NEC Special Characters, and User-defined Characters

Charsets x-eucjp-open-19970715-ms, x-eucjp-open-19970715-0201, and x-eucjp-open-19970715-ascii contain IBM extended characters, NEC special characters, and (the area of) user-defined characters. These charsets provide an identical mapping of these characters. Charsets x-eucjp-unicode-0.9 and x-eucjp-jisx0221-1995, on the other hand, do not contain these characters.

Ambiguities in conversion from Japanese EUC to [Unicode 3.0]/[ISO/IEC10646](1)
Octets in Japanese EUC x-eucjp-unicode-0.9 x-eucjp-jisx0221-1995 x-eucjp-open-19970715-ms
0x5C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS)
0x7E(TILDE) U+007E(TILDE) U+007E(TILDE) U+007E(TILDE)
0xA1B1(OVERLINE/NEGATION SIGN) U+FFE3(FULLWIDTH MACRON) U+FFE3(FULLWIDTH MACRON) U+FFE3(FULLWIDTH MACRON)
0xA1BD(FULLWIDTH EM DASH) U+2015(HORIZONTAL BAR) U+2014(EM DASH) U+2015(HORIZONTAL BAR)
0xA1C0(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0xA1C1(WAVE DASH) U+301C(WAVE DASH) U+301C(WAVE DASH) U+FF5E(FULLWIDTH TILDE)
0xA1C2(DOUBLE VERTICAL LINE) U+2016(DOUBLE VERTICAL LINE) U+2016(DOUBLE VERTICAL LINE) U+2225(PARALLEL TO)
0xA1DD(MINUS SIGN) U+2212(MINUS SIGN) U+2212(MINUS SIGN) U+FF0D(FULLWIDTH HYPHEN-MINUS)
0xA1EF(YEN SIGN) U+FFE5(FULLWIDTH YEN SIGN) U+FFE5(FULLWIDTH YEN SIGN) U+FFE5(FULLWIDTH YEN SIGN)
0xA1F1(CENT SIGN) U+00A2(CENT SIGN) U+00A2(CENT SIGN) U+FFE0(FULLWIDTH CENT SIGN)
0xA1F2(POUND SIGN) U+00A3(POUND SIGN) U+00A3(POUND SIGN) U+FFE1(FULLWIDTH POUND SIGN)
0xA2CC(NOT SIGN) U+00AC(NOT SIGN) U+00AC(NOT SIGN) U+FFE2(FULLWIDTH NOT SIGN)
0x8FA2B7(TILDE) U+007E(TILDE) U+007E(TILDE) U+FF5E(FULLWIDTH TILDE)
0x8FA2C3(BROKEN BAR) U+00A6(BROKEN BAR) U+00A6(BROKEN BAR) U+FFE4(FULLWIDTH BROKEN BAR)
Ambiguities in conversion from Japanese EUC to [Unicode 3.0]/[ISO/IEC10646](2)
Octets in Japanese EUC x-eucjp-open-19970715-0201 x-eucjp-open-19970715-ascii
0x5C(REVERSE SOLIDUS) U+00A5(YEN SIGN) U+005C(REVERSE SOLIDUS)
0x7E(TILDE) U+203E(OVERLINE) U+007E(TILDE)
0xA1B1(OVERLINE/NEGATION SIGN) U+FFE3(FULLWIDTH MACRON) U+203E(OVERLINE)
0xA1BD(FULLWIDTH EM DASH) U+2014(EM DASH) U+2014(EM DASH)
0xA1C0(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0xA1C1(WAVE DASH) U+301C(WAVE DASH) U+301C(WAVE DASH)
0xA1C2(DOUBLE VERTICAL LINE) U+2016(DOUBLE VERTICAL LINE) U+2016(DOUBLE VERTICAL LINE)
0xA1DD(MINUS SIGN) U+2212(MINUS SIGN) U+2212(MINUS SIGN)
0xA1EF(YEN SIGN) U+FFE5(FULLWIDTH YEN SIGN) U+00A5(YEN SIGN)
0xA1F1(CENT SIGN) U+00A2(CENT SIGN) U+00A2(CENT SIGN)
0xA1F2(POUND SIGN) U+00A3(POUND SIGN) U+00A3(POUND SIGN)
0xA2CC(NOT SIGN) U+00AC(NOT SIGN) U+00AC(NOT SIGN)
0x8FA2B7(TILDE) U+007E(TILDE) U+FF5E(FULLWIDTH TILDE)
0x8FA2C3(BROKEN BAR) U+00A6(BROKEN BAR) U+00A6(BROKEN BAR)
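
The statements in E.3 can be cross-checked against the two tables above. The sketch below transcribes the [JIS X 0208] and [JIS X 0212] rows of those tables (the rows for 0x5C and 0x7E are omitted) and lists the octets on which a charset disagrees with x-eucjp-jisx0221-1995. It is an illustration only: the names CHARSETS, ROWS and differing_octets are hypothetical, and the data is copied from this appendix, not from the conversion tables themselves.

```python
# Column order of each tuple in ROWS.
CHARSETS = (
    "x-eucjp-unicode-0.9",
    "x-eucjp-jisx0221-1995",
    "x-eucjp-open-19970715-ms",
    "x-eucjp-open-19970715-0201",
    "x-eucjp-open-19970715-ascii",
)

# Japanese EUC octets -> UCS code point per charset, from the tables above.
ROWS = {
    0xA1B1: (0xFFE3, 0xFFE3, 0xFFE3, 0xFFE3, 0x203E),
    0xA1BD: (0x2015, 0x2014, 0x2015, 0x2014, 0x2014),
    0xA1C0: (0x005C, 0x005C, 0xFF3C, 0x005C, 0xFF3C),
    0xA1C1: (0x301C, 0x301C, 0xFF5E, 0x301C, 0x301C),
    0xA1C2: (0x2016, 0x2016, 0x2225, 0x2016, 0x2016),
    0xA1DD: (0x2212, 0x2212, 0xFF0D, 0x2212, 0x2212),
    0xA1EF: (0xFFE5, 0xFFE5, 0xFFE5, 0xFFE5, 0x00A5),
    0xA1F1: (0x00A2, 0x00A2, 0xFFE0, 0x00A2, 0x00A2),
    0xA1F2: (0x00A3, 0x00A3, 0xFFE1, 0x00A3, 0x00A3),
    0xA2CC: (0x00AC, 0x00AC, 0xFFE2, 0x00AC, 0x00AC),
    0x8FA2B7: (0x007E, 0x007E, 0xFF5E, 0x007E, 0xFF5E),
    0x8FA2C3: (0x00A6, 0x00A6, 0xFFE4, 0x00A6, 0x00A6),
}


def differing_octets(charset, baseline="x-eucjp-jisx0221-1995"):
    """List the octets that charset and baseline map to different code points."""
    i, j = CHARSETS.index(charset), CHARSETS.index(baseline)
    return [octets for octets, row in ROWS.items() if row[i] != row[j]]


if __name__ == "__main__":
    for name in CHARSETS:
        diffs = ", ".join(f"0x{o:X}" for o in differing_octets(name))
        print(f"{name}: {diffs or 'identical'}")
```

With this data, x-eucjp-open-19970715-0201 shows no differences, x-eucjp-unicode-0.9 differs in one code point, x-eucjp-open-19970715-ascii in four, and x-eucjp-open-19970715-ms in ten, which agrees with the text of E.3.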

F Charset parameters in HTTP/1.1 (Non-Normative)

As is widely known, Japanese HTML documents are often misdecoded, because almost all of them lack a correct charset parameter for the media type "text/html". Refer to [Murata et al.] for further information.

To avoid misdecoding, specify the correct charset parameter. Do not use servers that do not provide the charset parameter.
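
On the receiving side, a program can refuse to guess when the parameter is absent. The sketch below is one possible check and not a prescribed procedure: it parses a Content-Type value with the standard email.message module of Python, and the function names charset_of and decode_body are hypothetical.

```python
from email.message import Message


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body, content_type):
    """Decode only when a charset is declared; refuse to guess otherwise."""
    charset = charset_of(content_type)
    if charset is None:
        raise ValueError("no charset parameter; the encoding would have to be guessed")
    return body.decode(charset)


if __name__ == "__main__":
    print(charset_of("text/html; charset=Shift_JIS"))
    print(charset_of("text/html"))
```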

G Non-normative references (Non-Normative)

Murata et al.
Murata, Dürst, Nicol. Recommendation for the charset parameter: a mechanism for specifying character encoding schemes, http://www.fxis.co.jp/DMS/sgml/html_correct_charset.html , 1998.
TOG/JVC CDE/Motif TOG/JVC CDE/Motif Technical WG
TOG/JVC CDE/Motif TOG/JVC CDE/Motif Technical WG. Problems and Solutions for Unicode and User/Vendor Defined Characters, OSF Japanese Vendors Council (OSF/JVC), http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html , 1996.
UNIX International
UNIX International. UNIX SYSTEM V Release 4 Nihongo Kankyou Kyoutuu Kiyaku (Common specifications for the Japanese computing environment), Toppan, 1992.