
Frequently Asked Questions

UTF-8, UTF-16, UTF-32 & BOM

General questions, relating to UTF or Encoding Forms
Q: Is Unicode a 16-bit encoding?

A: No. The first version of Unicode was a 16-bit encoding, from 1991 to 1995, but starting with Unicode 2.0 (July, 1996), it has not been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which is roughly a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.
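To make the size difference concrete, here is a small illustration (not part of the original FAQ text) showing one supplementary character, U+10400, in each of the three encoding forms; the array names are invented for the example.

#include <stdint.h>

// U+10400 (DESERET CAPITAL LETTER LONG I) in each encoding form
static const uint8_t  u10400_utf8[]  = { 0xF0, 0x90, 0x90, 0x80 };  // four 8-bit bytes
static const uint16_t u10400_utf16[] = { 0xD801, 0xDC00 };          // two 16-bit code units
static const uint32_t u10400_utf32[] = { 0x00010400 };              // one 32-bit code unit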
Q: Can Unicode text be represented in more than one way?

A: Yes, there are several possible representations of Unicode data, including UTF-8, UTF-16 and UTF-32. In addition, there are compression transformations such as the one described in the Unicode Technical Standard #6: A Standard Compression Scheme for Unicode.
Q: What is a UTF?

A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.

Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must also map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 noncharacters (including FFFE and FFFF), as well as unpaired surrogates.

The SCSU compression method, even though it is reversible, is not a UTF because the same string can map to very many different byte sequences, depending on the particular SCSU compressor. [AF]
Q: Where can I get more information on encoding forms?

A: For the formal definition of UTFs see Section 3.9, Unicode Encoding Forms in the Unicode Standard. For more information on encoding forms see Unicode Technical Report UTR #17: Unicode Character Encoding Model. [AF]

Q: How do I write a UTF converter?

A: The freely available open source project International Components for Unicode (ICU) has UTF conversion built into it. The latest version may be downloaded from the ICU Project web site. [AF]

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx (binary) must be followed with a byte of the form 10xxxxxx. A sequence such as <110xxxxx 0xxxxxxx> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx.

A conformant process must not interpret illegal or ill-formed byte sequences as characters; however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
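As an illustration only (this sketch is not part of the FAQ; the function name and the choice to emit U+FFFD are assumptions), a decoder might handle the two-byte case described above like this. A real decoder must of course cover every ill-formed case listed in Table 3-7.

#include <stddef.h>
#include <stdint.h>

/* Minimal sketch: decode one code point from p (len >= 1), store it in *out,
   and return the number of bytes consumed. Only the two-byte lead case from
   the paragraph above is handled; the 3- and 4-byte cases are omitted. */
static size_t decode_one(const uint8_t *p, size_t len, uint32_t *out) {
    if ((p[0] & 0xE0) == 0xC0) {                     // lead byte of the form 110xxxxx
        if (len >= 2 && (p[1] & 0xC0) == 0x80) {     // trail byte of the form 10xxxxxx
            *out = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
            return 2;
        }
        *out = 0xFFFD;   // illegal termination: substitute REPLACEMENT CHARACTER...
        return 1;        // ...and continue processing at the next byte
    }
    *out = p[0];         // ASCII byte (all other cases omitted in this sketch)
    return 1;
}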
Q: Which of the UTFs do I need to support?

A: UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-32 is used by various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing. [AF]
Q: What are some of the differences between the UTFs?

A: The following table summarizes some of the properties of each of the UTFs.

Name UTF-8 UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE
Smallest code point 0000 0000 0000 0000 0000 0000 0000
Largest code point 10FFFF 10FFFF 10FFFF 10FFFF 10FFFF 10FFFF 10FFFF
Code unit size 8 bits 16 bits 16 bits 16 bits 32 bits 32 bits 32 bits
Byte order N/A <BOM> big-endian little-endian <BOM> big-endian little-endian
Fewest bytes per character 1 2 2 2 4 4 4
Most bytes per character 4 4 4 4 4 4 4

In the table <BOM> indicates that the byte order is determined by a byte order mark, if present at the beginning of the data stream, otherwise it is big-endian. [AF]

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

A: UTF-16 and UTF-32 use code units that are two and four bytes long respectively. For
these UTFs, there are three sub-flavors: BE, LE and unmarked. The BE form uses big-
endian byte serialization (most significant byte first), the LE form uses little-endian
byte serialization (least significant byte first) and the unmarked form uses big-endian
byte serialization by default, but may include a byte order mark at the beginning to
indicate the actual byte serialization used. [AF]

Q: Is there a standard method to package a Unicode character so it fits an 8-bit ASCII stream?

A: There are three or four options for making Unicode fit into an 8-bit format.

a) Use UTF-8. This preserves ASCII, but not Latin-1, because the characters >127 are different from Latin-1. UTF-8 uses the bytes in the ASCII range only for ASCII characters. Therefore, it works well in any environment where ASCII characters have a significance as syntax characters, e.g. file name syntaxes, markup languages, etc., but where all other characters may use arbitrary bytes.
Example: “Latin Small Letter s with Acute” (015B) would be encoded as two bytes: C5 9B. (A minimal sketch of this computation appears after this list.)


b) Use Java or C style escapes, of the form \uXXXXX or \xXXXXX. This format is not
standard for text files, but well defined in the framework of the languages in question,
primarily for source files.
Example: The Polish word “wyjście” with character “Latin Small Letter s with
Acute” (015B) in the middle (ś is one character) would look like: “wyj\u015Bcie".

c) Use the &#xXXXX; or &#DDDDD; numeric character escapes as in HTML or XML. Again,
these are not standard for plain text files, but well defined within the framework of
these markup languages.
Example: “wyjście” would look like “wyj&#x015B;cie"

d) Use SCSU. This format compresses Unicode into 8-bit format, preserving most of
ASCII, but using some of the control codes as commands for the decoder. However,
while ASCII text will look like ASCII text after being encoded in SCSU, other characters
may occasionally be encoded with the same byte values, making SCSU unsuitable for 8-
bit channels that blindly interpret any of the bytes as ASCII characters.
Example: “<SC2> wyjÛcie” where <SC2> indicates the byte 0x12 and “Û” corresponds to
byte 0xDB. [AF]
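The following sketch (mine, not the FAQ’s) shows the arithmetic behind the two-byte example given under option a); it assumes a code point in the two-byte UTF-8 range U+0080..U+07FF.

#include <stdint.h>
#include <stdio.h>

// Sketch for option a): encode a code point in U+0080..U+07FF as two UTF-8 bytes.
int main(void) {
    uint32_t cp = 0x015B;                        // LATIN SMALL LETTER S WITH ACUTE
    uint8_t b1 = (uint8_t)(0xC0 | (cp >> 6));    // 110xxxxx: top five bits
    uint8_t b2 = (uint8_t)(0x80 | (cp & 0x3F));  // 10xxxxxx: low six bits
    printf("%02X %02X\n", b1, b2);               // prints: C5 9B
    return 0;
}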

Q: Which of these approaches is the best?

A: That depends on the circumstances: Of these four approaches, d) uses the least
space, but cannot be used transparently in most 8-bit environments. a) is the most
widely supported in plain text files and b) and c) use the most space, but are widely
supported for program source files in Java and C, or within HTML and XML files
respectively. [AF]

Q: Which of these formats is the most standard?

A: All four require that the receiver can understand that format, but a) is considered
one of the three equivalent Unicode Encoding Forms and therefore standard. The use of
b), or c) out of their given context would definitely be considered non-standard, but
could be a good solution for internal data transmission. The use of SCSU is itself a
standard (for compressed data streams) but few general purpose receivers support
SCSU, so it is again most useful in internal data transmission. [AF]

UTF-8 FAQ

Q: What is the definition of UTF-8?

A: UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition,
see Section 2.5 “Encoding Forms” and Section 3.9 “ Unicode Encoding Forms ” in the
Unicode Standard. See, in particular, Table 3-6 UTF-8 Bit Distribution and Table 3-7
Well-formed UTF-8 Byte Sequences, which give succinct summaries of the encoding
form. Make sure you refer to the latest version of the Unicode Standard, as the Unicode
Technical Committee has tightened the definition of UTF-8 over time to more strictly
enforce unique sequences and to prohibit encoding of certain invalid characters. There
is an Internet RFC 3629 about UTF-8. UTF-8 is also defined in Annex D of ISO/IEC 10646.
See also the question above, How do I write a UTF converter?

Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying
processor is little endian or big endian?

A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings; it has nothing to do with byte order. [AF]

Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying
system uses ASCII or EBCDIC encoding?

A: There is only one definition of UTF-8. It is precisely the same, whether the data were converted from ASCII or EBCDIC based character sets. However, byte sequences from standard UTF-8 won’t interoperate well in an EBCDIC system, because of the different arrangements of control codes between ASCII and EBCDIC. Unicode Technical Report #16: UTF-EBCDIC defines a specialized UTF that will interoperate in EBCDIC systems. [AF]

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one four-byte sequence or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four-byte sequence. However, there is a widespread practice of generating pairs of three-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-bit (CESU) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it was UTF-8, due to the similarity of the formats. [AF]
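As a concrete sketch (the code and the worked values are mine, not the FAQ’s), here is <D800 DC00>, i.e. U+10000, converted to the single conformant four-byte sequence. The non-conformant CESU-8 form of the same pair would instead be the six bytes ED A0 80 ED B0 80.

#include <stdint.h>
#include <stdio.h>

// Sketch: combine a surrogate pair into a code point, then emit one 4-byte UTF-8 sequence.
int main(void) {
    uint16_t hi = 0xD800, lo = 0xDC00;
    uint32_t cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    uint8_t utf8[4] = {
        (uint8_t)(0xF0 |  (cp >> 18)),
        (uint8_t)(0x80 | ((cp >> 12) & 0x3F)),
        (uint8_t)(0x80 | ((cp >>  6) & 0x3F)),
        (uint8_t)(0x80 |  (cp        & 0x3F)),
    };
    printf("%02X %02X %02X %02X\n", utf8[0], utf8[1], utf8[2], utf8[3]);  // F0 90 80 80
    return 0;
}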

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?

A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an unpaired surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter must treat this as an error. [AF]

UTF-16 FAQ

Q: What is UTF-16?

A: UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16. [AF]

Q: What are surrogates?

A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800 to DBFF, and trailing, or low, surrogates are from DC00 to DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
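A few one-line helpers (the names are mine, not from the FAQ) make these ranges explicit:

#include <stdbool.h>
#include <stdint.h>

static bool is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }  // leading
static bool is_low_surrogate (uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }  // trailing
static bool is_surrogate     (uint16_t u) { return u >= 0xD800 && u <= 0xDFFF; }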


Q: What’s the algorithm to convert from UTF-16 to character codes?

A: The Unicode Standard used to contain a short algorithm; now there is just a bit
distribution table. Here are three short code snippets that translate the information
from the bit distribution table into C code that will convert to and from UTF-16.

Using the following type definitions

#include <stdint.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

the first snippet calculates the high (or leading) surrogate from a character code C.
const UTF16 HI_SURROGATE_START = 0xD800;

UTF16 X = (UTF16) C;                     // low 16 bits of the code point
UTF32 U = (C >> 16) & ((1 << 5) - 1);    // uuuuu: the top 5 bits
UTF16 W = (UTF16) (U - 1);               // wwww = uuuuu - 1
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | (X >> 10);

where X, U and W correspond to the labels used in Table 3-4 UTF-16 Bit Distribution.
The next snippet does the same for the low surrogate.
const UTF16 LO_SURROGATE_START = 0xDC00;

UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));

Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting
character
UTF32 X = (hi & ((1 << 6) -1)) << 10 | lo & ((1 << 10) -1);
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;

UTF32 C = U << 16 | X;

A caller would need to ensure that C, hi, and lo are in the appropriate ranges. [AF]

Q: Isn’t there a simpler way to do this?

A: There is a much simpler computation that does not try to follow the bit distribution
table.
// constants
const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

// computations
UTF16 lead = LEAD_OFFSET + (codepoint >> 10);
UTF16 trail = 0xDC00 + (codepoint & 0x3FF);

UTF32 codepoint = (lead << 10) + trail + SURROGATE_OFFSET;

[MD]
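A quick self-contained check of these formulas (the test value U+10400 and its expected pair <D801 DC00> are my additions, not from the FAQ):

#include <assert.h>
#include <stdint.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

int main(void) {
    const UTF32 LEAD_OFFSET      = 0xD800 - (0x10000 >> 10);
    const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;

    UTF32 cp    = 0x10400;
    UTF16 lead  = (UTF16)(LEAD_OFFSET + (cp >> 10));   // 0xD801
    UTF16 trail = (UTF16)(0xDC00 + (cp & 0x3FF));      // 0xDC00
    UTF32 back  = ((UTF32)lead << 10) + trail + SURROGATE_OFFSET;

    assert(lead == 0xD801 && trail == 0xDC00 && back == cp);  // round trip succeeds
    return 0;
}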

Q: Why are some people opposed to UTF-16?

A: People familiar with variable width East Asian character sets such as Shift-JIS (SJIS)
are understandably nervous about UTF-16, which sometimes requires two code units to
represent a single character. They are well acquainted with the problems that variable-
width codes have caused. However, there are some important differences between the
mechanisms used in SJIS and UTF-16:


Overlap:

• In SJIS, there is overlap between the leading and trailing code unit values, and
between the trailing and single code unit values. This causes a number of problems:
■ It causes false matches. For example, searching for an “a” may match against the
trailing code unit of a Japanese character.
■ It prevents efficient random access. To know whether you are on a character
boundary, you have to search backwards to find a known boundary.
■ It makes the text extremely fragile. If a unit is dropped from a leading-trailing
code unit pair, many following characters can be corrupted.
• In UTF-16, the code point ranges for high and low surrogates, as well as for single
units are all completely disjoint. None of these problems occur:
■ There are no false matches.
■ The location of the character boundary can be directly determined from each
code unit value.
■ A dropped surrogate will corrupt only a single character.

Frequency:

• The vast majority of SJIS characters require 2 units, but characters using single units
occur commonly and often have special importance, for example in file names.
• With UTF-16, relatively few characters require 2 units. The vast majority of
characters in common use are single code units. Even in East Asian text, the
incidence of surrogate pairs should be well less than 1% of all text storage on
average. (Certain documents, of course, may have a higher incidence of surrogate
pairs, just as phthisique is a fairly infrequent word in English, but may occur quite
often in a particular scholarly text.) [AF]

Q: Will UTF-16 ever be extended to more than a million characters?

A: No. Both Unicode and ISO 10646 have policies in place that formally limit future
code assignment to the integer range that can be expressed with current UTF-16 (0 to
1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger
integers, these policies mean that all encoding forms will always represent the same
set of characters. Over a million possible codes is far more than enough for the goal of
Unicode of encoding characters, not glyphs. Unicode is not designed to encode
arbitrary data. If you wanted, for example, to give each “instance of a character on
paper throughout history” its own code, you might need trillions or quadrillions of such
codes; noble as this effort might be, you would not use Unicode for such an
encoding. [AF]

Q: Are there any 16-bit values that are invalid?

A: The two values FFFE and FFFF as well as the 32 values from FDD0 to FDEF represent noncharacters. They are invalid in interchange, but may be freely used internal to an implementation. Unpaired surrogates are invalid as well, i.e. any value in the range D800 to DBFF not followed by a value in the range DC00 to DFFF, or any value in the range DC00 to DFFF not preceded by a value in the range D800 to DBFF. [AF]
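A sketch (the function is mine, not from the FAQ) that scans a buffer of 16-bit code units for exactly the problems listed above, i.e. BMP noncharacters and unpaired surrogates:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: return true if the buffer contains no unpaired surrogates and no
   BMP noncharacters (FDD0..FDEF, FFFE, FFFF). Supplementary noncharacters,
   which are reached through surrogate pairs, are not checked here. */
static bool utf16_units_ok(const uint16_t *s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];
        if ((u >= 0xFDD0 && u <= 0xFDEF) || u == 0xFFFE || u == 0xFFFF)
            return false;                              // noncharacter
        if (u >= 0xD800 && u <= 0xDBFF) {              // high surrogate...
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;                          // ...not followed by a low one
            i++;                                       // skip the low surrogate
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;                              // low surrogate with no lead
        }
    }
    return true;
}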

Q: Are there any paired surrogates that are invalid?

A: Some code points are designated as noncharacters. They are invalid in interchange,
but may be freely used internal to an implementation. For the 32 noncharacters that
are supplementary characters, the corresponding surrogate pairs are listed below.


UTF-16     UTF-8        UCS-4
D83F DFF*  F0 9F BF B*  0001FFF*
D87F DFF*  F0 AF BF B*  0002FFF*
D8BF DFF*  F0 BF BF B*  0003FFF*
D8FF DFF*  F1 8F BF B*  0004FFF*
D93F DFF*  F1 9F BF B*  0005FFF*
D97F DFF*  F1 AF BF B*  0006FFF*
...
DBBF DFF*  F3 BF BF B*  000FFFF*
DBFF DFF*  F4 8F BF B*  0010FFF*
* = E or F

Surrogate pairs that refer to unassigned characters should not occur in data that you
generate, but may legitimately occur in data from a system that’s conformant to a
later version of the Unicode Standard.

Q: Since the surrogate pairs will be rare, does that mean I can dispense with them?

A: Just because the characters are rare does not mean that they should be neglected.
It will become even more important to support surrogate pairs in the future as they
become more widely used for minor scripts, mathematics, and rare Han characters. The
fact that the characters are rare can be taken into account when optimizing code and
storage, however.

Q: When will most implementations of Unicode support surrogates?

A: A growing number of implementations support surrogates, including Windows XP and Microsoft Office. Although Java does not yet support surrogates, there is a set of utilities in ICU4J that provides surrogate support at a low level; Java 1.4 will support line layout for surrogates.

UTF-32 FAQ

Q: What is UTF-32?

A: Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. UTF-32 is a subset of the encoding mechanism called UCS-4 in ISO 10646. For more information, see Section 3.9 “Unicode Encoding Forms” of The Unicode Standard. [AF]

Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory?

A: This depends. If you frequently need to access APIs that require string parameters to
be in UTF-32, it may be more convenient to work with UTF-32 strings all the time.
However, the downside of UTF-32 is that it forces you to use 32-bits for each character,
when only 21 bits are ever needed. The number of significant bits needed for the
average character in common texts is much lower, making the ratio effectively that
much worse. In many situations that does not matter, and the convenience of having a
fixed number of code units per character can be the deciding factor.


Increasing the storage for the same number of characters does have its cost in applications dealing with large volumes of text data: it can mean exhausting cache limits sooner; it can result in noticeably increased read/write times or in reaching bandwidth limits; and it requires more space for storage. What a number of implementations do is to represent strings with UTF-16, but individual character values with UTF-32; for an example of the latter, see below.

The chief selling point for Unicode is providing a representation for all the world’s
characters, eliminating the need for juggling multiple character sets and avoiding the
associated data corruption problems. These features were enough to swing industry to
the side of using Unicode (UTF-16). While a UTF-32 representation does make the
programming model somewhat simpler, the increased average storage size has real
drawbacks, making a complete transition to UTF-32 less compelling. [AF]

Q: How about using UTF-32 interfaces in my APIs?

A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low-level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels.

If it’s ever necessary to locate the nth character, indexing by character can be implemented as a high-level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as characters, instead of code units, resulted in a 10× degradation. While there are some interesting optimizations that can be performed, it will always be slower on average. Therefore locating other boundaries, such as grapheme, word, line or sentence boundaries, proceeds directly from the code unit index, not indirectly via an intermediate character code index.
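As a sketch of why such a conversion needs a scan (the function is mine, not from the FAQ): counting the code points in the first limit code units means walking the buffer and skipping trail surrogates.

#include <stddef.h>
#include <stdint.h>

/* Sketch: number of code points in the first 'limit' code units of a UTF-16 buffer.
   Illustrates why code unit / character index conversion is a linear operation. */
static size_t count_code_points(const uint16_t *s, size_t limit) {
    size_t count = 0;
    for (size_t i = 0; i < limit; i++) {
        // A trail (low) surrogate continues the preceding code point, so don't count it.
        if (s[i] < 0xDC00 || s[i] > 0xDFFF)
            count++;
    }
    return count;
}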

Q: Doesn’t it cause a problem to have only UTF-16 string APIs, instead of UTF-32 char
APIs?

A: Almost all international functions (upper-, lower-, titlecasing, case folding, drawing,
measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take
string parameters in the API, not single code-points (UTF-32). Single code-point APIs
almost always produce the wrong results except for very simple languages, either
because you need more context to get the right answer, or because you need to
generate a sequence of characters to return the right answer, or both.

For example, any Unicode-compliant collation (See Unicode Technical Standard #10:
Unicode Collation Algorithm (UCA)) must be able to handle sequences of more than
one code-point, and treat that sequence as a single entity. Trying to collate by handling
single code-points at a time, would get the wrong answer. The same will happen for
drawing or measuring text a single code-point at a time; because scripts like Arabic are
contextual, the width of x plus the width of y is not equal to the width of xy. Once you
get beyond basic typography, the same is true for English as well; because of kerning
and ligatures the width of “fi” in the font may be different than the width of “f” plus
the width of “i". Casing operations must return strings, not single code-points; see
http://www.unicode.org/charts/case/ . In particular, the title casing operation
requires strings as input, not single code-points at a time.

Storing a single code point in a struct or class instead of a string would exclude support
for graphemes, such as “ch” for Slovak, where a single code point may not be
sufficient, but a character sequence is needed to express what is required. In other


words, most API parameters and fields of composite data types should not be defined
as a character, but as a string. And if they are strings, it does not matter what the
internal representation of the string is.

Given that any industrial-strength text and internationalization support API has to be
able to handle sequences of characters, it makes little difference whether the string is
internally represented by a sequence of UTF-16 code units, or by a sequence of code-
points ( = UTF-32 code units). Both UTF-16 and UTF-8 are designed to make working
with substrings easy, by the fact that the sequence of code units for a given code point
is unique. [AF]

Q: Are there exceptions to the rule of exclusively using string parameters in APIs?

A: The main exceptions are very low-level operations such as getting character
properties (e.g. General Category or Canonical Class in the UCD). For those it is handy
to have interfaces that convert quickly to and from UTF-16 and UTF-32, and that allow
you to iterate through strings returning UTF-32 values (even though the internal format
is UTF-16).

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-32? As one 4-byte sequence or as two 4-byte sequences?

A: The definition of UTF-32 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence.

Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?

A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. [AF]

Byte Order Mark (BOM) FAQ

Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a
data stream, where it can be used as a signature defining the byte order and encoding
form, primarily of unmarked plaintext files. Under some higher level protocols, use of a
BOM may be mandatory (or prohibited) in the Unicode data stream defined in that
protocol. [AF]

Q: Where is a BOM useful?

A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format. It can also serve as a hint indicating that the file is in Unicode, as opposed to a legacy encoding; furthermore, it acts as a signature for the specific encoding form used. [AF]

Q: What does ‘endian’ mean?

A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged, bytes that appear in the "correct" order on the sending system may appear to be out of order on the receiving system. In that situation, a BOM would look like 0xFFFE, which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the datastream as UTF-8. [AF]

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is
transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be
whatever the Unicode character FEFF is converted into by that transformation format.
In that form, the BOM serves to indicate both that it is a Unicode file, and which of the
formats it is in. Examples:

Bytes        Encoding Form
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8
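A sketch (type and function names are mine, not from the FAQ) of using this table to sniff the encoding form from the first bytes of a stream; note that the four-byte UTF-32 little-endian signature has to be tested before the two-byte UTF-16 one, because it starts with the same FF FE.

#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16BE, ENC_UTF16LE, ENC_UTF32BE, ENC_UTF32LE } Enc;

// Sketch: map an initial byte order mark to an encoding form, if one is present.
static Enc sniff_bom(const uint8_t *p, size_t n) {
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return ENC_UTF32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return ENC_UTF32LE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)                 return ENC_UTF8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)                                 return ENC_UTF16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)                                 return ENC_UTF16LE;
    return ENC_UNKNOWN;  // no BOM: the text may still be Unicode in any form
}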

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can
I still assume the remaining UTF-8 bytes are in big-endian order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness
of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used
as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note
that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts. [AF]

Q: What should I do with U+FEFF in the middle of a file?

A: In the absence of a protocol supporting its use as a BOM and when not at the
beginning of a text stream, U+FEFF should normally not occur. For backwards
compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and
is then part of the content of the file or string. The use of U+2060 WORD JOINER is
strongly preferred over ZWNBSP for expressing word joining semantics since it cannot
be confused with a BOM. When designing a markup language or data protocol, the use
of U+FEFF can be restricted to that of Byte Order Mark. In that case, any FEFF
occurring in the middle of a file can be treated as an unsupported character. [AF]

Q: I am using a protocol that has BOM at the start of text. How do I represent an
initial ZWNBSP?

A: Use U+2060 WORD JOINER instead.

Q: How do I tag data that does not interpret FEFF as a BOM?

A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate
little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16. [MD]


Q: Why wouldn’t I always use a protocol that requires a BOM?

A: Where the data is typed, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any FEFF would be interpreted as a ZWNBSP.

Do not tag every string in a database or set of fields with a BOM, since it wastes space
and complicates string concatenation. Moreover, it also means two data fields may
have precisely the same content, but not be binary-equal (where one is prefaced by a
BOM).

Q: How should I deal with BOMs?

A: Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of
the BOM on certain Unicode data streams, such as files. When you need to conform
to such a protocol, use a BOM.
2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
■ Where a text data stream is known to be plain text, but of unknown encoding,
BOM can be used as a signature. If there is no BOM, the encoding could be
anything.
■ Where a text data stream is known to be plain Unicode text (but not which
endian), then BOM can be used as a signature. If there is no BOM, the text should
be interpreted as big-endian.
3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If
UTF-8 is used with these protocols, use of the BOM as encoding form signature
should be avoided.
4. Where the precise type of the data stream is known (e.g. Unicode big-endian or
Unicode little-endian), the BOM should not be used. In particular, whenever a data
stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must
not be used. (See also Q: What is the difference between UCS-2 and UTF-16?.) [AF]

Last updated: Friday, August 20, 2010, 18:02:31
