Unicode CPP PDF
Unicode in C++
James McNellis (@JamesMcNellis)
Senior Software Development Engineer
Microsoft Visual C++
Before there was Unicode…
Single-Byte Encodings
ASCII
!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
ASCII
H e l l o ! \0
↓ ↓ ↓ ↓ ↓ ↓ ↓
48 65 6C 6C 6F 21 00
ASCII
ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒ
áíóúñÑªº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■
ISO/IEC 8859
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ISO/IEC 8859-5 (“Latin/Cyrillic”)
ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмноп
рстуфхцчшщъыьэюя№ёђѓєѕіїјљњћќ§ўџ
ISO/IEC 8859-5 (“Latin/Cyrillic”)
А л л о \0
↓ ↓ ↓ ↓ ↓
B0 DB DB DE 00
ISO/IEC 8859-5 (“Latin/Cyrillic”)
B0 DB DB DE 00
↓ ↓ ↓ ↓ ↓
А л л о \0
But You Have to be Careful…
B0 DB DB DE 00
↓ ↓ ↓ ↓ ↓
° Ϋ Ϋ ή \0
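The same four bytes produce different text depending on which single-byte table the reader assumes. Here is a minimal sketch of that, using hand-written decoders. The function names are illustrative and do not come from any library. The second table is assumed to be ISO/IEC 8859-7 (Latin/Greek), which matches the characters shown above, and its decoder deliberately covers only the Greek letter range and the degree sign.

```cpp
#include <string>

// Decode one ISO/IEC 8859-5 byte to a Unicode code point.
char32_t decode_8859_5(unsigned char b)
{
    if (b < 0xA1)  return b;       // ASCII, C1 controls, and NBSP map to themselves
    if (b == 0xAD) return 0x00AD;  // soft hyphen
    if (b == 0xF0) return 0x2116;  // numero sign
    if (b == 0xFD) return 0x00A7;  // section sign
    return b + 0x0360;             // remaining bytes land in U+0401..U+045F
}

// Partial ISO/IEC 8859-7 decoder (assumption: only the bytes this sketch needs).
char32_t decode_8859_7_partial(unsigned char b)
{
    if (b <= 0xA0 || b == 0xB0) return b;                        // ASCII, NBSP, degree sign
    if (b >= 0xC0 && b <= 0xFE && b != 0xD2) return b + 0x02D0;  // Greek letters
    return 0xFFFD;                                               // not handled here
}

// Decode a whole byte string with the given per-byte table.
std::u32string decode_bytes(std::string const& bytes,
                            char32_t (*decode_byte)(unsigned char))
{
    std::u32string result;
    for (char c : bytes)
    {
        result.push_back(decode_byte(static_cast<unsigned char>(c)));
    }
    return result;
}
```

Calling decode_bytes on "\xB0\xDB\xDB\xDE" with each decoder gives two unrelated strings, because the bytes themselves carry no indication of their encoding.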
Multi-Byte Encodings
Some characters are representable using a single byte; others require two bytes
Two-byte characters consist of a lead byte and a trail byte
The lead byte always has the high bit set; the trail byte may or may not, so a trail byte can look like an ASCII character
Starts from 7-bit ASCII, with a couple of substitutions (replaces \ with ¥ and ~ with ‾)
The encoding form is very complex due to overlap
Shift-JIS
http://en.wikipedia.org/wiki/Shift_JIS
Shift-JIS
44 → D (Latin Capital D)
84 44 → Д (Cyrillic Capital De)
84 84 → т (Cyrillic Small Te)
Shift-JIS
44 44 84 84 84 84 84 44
↓ ↓ ↓ ↓ ↓
D D т т Д
Shift-JIS
… 84 84 84 84 84 84 84 …
↑
p
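A pointer into the middle of such a run cannot tell whether it sits on a lead byte or a trail byte, so character boundaries have to be found by scanning from the start. The sketch below does that scan. The function names are hypothetical, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are an assumption that covers common Shift-JIS variants. Trail bytes are not validated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed lead-byte ranges for Shift-JIS: 0x81..0x9F and 0xE0..0xFC.
bool is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Return the offset at which each character starts, scanning from the
// beginning of the string. A lead byte at the very end is treated as a
// one-byte character rather than reading past the end.
std::vector<std::size_t> sjis_character_starts(std::string const& bytes)
{
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < bytes.size(); )
    {
        starts.push_back(i);
        bool const two_bytes =
            is_sjis_lead_byte(static_cast<unsigned char>(bytes[i]))
            && i + 1 < bytes.size();
        i += two_bytes ? 2 : 1;
    }
    return starts;
}
```

For the bytes 44 44 84 84 84 84 84 44 this yields five character starts, matching the five characters in the example above.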
Shift-JIS
Universal: Must be able to represent all characters likely to be used in text interchange
Efficient: Plain text should be simple to parse
Unambiguous: Any given Unicode code point always represents the same character
A set of characters, not glyphs (so, not concerned with visual representation)
Unicode 1.0
H e l l o ! \0
↓ ↓ ↓ ↓ ↓ ↓ ↓
U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
Unicode 1.0
X Δ Ж ヸ ᠼ ☃ ڳ
↓ ↓ ↓ ↓ ↓ ↓ ↓
U+0058 U+0394 U+0436 U+30F8 U+183C U+2603 U+06B3
UCS-2
00 48 00 65 00 6C 00 6C 00 6F 00 21 00 00
↓ ↓ ↓ ↓ ↓ ↓ ↓
U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
↓ ↓ ↓ ↓ ↓ ↓ ↓
H e l l o ! \0
UCS-2
00 48 00 65 00 6C 00 6C 00 6F 00 21 00 00
↓ ↓ ↓ ↓ ↓ ↓ ↓
U+4800 U+6500 U+6C00 U+6C00 U+6F00 U+2100 U+0000
↓ ↓ ↓ ↓ ↓ ↓ ↓
䠀 攀 氀 氀 漀 ℀ \0
UCS-2
FE FF 00 48 00 65 00 6C 00 6C 00 6F 00 21 00 00
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
U+FEFF U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
↓ ↓ ↓ ↓ ↓ ↓ ↓
H e l l o ! \0
UCS-2
FF FE 48 00 65 00 6C 00 6C 00 6F 00 21 00 00 00
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
U+FEFF U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
↓ ↓ ↓ ↓ ↓ ↓ ↓
H e l l o ! \0
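The byte order mark lets a reader choose the right byte order before decoding. Below is a minimal sketch of a UCS-2 decoder that checks for U+FEFF in either byte order. The function name is hypothetical, and falling back to big-endian when no BOM is present is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UCS-2 bytes into 16-bit code units, using a leading BOM (if any)
// to pick the byte order. The BOM itself is not included in the result.
std::u16string decode_ucs2(std::vector<unsigned char> const& bytes)
{
    bool little_endian = false;  // assumption: big-endian without a BOM
    std::size_t i = 0;
    if (bytes.size() >= 2)
    {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        {
            i = 2;                      // big-endian BOM
        }
        else if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        {
            little_endian = true;       // little-endian BOM
            i = 2;
        }
    }

    std::u16string result;
    for (; i + 1 < bytes.size(); i += 2)  // a trailing odd byte is ignored
    {
        unsigned const first  = bytes[i];
        unsigned const second = bytes[i + 1];
        unsigned const unit   = little_endian ? (second << 8 | first)
                                              : (first << 8 | second);
        result.push_back(static_cast<char16_t>(unit));
    }
    return result;
}
```

Both byte sequences shown above, FE FF 00 48 ... and FF FE 48 00 ..., decode to the same code units with this function.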
UCS-2
Advantages:
A huge number of characters are representable
Characters from different scripts are easily combinable in a single string
Each code point is representable using a single code unit, so the encoding is simple
Disadvantages:
Multiple possible byte orderings, so byte order mark (BOM) is required for interchange
Every character requires two bytes, so text for many languages requires twice as much storage
None of the byte-oriented string functions (like strcpy) work with UCS-2 strings
Code Space Usage
[Chart: number of characters (axis from 0 to 60,000) by Unicode version: 1.0 (1991), 2.0 (1996), 3.0 (1999), 4.0 (2003), 5.0 (2006), 6.0 (2010), 7.0 (2014?)]
Unicode Encodings Today
Expanding the Code Space
We can’t store that many distinct values using 16 bits, so the encoding has to change too
If 16 bits isn’t enough space, how about 32 bits?
UTF-32
00 00 00 48 00 00 00 69 00 00 00 21
↓ ↓ ↓
U+0048 U+0069 U+0021
↓ ↓ ↓
H i !
UTF-32
00 00 FE FF 00 00 00 48 00 00 00 69 00 00 00 21
↓ ↓ ↓ ↓
U+FEFF U+0048 U+0069 U+0021
↓ ↓ ↓
H i !
UTF-32
FF FE 00 00 48 00 00 00 69 00 00 00 21 00 00 00
↓ ↓ ↓ ↓
U+FEFF U+0048 U+0069 U+0021
↓ ↓ ↓
H i !
UTF-32
FF FE 00 00 48 00 00 00 69 00 00 00 30 F4 01 00
↓ ↓ ↓ ↓
U+FEFF U+0048 U+0069 U+1F430
↓ ↓ ↓
H i 🐰
UTF-32
Advantages:
A huge number of characters are representable
Each code point is representable using a single code unit, so the encoding is simple
…this isn’t all that useful in most practical string usage; we’ll see why later…
Disadvantages:
Two possible byte orderings, so byte order mark (BOM) is required
Every code point requires four bytes, wasting at least 11 bits each (only 21 bits are needed)
None of the byte-oriented string functions (like strcpy) work with UTF-32 strings
Nothing that was written to work with UCS-2 works with UTF-32
UTF-8
48 65 6C 6C 6F 21 00
↓ ↓ ↓ ↓ ↓ ↓ ↓
U+0048 U+0065 U+006C U+006C U+006F U+0021 U+0000
↓ ↓ ↓ ↓ ↓ ↓ ↓
H e l l o ! \0
UTF-8
58 CE 94 D0 B6 E3 83 B8 E1 A0 BC F0 9F 90 B0 00
↓ ↓ ↓ ↓ ↓ ↓ ↓
U+0058 U+0394 U+0436 U+30F8 U+183C U+1F430 U+0000
↓ ↓ ↓ ↓ ↓ ↓ ↓
X Δ Ж ヸ ᠼ 🐰 \0
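The byte sequences above follow from the UTF-8 bit layout: one byte for code points below U+0080, two below U+0800, three below U+10000, and four otherwise. Here is a minimal sketch of an encoder. The function name is hypothetical, and the sketch does no validation (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <string>

// Encode a single code point as UTF-8 code units.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}
```

Encoding U+0394 this way gives CE 94 and encoding U+1F430 gives F0 9F 90 B0, as in the example above.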
UTF-8
Advantages:
ASCII text has the same representation in UTF-8
No byte order mark (BOM) is required, though there is an optional BOM (EF BB BF)
Many byte-oriented string functions (strcpy, strcat, strlen, etc.) work with UTF-8 strings
UTF-8 encoded text requires less storage than 16-bit and 32-bit encodings for most languages
Disadvantages:
It’s a variable-width encoding
Text for a few languages requires more storage than 16-bit encodings
UTF-16
1. Subtract 0x010000 from the code point; this yields a 20-bit number in [0x0, 0xFFFFF]
2. Split the number in half: the upper ten bits form one half and the lower ten bits form the other
3. The upper ten bits are added to 0xD800 to form the lead surrogate
4. The lower ten bits are added to 0xDC00 to form the trail surrogate
UTF-16
🍸
0x1F378
- 0x10000
0x0F378 → 0b00001111001101111000
0b00001111001101111000
↓ ↓
0x003C 0x0378
+ 0xD800 0xDC00
0xD83C 0xDF78
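The four steps and the worked example above can be sketched as a small function. The name encode_utf16 is hypothetical. Code points below U+10000 are passed through as a single code unit, and the sketch does not validate its input.

```cpp
#include <string>

// Encode a single code point as one or two UTF-16 code units.
std::u16string encode_utf16(char32_t cp)
{
    if (cp < 0x10000)
    {
        // Fits in one code unit; no surrogate pair needed.
        return std::u16string(1, static_cast<char16_t>(cp));
    }

    char32_t const v = cp - 0x10000;                                      // step 1: 20-bit value
    char16_t const lead  = static_cast<char16_t>(0xD800 + (v >> 10));     // steps 2-3: upper ten bits
    char16_t const trail = static_cast<char16_t>(0xDC00 + (v & 0x3FF));   // steps 2, 4: lower ten bits
    return std::u16string{lead, trail};
}
```

For U+1F378 the result is the pair 0xD83C 0xDF78, matching the arithmetic above.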
UTF-16
Advantages:
Some level of compatibility with UCS-2 and systems designed to use a 16-bit encoding
Text for a few languages is smaller in UTF-16 than in UTF-8
Disadvantages:
Multiple possible byte orderings, so byte order mark (BOM) is required for interchange
Every character requires at least two bytes
None of the byte-oriented string functions (like strcpy) work with UTF-16 strings
It’s a variable-width encoding, but it’s often misused as a fixed-width encoding
Binary string comparison for UTF-16 produces different results than for UTF-8 and UTF-32
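The last point is easy to check with one code point just above the surrogate range and one outside the Basic Multilingual Plane. The sketch below compares U+E000 with U+10000 in each encoding form, written as explicit code units. These two code points and the function names are chosen purely for illustration.

```cpp
#include <string>

// Is U+E000 ordered before U+10000 when comparing code units?

bool less_in_utf8()
{
    // U+E000 = EE 80 80, U+10000 = F0 90 80 80
    return std::string{"\xEE\x80\x80"} < std::string{"\xF0\x90\x80\x80"};
}

bool less_in_utf16()
{
    // U+E000 = E000, U+10000 = D800 DC00 (a surrogate pair)
    return std::u16string{0xE000} < std::u16string{0xD800, 0xDC00};
}

bool less_in_utf32()
{
    return std::u32string{0xE000} < std::u32string{0x10000};
}
```

UTF-8 and UTF-32 order the two strings by code point, while UTF-16 orders them the other way round, because the lead surrogate 0xD800 is smaller than 0xE000.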
UTF-8, UTF-16, UTF-32
UTF-8
More compact than UTF-32; usually more compact than UTF-16
It’s a variable-width encoding, so it’s more complex
By far the most commonly used for storage and data transmission
Dominant character encoding on the Internet
UTF-32
Simple, fixed-width encoding
Lots of wasted space because of the large 32-bit code units
UTF-8 vs. UTF-16 Text Size
Code     UTF-8    UTF-16   UTF-8 / UTF-16   Language
pnb       1443      1204   119.85 %         Panjabi, Western
kor      67522     55332   122.03 %         Korean
mal      76590     60468   126.66 %         Malayalam
tel       2642      2080   127.02 %         Telugu
hin     455546    349438   130.37 %         Hindi
mar    1332509   1008604   132.11 %         Marathi
npi        460       348   132.18 %         Nepali
ben      27733     20802   133.32 %         Bengali
san       1465      1094   133.91 %         Sanskrit
kat      29970     22344   134.13 %         Georgian
ain        546       384   142.19 %         Ainu (Japan)
lao       1576      1084   145.39 %         Lao
khm      22676     15348   147.75 %         Central Khmer
cmn    1563663   1055924   148.08 %         Mandarin Chinese
yue     115488     77874   148.30 %         Yue Chinese
wuu     150945    101588   148.59 %         Wu Chinese
tha      18745     12610   148.65 %         Thai
lzh      73285     48982   149.62 %         Literary Chinese
bod       3925      2622   149.69 %         Tibetan
jpn    9778643   6524720   149.87 %         Japanese
Dynamic Composition
Dynamic Composition
ÀÁÂÃÄÅ
Dynamic Composition
A + ̈ → Ä
U+0041 U+0308
Dynamic Composition
e + ̃ + ̽ + ̪ → ẽ̪̽
U+0065 U+0303 U+033D U+032A
Dynamic Composition
e + ̃ + ̽ + ̪ → ẽ̪̽
U+0065 U+0303 U+033D U+032A
e + ̽ + ̃ + ̪ → ẽ̪̽
U+0065 U+033D U+0303 U+032A
◔_◔
2 x U+25D4 (CIRCLE WITH UPPER RIGHT QUADRANT BLACK); ROLLING EYES
You must remember that std::basic_string is just a dumb sequence of code units
NOT code points, characters, text elements, grapheme clusters, etc.
Iterators (using begin() and end()) iterate over code units
size() does not return the number of code points; it returns the number of code units
Functions like front, back, operator[], push_back, find(CharT) overloads operate on code units
std::basic_string
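std::string a; // holds UTF-8 code units
// Wrong: '☃' is three UTF-8 code units and does not fit in one char
a.push_back('☃'); // U+2603
// Right: append every code unit, excluding the array's null terminator
char const snowman[]{u8"☃"}; // U+2603
a.insert(a.end(), std::begin(snowman), std::end(snowman) - 1);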
std::basic_string
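std::u16string a; // holds UTF-16 code units
// Wrong: U+1F378 needs two UTF-16 code units and does not fit in one char16_t
a.push_back(u'🍸'); // U+1F378
// Right: append both code units, excluding the array's null terminator
char16_t const glass[]{u"🍸"}; // U+1F378
a.insert(a.end(), std::begin(glass), std::end(glass) - 1);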
std::basic_string
// Right:
a.push_back(U'🍸'); // U+1F378
“Length”
U+1F4CF (STRAIGHT RULER)
String Length
"Hello"
↑↑↑↑↑
12345
String Length
"1 Ä 🍸"
1 A ¨ 🍸
U+0031 U+0041 U+0308 U+1F378
UTF-8 31 41 CC 88 F0 9F 8D B8
UTF-16 0031 0041 0308 D83C DF78
UTF-32 00000031 00000041 00000308 0001F378
String Length: Number of Bytes
"1 Ä 🍸"
1 A ¨ 🍸
U+0031 U+0041 U+0308 U+1F378
Length
UTF-8 31 41 CC 88 F0 9F 8D B8 8
UTF-16 0031 0041 0308 D83C DF78 10
UTF-32 00000031 00000041 00000308 0001F378 16
String Length: Number of Code Units
"1 Ä 🍸"
1 A ¨ 🍸
U+0031 U+0041 U+0308 U+1F378
Length
UTF-8 31 41 CC 88 F0 9F 8D B8 8
UTF-16 0031 0041 0308 D83C DF78 5
UTF-32 00000031 00000041 00000308 0001F378 4
String Length: Number of Code Points
"1 Ä 🍸"
1 A ¨ 🍸
U+0031 U+0041 U+0308 U+1F378
Length
UTF-8 31 41 CC 88 F0 9F 8D B8 4
UTF-16 0031 0041 0308 D83C DF78 4
UTF-32 00000031 00000041 00000308 0001F378 4
String Length: Number of Text Elements
"1 Ä 🍸"
1 A ¨ 🍸
U+0031 U+0041 U+0308 U+1F378
Length
UTF-8 31 41 CC 88 F0 9F 8D B8 3
UTF-16 0031 0041 0308 D83C DF78 3
UTF-32 00000031 00000041 00000308 0001F378 3
String Length
std::string s{u8"1Ä🍸"};
std::size_t number_of_code_units{s.size()};
for (auto&& c : s)
{
// Iterating over code units, not characters
}
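Since size() only reports code units, counting code points in a UTF-8 std::string takes a separate pass. One common approach, sketched here under the assumption that the input is valid UTF-8, is to count every byte that is not a continuation byte (10xxxxxx). The function name is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes.
// Assumes the input is valid UTF-8; malformed input is not detected.
std::size_t count_code_points(std::string const& utf8)
{
    std::size_t count = 0;
    for (char c : utf8)
    {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
        {
            ++count;  // a lead byte or an ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the bytes 31 41 CC 88 F0 9F 8D B8 from the tables above, size() is 8 and count_code_points returns 4. Counting text elements (3 here) needs grapheme cluster rules and is beyond this sketch.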
“Equality”
U+2248 (ALMOST EQUAL TO)
Representational Equality
if (a == b)
{
std::cout << "The strings are equal.\n";
}
Representational Equality
std::string a{u8"1Ä🍸"};
std::string b{u8"1Ä🍸"};
if (a == b)
{
std::cout << "The strings are equal.\n";
}
Multiple Representations
U+0041 U+0308
Ä
A + ¨
Ä
U+00C4
Latin Capital Letter A With Diaeresis
Multiple Representations
U+0041
A
A
Ａ
U+FF21
Fullwidth Latin Capital Letter A
Multiple Representations
U+0066 U+0069
fi
f + i
ﬁ
U+FB01
Latin Small Ligature Fi
Multiple Representations
Ⅻ
X + I + I
Ⅻ
U+216B
Roman Numeral Twelve
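Because operator== on std::basic_string compares code units, two of these representations of the same text compare unequal unless both are normalized first. Here is a minimal sketch using the two spellings of Ä, written as explicit UTF-8 bytes. The function names are illustrative only.

```cpp
#include <string>

// U+00C4 (precomposed) as UTF-8: C3 84
std::string precomposed_a_diaeresis()
{
    return "\xC3\x84";
}

// U+0041 U+0308 (A followed by combining diaeresis) as UTF-8: 41 CC 88
std::string decomposed_a_diaeresis()
{
    return "A\xCC\x88";
}

// Representational equality: true only if the code units match exactly.
bool representationally_equal(std::string const& a, std::string const& b)
{
    return a == b;
}
```

The two strings display the same way, yet representationally_equal returns false for them and their sizes differ (2 and 3 code units).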
Multiple Representations
std::string a{"a"};
std::string b{"b"};
if (a < b)
{
std::cout << "a is ordered before b.\n";
}
Ordering
// Output: X Y Z a b c
String Collation
std::locale en_us{"en-US"};
std::sort(v.begin(), v.end(), en_us);
// Output: a b c X Y Z
String Collation
std::locale de_de{"de-DE"};
std::sort(v.begin(), v.end(), de_de);
// Output: a ä b z
String Collation
std::locale sv_se{"sv-SE"};
std::sort(v.begin(), v.end(), sv_se);
// Output: a b z ä
Unicode Collation using the Standard Library
std::locale loc{"en-US"};
char lowercase_a{'a'};
char uppercase_a{std::toupper(lowercase_a, loc)}; // A
Text Manipulation
Most Standard Library manipulation functions (like toupper) are code unit based
But Unicode text manipulation does not work well with this model
Consider ß in German: the uppercase form is SS, which is two characters
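A function that maps one code unit to one code unit cannot express this, because the result is longer than the input character. The sketch below works on the whole string instead. It is a hypothetical helper that handles only ASCII letters and ß (UTF-8 bytes C3 9F) and copies everything else unchanged. Real case mapping needs full Unicode data, for example from ICU or Boost.Locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Uppercase ASCII letters and expand ß to "SS" in a UTF-8 string.
// All other bytes are copied through unchanged (assumption: this is
// only an illustration, not a complete case mapping).
std::string to_upper_ascii_and_sharp_s(std::string const& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        unsigned char const c = static_cast<unsigned char>(utf8[i]);
        if (c == 0xC3 && i + 1 < utf8.size()
            && static_cast<unsigned char>(utf8[i + 1]) == 0x9F)
        {
            out += "SS";  // one character becomes two
            ++i;          // skip the second byte of ß
        }
        else if (c < 0x80)
        {
            out += static_cast<char>(std::toupper(c));
        }
        else
        {
            out += utf8[i];
        }
    }
    return out;
}
```

With the default "C" locale, passing "straße" encoded as UTF-8 produces "STRASSE": six characters in, seven characters out.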
Text Manipulation
std::string utf8{u8"1Ä🍸"};
std::wstring_convert<
std::codecvt_utf8<char32_t>, char32_t
> utf32_converter;
std::u32string utf32{utf32_converter.from_bytes(utf8)};
I/O Conversions using <codecvt>
for (char32_t c{}; f.get(c); ) { std::cout << std::hex << c << ' '; }
// Output: 31 41 308 1f378
International Components for Unicode (ICU)
International Components for Unicode
UChar32 cake_utf32c{static_cast<UChar32>(U'🍰')};
UChar32 cake_utf32s[2]{static_cast<UChar32>(U'🍰'), 0};
UChar cake_utf16 [3]{0xD83C, 0xDF70, 0};
char cake_utf8 [5]{u8"🍰"};
UnicodeString cake_1{cake_utf32c};
UnicodeString cake_2{UnicodeString::fromUTF32(cake_utf32s, 1)};
UnicodeString cake_3{cake_utf16};
UnicodeString cake_4{UnicodeString::fromUTF8(cake_utf8)};
UnicodeString
UErrorCode status{U_ZERO_ERROR};
Normalizer2 const& normalizer{*icu::Normalizer2::getInstance(
nullptr, "nfc", UNORM2_COMPOSE, status)};
UnicodeString result{};
normalizer.normalize(source, result, status);
// U+00C4 (Latin Capital Letter A with Diaeresis)
Collation
UErrorCode status{U_ZERO_ERROR};
std::unique_ptr<Collator> collator{
Collator::createInstance(Locale("en", "US"), status)};
UnicodeString a{UnicodeString::fromUTF8(u8"a")};
UnicodeString z{UnicodeString::fromUTF8(u8"Z")};
UErrorCode status{U_ZERO_ERROR};
std::unique_ptr<RegexMatcher> matcher{new RegexMatcher{
UnicodeString::fromUTF8(u8R"(\p{Number})"), 0, status}};
matcher->reset(UnicodeString::fromUTF8(u8"Ⅻ⅝"));
while (matcher->find())
{
// Handle match
}
Boost.Locale and Boost.Regex
Boost.Locale
boost::locale::generator gen{};
std::locale loc{gen.generate("en-US.UTF-8")};
Normalization
assert(a_decomp == b);
assert(b_comp == a);
Collation
std::string a{u8"A"};
std::string accented_a{u8"A\u0308"};
int result{std::use_facet<boost::locale::collator<char>>(loc)
.compare(boost::locale::collator_base::primary, a, accented_a)};
std::string a{u8"1Ä🍸"};
std::u16string b{boost::locale::conv::utf_to_utf<char16_t>(a)};
std::u32string c{boost::locale::conv::utf_to_utf<char32_t>(a)};
std::u32string d{boost::locale::conv::utf_to_utf<char32_t>(b)};
std::string e{boost::locale::conv::utf_to_utf<char >(b)};
Conversions
// А л л о
std::string s{"\xB0\xDB\xDB\xDE"};
std::string utf8_s{boost::locale::conv::to_utf<char>(s, "ISO-8859-5")};
Text Manipulation
namespace ba = boost::locale::boundary;
std::string subject{u8"1Ä🍸"};
ba::segment_index<std::string::const_iterator> map(
ba::character, subject.begin(), subject.end(), loc);
boost::u32regex r{boost::make_u32regex(u8R"(\p{Number})")};
std::string subject{u8"Ⅻ⅝"};
www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html
www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3572.html
wchar_t
mbtowc, wctomb
mbstowcs, wcstombs
Unicode Support in C99
wchar_t
mbtowc, wctomb
mbstowcs, wcstombs
wchar_t equivalents for many I/O and string handling functions
Unicode Support in C11
setlocale(LC_CTYPE, "en_US.UTF-8");
char32_t utf32_c = 0;
mbstate_t state = { 0 };
mbrtoc32(&utf32_c, utf8_c, 4, &state);
printf("0x%08x\n", (unsigned)utf32_c);
mbrtoc32
setlocale(LC_CTYPE, "en_US.UTF-8");
char16_t utf16_c[2] = { 0 };
mbstate_t state = { 0 };
mbrtoc16(&utf16_c[0], utf8_c, 4, &state);
mbrtoc16(&utf16_c[1], utf8_c, 4, &state);