Extr 040
representing the Latin letter a (#$0061) followed by the code point representing the
grave accent (#$0300), this should be displayed as a single accented character.
In Object Pascal coding terms, if you write the following (part of the CodePoints
application project), the message will have one single accented character, as in
Figure 6.2.
var
  str: String;
begin
  str := #$0061 + #$0300;
  ShowMessage (str);
Figure 6.2: A single grapheme can be the result of multiple code points
In this case we have two characters, representing two code points, but only one
grapheme (or visual element). The fact is that while in the Latin alphabet you can
use a specific Unicode code point to represent the given grapheme (letter a with
grave accent is code point $00E0), in other alphabets combining Unicode code
points is the only way to obtain a given grapheme (and the correct output).
Even if the display is that of an accented character, there is no automatic
normalization or transformation of the value (only a proper display), so the string internally
remains different from one with the single character à.
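The difference between the two strings can be checked directly. The following console program is only a minimal sketch and is not part of the CodePoints project; the program and variable names are made up for illustration. It relies on the fact that the = operator compares the stored values of two strings, not their visual output.

```pascal
program CompareGraphemes;

{$APPTYPE CONSOLE}

var
  Combined, Precomposed: string;
begin
  // Letter a followed by the combining grave accent: two code points
  Combined := #$0061 + #$0300;
  // The single code point for the letter a with grave accent
  Precomposed := #$00E0;

  // Both strings display the same grapheme, but they are stored differently
  Writeln ('Combined length:    ', Length (Combined));       // 2
  Writeln ('Precomposed length: ', Length (Precomposed));    // 1
  Writeln ('Same string value:  ', Combined = Precomposed);  // FALSE
end.
```

If two such strings need to compare as equal, one of them has to be normalized explicitly before the comparison, as nothing in the string assignment or display does that automatically.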
note The rendering of graphemes from multiple code points might depend on specific support from the
operating system and on the text rendering technique being used, so some graphemes might not be
displayed correctly on every operating system.
One of the elements of confusion behind Unicode is that there are multiple ways to
represent the same code point (or Unicode character numerical value) in terms of
actual storage, of physical bytes, in memory or in a file.
The issue stems from the fact that the only way to represent all Unicode code points
in a simple and uniform way is to use four bytes for each code point. This accounts
for a fixed-length representation (each character always requires the same number
of bytes), but most developers would perceive this as too expensive in memory and
processing terms.
note In Object Pascal the Unicode code points can be represented directly in a 4-byte representation
by using the UCS4Char data type.
That's why the Unicode standard defines other representations, generally requiring
less memory, but in which the number of bytes for each symbol is different, depending
on its code point. The idea is to use a smaller representation for the most common
elements, and a longer one for those infrequently encountered.
The different physical representations of the Unicode code points are called Unicode
Transformation Formats (or UTF). These are algorithmic mappings, part of the
Unicode standard, that map each code point (the absolute numeric representation of a
character) to a unique sequence of bytes representing the given character. Notice
that the mappings can be used in both directions, converting back and forth between
different representations.
The standard defines three of these formats, depending on how many bits are used
to represent the initial part of the set (the initial 128 characters): 8, 16, or 32. It is
interesting to notice that all three forms of encodings need at most 4 bytes of data
for each code point.
• UTF-8 transforms characters into a variable-length encoding of 1 to 4 bytes.
UTF-8 is popular for HTML and similar protocols, because it is quite compact
when most characters (like tags in HTML) fall within the ASCII subset.
• UTF-16 is popular in many operating systems (including Windows and Mac OS
X) and development environments. It is quite convenient, as most characters fit in
two bytes, and it is reasonably compact and fast to process.
• UTF-32 makes a lot of sense for processing (all code points have the same
length), but it is memory consuming and has limited use in practice.
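To make the trade-off among the three formats more concrete, here is a minimal sketch that counts the bytes needed by a few sample strings in each encoding. It is not one of the book's projects: the names Utf32ByteCount and ShowSizes are hypothetical helpers. The sketch assumes the TEncoding class of the System.SysUtils unit for UTF-8 and UTF-16, and the RTL function UnicodeStringToUCS4String for UTF-32.

```pascal
program UtfSizes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: bytes needed to store S as UTF-32
function Utf32ByteCount (const S: string): Integer;
var
  CodePoints: UCS4String;
begin
  CodePoints := UnicodeStringToUCS4String (S);
  // A UCS4String ends with a zero terminator, which is not counted here
  Result := (Length (CodePoints) - 1) * SizeOf (UCS4Char);
end;

// Hypothetical helper: print the size of S in the three formats
procedure ShowSizes (const Descr, S: string);
begin
  Writeln (Descr,
    ': UTF-8 = ', TEncoding.UTF8.GetByteCount (S),
    ', UTF-16 = ', TEncoding.Unicode.GetByteCount (S),
    ', UTF-32 = ', Utf32ByteCount (S));
end;

begin
  ShowSizes ('Three ASCII letters', 'abc');         // 3, 6, 12
  ShowSizes ('Letter a with grave', #$00E0);        // 2, 2, 4
  ShowSizes ('Code point $1D122', #$D834 + #$DD22); // 4, 4, 4
end.
```

The last line shows a code point outside the BMP, for which all three formats end up using 4 bytes.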
There is a common misconception that UTF-16 can map directly all code points with
two bytes, but since Unicode defines over 100,000 code points you can easily figure
out they won't fit into 64K elements. It is true, however, that at times developers use
only a subset of Unicode, to make it fit in a 2-bytes-per-character fixed-length
representation. In the early days, this subset of Unicode was called UCS-2; now you
often see it referred to as the Basic Multilingual Plane (BMP). However, this is only a
subset of Unicode (one of many planes).
note A problem relating to multi-byte representations (UTF-16 and UTF-32) is deciding which of the
bytes comes first. According to the standard, both orders are allowed, so you can have UTF-16 BE
(big-endian) or LE (little-endian), and the same for UTF-32. The big-endian byte serialization has the
most significant byte first, the little-endian byte serialization has the least significant byte first.
The byte serialization is often marked in files, along with the UTF representation, with a header
called Byte Order Mark (BOM).
note On the other hand, when you are receiving data from a web request or through other Internet
protocols, you might have a specific header (part of the protocol) indicating the encoding, rather than
relying on a BOM.
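As an illustration of the Byte Order Mark mentioned above, the following sketch prints the BOM bytes of three encodings. It assumes the GetPreamble method of the TEncoding class in System.SysUtils; the BytesToHex function is a hypothetical helper written only for this example.

```pascal
program ShowBoms;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: format a byte array as space-separated hex values
function BytesToHex (const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToHex (B, 2) + ' ';
  Result := Trim (Result);
end;

begin
  // GetPreamble returns the BOM bytes of each encoding
  Writeln ('UTF-8:     ', BytesToHex (TEncoding.UTF8.GetPreamble));             // EF BB BF
  Writeln ('UTF-16 LE: ', BytesToHex (TEncoding.Unicode.GetPreamble));          // FF FE
  Writeln ('UTF-16 BE: ', BytesToHex (TEncoding.BigEndianUnicode.GetPreamble)); // FE FF
end.
```

The two UTF-16 marks are the same value ($FEFF) with its bytes in opposite order, which is how a reader of the file can tell the two serializations apart.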
Looking at Unicode
How do we create a table of Unicode characters like those I displayed earlier for
ASCII ones? We can start by displaying code points in the Basic Multilingual Plane
(BMP), excluding what are called surrogate pairs.
note Not all 16-bit numeric values are valid UTF-16 characters on their own: some values (called
surrogates) are reserved and are used in pairs to represent code points above 65535. A good example
of a surrogate pair is the symbol used in music scores for the F (or bass) clef, 𝄢. It is code point
$1D122, which is represented in UTF-16 by two values, $D834 followed by $DD22.
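The two values of the pair come from simple arithmetic defined by the UTF-16 encoding: subtract $10000 from the code point, then add the upper 10 bits of the result to $D800 and the lower 10 bits to $DC00. The following is a minimal sketch of that calculation; the ToSurrogates procedure and its parameter names are made up for this example, and it assumes the code point passed in is above $FFFF.

```pascal
program SurrogateMath;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: split a code point above $FFFF into a surrogate pair
procedure ToSurrogates (CodePoint: Cardinal; out HighSurr, LowSurr: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;               // leaves a 20-bit value
  HighSurr := Word ($D800 + (Offset shr 10)); // upper 10 bits
  LowSurr := Word ($DC00 + (Offset and $3FF)); // lower 10 bits
end;

var
  H, L: Word;
begin
  ToSurrogates ($1D122, H, L);
  Writeln (IntToHex (H, 4), ' ', IntToHex (L, 4)); // D834 DD22
end.
```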
Displaying all of the elements of the BMP would require a 256 * 256 grid, hard to
accommodate on screen. This is why the ShowUnicode application project has a tab
control with two pages: the first page has the primary selector with 256 blocks, while the
second page shows a grid with the actual Unicode elements, one section at a time. This
program has a little more of a user interface than most others in the book, and you
can simply skim through its code if you are only interested in its output (and not the
internals).
When the program starts, it fills the ListView control in the first page of the
TabControl with 256 entries, each indicating the first and last character of a group of 256.
Here is the actual code of the OnCreate event handler of the form and a simple
function used to display each element, while the corresponding output is in Figure 6.3:
// helper function
function GetCharDescr (nChar: Integer): string;
begin
  if Char(nChar).IsControl then
    Result := 'Char #' + IntToStr (nChar) + ' [ ]'
  else
    Result := 'Char #' + IntToStr (nChar) +
      ' [' + Char (nChar) + ']';
end;