
Extr 040

158 - 06: All About Strings

representing the Latin letter a (#$0061) followed by the code point representing the
grave accent (#$0300), this should be displayed as a single accented character.
In Object Pascal coding terms, if you write the following (part of the CodePoints
application project), the message will show a single accented character, as in
Figure 6.2.
var
  str: String;
begin
  str := #$0061 + #$0300;
  ShowMessage (str);

Figure 6.2: A single grapheme can be the result of multiple code points

In this case we have two characters, representing two code points, but only one
grapheme (or visual element). The fact is that while in the Latin alphabet you can
use a specific Unicode code point to represent the given grapheme (letter a with
grave accent is code point $00E0), in other alphabets combining Unicode code
points is the only way to obtain a given grapheme (and the correct output).
Even if the display is that of an accented character, there is no automatic
normalization or transformation of the value (only a proper display), so the string
internally remains different from one with the single character à.
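The difference between the two strings can be checked with a few lines of code. The following console program is only a minimal sketch, not part of the CodePoints project; the program and variable names are made up for illustration.

```pascal
program CombiningDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Combined, Precomposed: string;
begin
  // letter a followed by the combining grave accent: two code points
  Combined := #$0061 + #$0300;
  // the single code point for the letter a with grave accent
  Precomposed := #$00E0;

  Writeln (Length (Combined));       // number of Char elements: 2
  Writeln (Length (Precomposed));    // number of Char elements: 1
  // the plain string comparison works on the stored values,
  // not on the displayed graphemes, so it fails
  Writeln (Combined = Precomposed);  // FALSE
end.
```

The two strings display the same way but have different lengths and do not compare as equal, because no normalization takes place.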

note The rendering of graphemes from multiple code points might depend on specific support from the
operating system and on text rendering techniques being used, so you might find out that for
some of the graphemes not all operating systems offer the correct output.

From Code Points to Bytes (UTF)


While ASCII used a direct and easy mapping of characters to their numeric
representation, Unicode uses a more complex approach. As I mentioned, every element
of the Unicode alphabet has an associated code point, but the mapping to the actual
representation is often more complicated.

Marco Cantù, Object Pascal Handbook



One of the elements of confusion behind Unicode is that there are multiple ways to
represent the same code point (or Unicode character numerical value) in terms of
actual storage, of physical bytes, in memory or in a file.
The issue stems from the fact that the only way to represent all Unicode code points
in a simple and uniform way is to use four bytes for each code point. This results
in a fixed-length representation (each character always requires the same number
of bytes), but most developers would perceive this as too expensive in memory and
processing terms.

note In Object Pascal the Unicode code points can be represented directly in a 4-byte representation
by using the UCS4Char data type.

That's why the Unicode standard defines other representations, generally requiring
less memory, but in which the number of bytes for each symbol is different,
depending on its code point. The idea is to use a smaller representation for the
most common elements, and a longer one for those infrequently encountered.
The different physical representations of the Unicode code points are called Unicode
Transformation Formats (or UTF). These are algorithmic mappings, part of the
Unicode standard, that map each code point (the absolute numeric representation of a
character) to a unique sequence of bytes representing the given character. Notice
that the mappings can be used in both directions, converting back and forth between
different representations.
The standard defines three of these formats, depending on how many bits are used
to represent the initial part of the set (the initial 128 characters): 8, 16, or 32. It is
interesting to notice that all three forms of encodings need at most 4 bytes of data
for each code point.
• UTF-8 transforms characters into a variable-length encoding of 1 to 4 bytes.
UTF-8 is popular for HTML and similar protocols, because it is quite compact
when most characters (like tags in HTML) fall within the ASCII subset.
• UTF-16 is popular in many operating systems (including Windows and Mac OS
X) and development environments. It is quite convenient as most characters fit in
two bytes, reasonably compact, and fast to process.
• UTF-32 makes a lot of sense for processing (all code points have the same
length), but it is memory consuming and has limited use in practice.
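To make the size differences concrete, the following console sketch uses the TEncoding class of the System.SysUtils unit to count the bytes some sample characters need in UTF-8 and UTF-16. It is an illustration only; the ShowSizes helper and the program name are hypothetical.

```pascal
program UtfSizes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// hypothetical helper: prints how many bytes a string takes
// once encoded as UTF-8 and as UTF-16
procedure ShowSizes (const Descr, S: string);
begin
  Writeln (Descr,
    ': UTF-8 = ', Length (TEncoding.UTF8.GetBytes (S)), ' bytes',
    ', UTF-16 = ', Length (TEncoding.Unicode.GetBytes (S)), ' bytes');
end;

begin
  ShowSizes ('Latin letter a', #$0061);        // ASCII subset
  ShowSizes ('a with grave accent', #$00E0);   // Latin-1 range
  ShowSizes ('Euro sign', #$20AC);             // elsewhere in the BMP
  ShowSizes ('F clef', #$D834 + #$DD22);       // code point $1D122, beyond the BMP
end.
```

The four lines should report 1, 2, 3, and 4 bytes for UTF-8, and 2, 2, 2, and 4 bytes for UTF-16: only the last character, which lies outside the first 64K code points, needs more than two bytes in UTF-16.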
There is a common misconception that UTF-16 can map directly all code points with
two bytes, but since Unicode defines over 100,000 code points you can easily figure
out they won't fit into 64K elements. It is true, however, that at times developers use
only a subset of Unicode, to make it fit in a 2-bytes-per-character fixed-length
representation. In the early days, this subset of Unicode was called UCS-2; now you
often see it referred to as the Basic Multilingual Plane (BMP). However, this is only
a subset of Unicode (one of many planes).

note A problem relating to multi-byte representations (UTF-16 and UTF-32) is deciding which of the
bytes comes first. According to the standard, all forms are allowed, so you can have a UTF-16 BE
(big-endian) or LE (little-endian), and the same for UTF-32. The big-endian byte serialization has
the most significant byte first, the little-endian byte serialization has the least significant
byte first. The byte serialization is often marked in files, along with the UTF representation,
by a header called Byte Order Mark (BOM).
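
The effect of the byte order can be seen by encoding the same string both ways. This is a minimal sketch based on the TEncoding.Unicode (little-endian) and TEncoding.BigEndianUnicode encodings of the RTL; the BytesToHex helper is hypothetical and written only for this example.

```pascal
program ByteOrderDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// hypothetical helper: formats a byte array as space-separated hex values
function BytesToHex (const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToHex (B, 2) + ' ';
  Result := Trim (Result);
end;

var
  Text: string;
begin
  Text := 'AB';
  // least significant byte first
  Writeln ('UTF-16 LE: ', BytesToHex (TEncoding.Unicode.GetBytes (Text)));
  // most significant byte first
  Writeln ('UTF-16 BE: ', BytesToHex (TEncoding.BigEndianUnicode.GetBytes (Text)));
end.
```

The first line should read 41 00 42 00 and the second 00 41 00 42: the same two code points, with the bytes of each 16-bit value swapped.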

The Byte Order Mark


When you have a text file storing Unicode characters, there is a way to indicate
which UTF format is being used for the code points. The information is stored in
a header or marker at the beginning of the file, called Byte Order Mark (BOM). This
is a signature indicating the Unicode format being used and the byte order form
(little or big endian, LE or BE). The following table provides a summary of the
various BOMs, which can be 2, 3, or 4 bytes long:
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
We'll see later in this chapter how Object Pascal manages the BOM within its
streaming classes. The BOM appears at the very beginning of a file with the actual
Unicode data immediately following it. So a UTF-8 file with the content AB contains
five bytes, shown here as hexadecimal values (3 for the BOM, 2 for the letters):
EF BB BF 41 42
If a text file has none of these signatures, it is generally considered as an ASCII text
file, but it might as well contain text with any encoding.
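As a quick check of the table above, the signatures can be read from the RTL encoding objects, which expose them through the GetPreamble method. This is only a sketch and is not how the streaming classes covered later in the chapter are used; the BytesToHex helper is hypothetical.

```pascal
program BomDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// hypothetical helper: formats a byte array as space-separated hex values
function BytesToHex (const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToHex (B, 2) + ' ';
  Result := Trim (Result);
end;

begin
  // the preamble of an encoding is its Byte Order Mark
  Writeln ('UTF-8:     ', BytesToHex (TEncoding.UTF8.GetPreamble));
  Writeln ('UTF-16 LE: ', BytesToHex (TEncoding.Unicode.GetPreamble));
  Writeln ('UTF-16 BE: ', BytesToHex (TEncoding.BigEndianUnicode.GetPreamble));

  // BOM plus encoded text gives the content of a UTF-8 file holding AB
  Writeln ('File:      ',
    BytesToHex (TEncoding.UTF8.GetPreamble), ' ',
    BytesToHex (TEncoding.UTF8.GetBytes ('AB')));
end.
```

The last line should print EF BB BF 41 42, matching the five bytes listed above.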

note On the other hand, when you are receiving data from a web request or through other Internet
protocols, you might have a specific header (part of the protocol) indicating the encoding, rather
than relying on a BOM.


Looking at Unicode
How do we create a table of Unicode characters like those I displayed earlier for
ASCII ones? We can start by displaying code points in the Basic Multilingual Plane
(BMP), excluding what are called surrogate pairs.

note Not all numeric values are true UTF-16 code points, since some numeric values (called
surrogates) are not valid characters on their own, but are used in pairs to represent code points
above 65535. A good example of a surrogate pair is the symbol used in music scores for the F (or
bass) clef, 𝄢. It is code point $1D122, which is represented in UTF-16 by two values, $D834
followed by $DD22.
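
The two values in the note follow from the arithmetic UTF-16 defines for code points above 65535: subtract $10000, then add the upper 10 bits of the result to $D800 and the lower 10 bits to $DC00. The sketch below works this out for the F clef; the program and variable names are made up for illustration.

```pascal
program SurrogateDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  CodePoint = $1D122;  // musical symbol F clef
var
  Offset, HiSurr, LoSurr: Integer;
begin
  // code points above $FFFF are first reduced to a 20-bit offset
  Offset := CodePoint - $10000;
  // the upper 10 bits go into the high (first) surrogate
  HiSurr := $D800 + (Offset shr 10);
  // the lower 10 bits go into the low (second) surrogate
  LoSurr := $DC00 + (Offset and $3FF);
  Writeln (IntToHex (HiSurr, 4), ' ', IntToHex (LoSurr, 4));  // D834 DD22
end.
```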

Displaying all of the elements of the BMP would require a 256 * 256 grid, hard to
accommodate on screen. This is why the ShowUnicode application project has a tab
control with two pages: the first page has the primary selector with 256 blocks, while
the second page shows a grid with the actual Unicode elements, one section at a time.
This program has a little more of a user interface than most others in the book, and
you can simply skim through its code if you are only interested in its output (and not
the internals).
When the program starts, it fills the ListView control in the first page of the
TabControl with 256 entries, each indicating the first and last character of a group
of 256. Here is the actual code of the OnCreate event handler of the form and a simple
function used to display each element, while the corresponding output is in Figure 6.3:
// helper function: describes a character by its numeric value, showing
// the character itself unless it is a control character
function GetCharDescr (nChar: Integer): string;
begin
  if Char(nChar).IsControl then
    Result := 'Char #' + IntToStr (nChar) + ' [ ]'
  else
    Result := 'Char #' + IntToStr (nChar) +
      ' [' + Char (nChar) + ']';
end;

procedure TForm2.FormCreate(Sender: TObject);
var
  I: Integer;
  ListItem: TListViewItem;
begin
  for I := 0 to 255 do // 256 pages * 256 characters each
  begin
    ListItem := ListView1.Items.Add;
    ListItem.Tag := I;
    // pages 216..223 ($D8..$DF) hold the surrogates, not actual characters
    if (I < 216) or (I > 223) then
      ListItem.Text :=
        GetCharDescr(I*256) + '/' + GetCharDescr(I*256+255)
    else
