162 - 06: All About Strings

    ListItem.Text := 'Surrogate Code Points';
  end;
end;

Figure 6.3: The first page of the ShowUnicode application project has a long list of
sections of Unicode characters

Notice how the code saves the number of the “page” in the Tag property of each item
of the ListView; this value is used later to fill a page. When a user selects one of the
items, the application moves to the second page of the TabControl, filling its string
grid with the 256 characters of the section:
procedure TForm2.ListView1ItemClick(const Sender: TObject;
  const AItem: TListViewItem);
var
  I, NStart: Integer;
begin
  // The Tag holds the section number; each section covers 256 code units
  NStart := AItem.Tag * 256;
  for I := 0 to 255 do
  begin
    // Fill a 16x16 grid, leaving control characters blank
    StringGrid1.Cells [I mod 16, I div 16] :=
      IfThen (not Char(I + NStart).IsControl, Char (I + NStart), '');
  end;
  TabControl1.ActiveTab := TabItem2;
end;

Marco Cantù, Object Pascal Handbook


The IfThen function used in the code above is a two-way test: if the condition passed
in the first parameter is true, the function returns the value of the second parameter;
if not, it returns the value of the third one. The test in the first parameter uses the
IsControl method of the Char type helper to filter out non-printable control characters.

note The IfThen function operates more or less like the ?: operator of most programming languages
based on the C syntax. There is a version for strings and a separate one for Integers. For the string
version you have to include the System.StrUtils unit, for the Integer version of IfThen the
System.Math unit.
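
The two flavors of IfThen can be compared in a small console sketch. The program name and the Count variable are made up for illustration, and the sketch assumes the Integer overload is the one declared in System.Math:

```pascal
program IfThenSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.StrUtils, // string version of IfThen
  System.Math;     // Integer version of IfThen (assumed to live here)

var
  Count: Integer; // hypothetical value, only for this sketch
begin
  Count := 3;
  // String overload: returns the second or third parameter as a string
  Writeln(IfThen(Count > 1, 'items', 'item'));
  // Integer overload: same shape, numeric result
  Writeln(IfThen(Count > 10, 10, Count));
end.
```

Unlike the ?: operator, IfThen is a regular function, so both the second and the third parameter are evaluated before the call, whatever the value of the condition.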

The grid of Unicode characters produced by the application is visible in Figure 6.4.
Notice that the output varies depending on the ability of the selected font and the
specific operating system to display a given Unicode character.

Figure 6.4: The second page of the ShowUnicode application project has some of the
actual Unicode characters


The Char Type Revisited


After this introduction to Unicode, let's get back to the real topic of this chapter,
which is how the Object Pascal language manages characters and strings. I introduced
the Char data type in Chapter 2, and mentioned some of the type helper
functions available in the Character unit. Now that you have a better understanding
of Unicode, it is worth revisiting that section and going through some more details.
First of all, the Char type does not invariably represent a Unicode code point. The
data type, in fact, uses 2 bytes for each element. While it does represent a code point
for elements in Unicode's Basic Multilingual Plane (BMP), a Char can also be one half
of a pair of surrogate values that together represent a code point.
Technically, there is a different type you could use to represent any Unicode code
point directly, and this is the UCS4Char type (which uses 4 bytes to represent a
value). This type is rarely used, as the extra memory required is generally hard to
justify, but you can see that the Character unit (covered next) also includes several
operations for this data type.
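
The difference between a Char element and a code point can be seen in a minimal console sketch like the following. The program name and the sample character are chosen only for illustration, and the surrogate pair is written out by hand as two character literals:

```pascal
program CharSizeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

var
  S: string;
begin
  Writeln(SizeOf(Char));     // size in bytes of one UTF-16 element
  Writeln(SizeOf(UCS4Char)); // size in bytes of a full code point value

  // U+1D11E (musical symbol G clef) lies outside the BMP, so in UTF-16
  // it takes the surrogate pair $D834 + $DD1E
  S := #$D834#$DD1E;
  Writeln(Length(S));                  // counts Char elements, not code points
  Writeln(S.Chars[0].IsHighSurrogate); // first half of the pair
  Writeln(S.Chars[1].IsLowSurrogate);  // second half of the pair
end.
```

The sketch uses the zero-based Chars property of the string helper rather than the square brackets notation, so it does not depend on the string indexing mode of the compiler.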
Back to the Char type, remember it is an ordinal type (even if a rather large
one), so it has the notion of sequence and offers core operations like Ord, Inc, Dec,
High, and Low. Most extended operations, including the specific type helper, are not
part of the basic system RTL units but require the inclusion of the Character unit.
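
As a quick reminder of these core operations, here is a minimal sketch (the program and variable names are illustrative) that needs no extra unit at all:

```pascal
program CharOrdinalSketch;

{$APPTYPE CONSOLE}

var
  Ch: Char;
begin
  Ch := 'a';
  Writeln(Ord(Ch));         // numeric value of the element: 97
  Inc(Ch);                  // move to the next element in the sequence
  Writeln(Ch);              // b
  Dec(Ch);                  // and back to the previous one
  Writeln(Ch);              // a
  Writeln(Ord(Low(Char)));  // 0
  Writeln(Ord(High(Char))); // 65535, the largest 2-byte value
end.
```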

Unicode Operations With The Character Unit


Most of the specific operations for Unicode characters (and also Unicode strings, of
course) are defined in a dedicated unit called System.Character. This unit defines the
TCharHelper helper for the Char type, which lets you apply operations directly to
variables of that type.

note The Character unit also defines a TCharacter record, which is basically a collection of static class
functions, plus a number of global routines mapped to these methods. These are older, deprecated
functions, given that now the preferred way to work on the Char type at the Unicode level is the
use of the type helper.

The unit also defines two interesting enumerated types. The first is called
TUnicodeCategory and maps the various characters in broad categories like control, space,
uppercase or lowercase letter, decimal number, punctuation, math symbol, and
many more. The second enumeration is called TUnicodeBreak and defines the family
of the various spaces, hyphens, and breaks. If you are used to ASCII operations, this
is a big change.
Numbers in Unicode are not only the characters between 0 and 9; spaces are not
limited to the character #32; and so on for many other assumptions of the (much
simpler) 256-element alphabet.
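
A short console sketch can make the point about digits and spaces concrete. The program name, the variable names, and the two sample characters are picked only for illustration, and the results depend on the Unicode tables behind the Character unit:

```pascal
program CharCategorySketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

var
  ArabicThree, EmSpace: Char;
begin
  ArabicThree := #$0663; // ARABIC-INDIC DIGIT THREE
  EmSpace := #$2003;     // EM SPACE

  // A decimal digit that is not in the '0'..'9' range
  Writeln(ArabicThree.IsDigit);
  Writeln(ArabicThree.GetUnicodeCategory = TUnicodeCategory.ucDecimalNumber);
  Writeln(ArabicThree.GetNumericValue:0:0);

  // A white space character that is not #32
  Writeln(EmSpace.IsWhiteSpace);
end.
```

A test like (Ch >= '0') and (Ch <= '9') would reject the first character, and a comparison with #32 would miss the second one; both are examples of the ASCII assumptions mentioned above.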
The Char type helper has over 40 methods that comprise many different tests and
operations. They can be used for:
• Getting the numeric representation of the character (GetNumericValue).
• Asking for the category (GetUnicodeCategory) or checking it against one of the
various categories (IsLetterOrDigit, IsLetter, IsDigit, IsNumber, IsControl,
IsWhiteSpace, IsPunctuation, IsSymbol, and IsSeparator). I used the IsControl
operation in the previous demo.
• Checking if it is lowercase or uppercase (IsLower and IsUpper) or converting it
(ToLower and ToUpper).
• Verifying if it is part of a UTF-16 surrogate pair (IsSurrogate, IsLowSurrogate,
and IsHighSurrogate) and converting surrogate pairs in various ways.
• Converting it to and from UTF-32 (ConvertFromUtf32 and ConvertToUtf32) and
the UCS4Char type (ToUCS4Char).
• Checking if it is part of a given list of characters (IsInArray).
Notice that some of these operations can be applied to the type as a whole, rather
than to a specific variable. In that case you have to call them using the Char type as
prefix, as in the second code snippet below.
To experiment a bit with these operations on Unicode characters, I've created an
application project called CharTest. One of the examples of this demo is the effect of
calling uppercase and lowercase operations on Unicode elements. In fact, the classic
UpCase function of the RTL works only for the 26 base English language characters
of the ANSI representation, while it fails for some Unicode characters that do have a
specific uppercase representation (not all alphabets have the concept of uppercase, so
this is not a universal notion).
To test this scenario, in the CharTest application project I've added the following
snippet that tries to convert an accented letter to uppercase:
var
  ch1: Char;
begin
  ch1 := 'ù';
  Show ('UpCase ù: ' + UpCase(ch1));
  Show ('ToUpper ù: ' + ch1.ToUpper);
