186 - 06: All About Strings

...
public
  class property ASCII: TEncoding read GetASCII;
  class property BigEndianUnicode: TEncoding
    read GetBigEndianUnicode;
  class property Default: TEncoding read GetDefault;
  class property Unicode: TEncoding read GetUnicode;
  class property UTF7: TEncoding read GetUTF7;
  class property UTF8: TEncoding read GetUTF8;

note The Unicode encoding is based on the TUnicodeEncoding class, which uses the same UTF-16 LE
(Little Endian) format used by the string type. BigEndianUnicode, instead, uses the less common
Big Endian representation. If you are not familiar with “endianness”, this is a term used to
indicate the order of the bytes making up a code point (or any other multi-byte data structure).
Little Endian stores the least significant byte first, while Big Endian stores the most significant
byte first. For more information, see en.wikipedia.org/wiki/Endianness.
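
As a minimal sketch of the byte order difference described in the note, the following console program encodes the same single character with both encodings. The program name and the BytesToText helper are hypothetical and used only for illustration; the character is written as a numeric code so the result does not depend on the encoding of the source file.

```pascal
program EndianDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the bytes as space-separated decimal values
function BytesToText(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToStr(B) + ' ';
  Result := Trim(Result);
end;

var
  S: string;
begin
  S := #$0250; // code point U+0250, high byte $02 and low byte $50 (80)
  // UTF-16 LE: least significant byte first, prints 80 2
  Writeln('Little Endian: ', BytesToText(TEncoding.Unicode.GetBytes(S)));
  // UTF-16 BE: most significant byte first, prints 2 80
  Writeln('Big Endian:    ', BytesToText(TEncoding.BigEndianUnicode.GetBytes(S)));
end.
```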

Again, rather than exploring these classes in general, something a little difficult at
this point of the book, let's focus on a couple of practical examples. The TEncoding
class has methods for reading and writing Unicode strings to byte arrays, performing
the appropriate conversions.
To demonstrate UTF format conversions via TEncoding classes, but also to keep my
example simple and focused and avoid working with the file system, in the
EncodingsTest application project I've created a UTF-8 string in memory using some
specific data, and converted it to UTF-16 with a single function call:
var
  Utf8string: TBytes;
  Utf16string: string;
begin
  // process Utf8data
  SetLength (Utf8string, 3);
  Utf8string[0] := Ord ('a'); // single byte ASCII char < 128
  Utf8string[1] := $c9; // first of two bytes for U+0250, Latin small letter turned a
  Utf8string[2] := $90; // second byte of the same code point
  Utf16string := TEncoding.UTF8.GetString(Utf8string);
  Show ('Unicode: ' + Utf16string);
The output should be:
Unicode: aɐ
Now to better understand the conversion and the difference in the representations,
I've added the following code:
  Show ('Utf8 bytes:');
  for AByte in Utf8String do
    Show (AByte.ToString);

  Show ('Utf16 bytes:');
  UniBytes := TEncoding.Unicode.GetBytes (Utf16string);
  for AByte in UniBytes do
    Show (AByte.ToString);
This code produces a memory dump, in decimal values, of the two representations
of the string: UTF-8 (a one-byte and a two-byte code point) and UTF-16 (with
both code points taking 2 bytes, least significant byte first, so that 80 and 2
are the bytes $50 and $02 of code point $0250):
Utf8 bytes:
97
201
144
Utf16 bytes:
97
0
80
2
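
To see how the two UTF-8 bytes 201 and 144 ($C9 and $90) relate to the UTF-16 bytes 80 and 2, here is a minimal sketch of the two-byte UTF-8 decoding rule (bit patterns 110xxxxx 10yyyyyy). The program and the DecodeTwoByteUtf8 function are hypothetical names, not part of the RTL; in real code TEncoding.UTF8.GetString does this work.

```pascal
program Utf8DecodeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: combines a two-byte UTF-8 sequence into a code point.
// The lead byte contributes its low 5 bits, the continuation byte its low 6.
function DecodeTwoByteUtf8(Lead, Cont: Byte): Integer;
begin
  Result := ((Integer(Lead) and $1F) shl 6) or (Integer(Cont) and $3F);
end;

var
  CodePoint: Integer;
begin
  CodePoint := DecodeTwoByteUtf8($C9, $90);
  Writeln('Code point: $', IntToHex(CodePoint, 4)); // $0250
  Writeln('Low byte:   ', CodePoint and $FF);       // 80, stored first in UTF-16 LE
  Writeln('High byte:  ', CodePoint shr 8);         // 2, stored second in UTF-16 LE
end.
```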
Notice that direct character to byte conversion, for UTF-8, works only for 7-bit
ASCII characters, that is, values up to 127. For higher ANSI characters there is no
direct mapping and you must perform a conversion using the specific encoding
(which will, however, fail on multi-byte UTF-8 elements). So both of the following
produce wrong output:
// error: cannot use char > 127
Utf8string[0] := Ord ('à');
Utf16string := TEncoding.UTF8.GetString(Utf8string);
Show ('Wrong high ANSI: ' + Utf16string);
// try a different conversion
Utf16string := TEncoding.ANSI.GetString(Utf8string);
Show ('Wrong double byte: ' + Utf16string);

// output
Wrong high ANSI:
Wrong double byte: àÉ
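
For contrast, a minimal sketch of one way to obtain valid UTF-8 bytes for a character above 127: let the encoding class produce them instead of storing the Ord value in a single byte. The program name is hypothetical, and the character à is written as the code #$00E0 to avoid depending on the source file encoding.

```pascal
program Utf8HighCharDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S, RoundTrip: string;
  Utf8Bytes: TBytes;
  B: Byte;
begin
  S := #$00E0; // the character à, code point U+00E0
  Utf8Bytes := TEncoding.UTF8.GetBytes(S);

  // U+00E0 takes two bytes in UTF-8: 195 160 ($C3 $A0)
  for B in Utf8Bytes do
    Writeln(B);

  // Decoding the bytes with the same encoding returns the original string
  RoundTrip := TEncoding.UTF8.GetString(Utf8Bytes);
  Writeln(RoundTrip = S); // TRUE
end.
```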
The encoding classes let you convert in both directions, so in this case I'm
converting from UTF-16 to UTF-8, doing some processing of the UTF-8 string (something
to be done with care, given the variable length nature of this format), and converting
back to UTF-16:
var
  Utf8string: TBytes;
  Utf16string: string;
  I: Integer;
begin
  Utf16string := 'This is my nice string with à and Æ';
  Show ('Initial: ' + Utf16string);

  Utf8string := TEncoding.UTF8.GetBytes(Utf16string);
  for I := 0 to High(Utf8string) do
    if Utf8string[I] = Ord('i') then
      Utf8string[I] := Ord('I');
  Utf16string := TEncoding.UTF8.GetString(Utf8string);
  Show ('Final: ' + Utf16string);

The output is:


Initial: This is my nice string with à and Æ
Final: ThIs Is my nIce strIng wIth à and Æ
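
The byte-level replacement above is safe because, in UTF-8, every byte of a multi-byte sequence has a value of 128 or more, so a byte equal to Ord('i') can only be a real 'i'. Other kinds of processing need to be aware of sequence boundaries. As a minimal sketch, assuming well-formed input, the hypothetical CountCodePoints function below counts code points by skipping continuation bytes (bit pattern 10xxxxxx):

```pascal
program Utf8ScanDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: counts code points in well-formed UTF-8 data by
// counting every byte that is not a continuation byte (10xxxxxx)
function CountCodePoints(const Bytes: TBytes): Integer;
var
  B: Byte;
begin
  Result := 0;
  for B in Bytes do
    if (B and $C0) <> $80 then
      Inc(Result);
end;

var
  Utf8Bytes: TBytes;
begin
  // 'a' followed by à (U+00E0) and Æ (U+00C6), each two bytes in UTF-8
  Utf8Bytes := TEncoding.UTF8.GetBytes('a' + #$00E0 + #$00C6);
  Writeln('Bytes:       ', Length(Utf8Bytes));          // 5
  Writeln('Code points: ', CountCodePoints(Utf8Bytes)); // 3
end.
```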

Other Types for Strings


While the string data type is by far the most common and widely used type for
representing strings, Object Pascal desktop compilers had and still have a variety of
string types. Some of these types can also be used in mobile applications, but the
general recommendation is to do the appropriate conversion or just use TBytes
directly to manipulate strings with a 1-byte representation, as in the application
project described in the last section.
While developers who used Object Pascal in the past might have a lot of code based
on these pre-Unicode types (or directly managing UTF-8), modern applications
really require full Unicode support. Also, while some types, like UTF8String, are
available in the desktop compilers, their support in terms of RTL is severely limited.
Again, you can use an array of bytes to represent a similar type and adapt existing
code to handle it, but the recommendation is to move to plain and standard Unicode
strings.

note While there has been a lot of discussion and criticism about the lack of native types like AnsiString
and UTF8String in the Object Pascal mobile compilers, honestly there is almost no other
programming language out there that has more than one native or intrinsic string type. Multiple
string types are more complex to master, can cause unwanted side effects (like extensive
automatic conversion calls that slow down programs), and cost a lot for the maintenance of multiple
versions of all of the string management and processing functions.

The UCS4String type


An interesting but little used string type is the UCS4String type, available on all
compilers. This is just a UTF-32 representation of a string, and no more than an
array of UTF32Char elements, or 4-byte characters. The reason behind this type, as
mentioned earlier, is that it offers a direct representation of all of the Unicode code
points. The obvious drawback is that such a string takes twice as much memory as a
UTF-16 string (which already takes twice as much as an ANSI string).

Although this data type can be used in specific situations, it is not particularly suited
for general circumstances. Also, this type doesn't support copy-on-write, nor does it
have any real system functions and procedures for processing it.

note Whilst the UCS4String guarantees one UTF32Char per Unicode code point, it cannot guarantee
one UTF32Char per grapheme, or “visual character”.
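
The difference between UTF-16 elements, code points and graphemes can be sketched with the UnicodeStringToUCS4String function of the System unit. The program name is hypothetical, and the sample characters are written as numeric codes. The sketch relies on the returned array carrying a trailing zero element, which is why one is subtracted from its length.

```pascal
program Ucs4Demo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Emoji, Accented: string;
  U4: UCS4String;
begin
  // U+1F600 needs a surrogate pair (two Char elements) in UTF-16
  Emoji := #$D83D#$DE00;
  U4 := UnicodeStringToUCS4String(Emoji);
  Writeln('UTF-16 elements: ', Length(Emoji));  // 2
  Writeln('Code points:     ', Length(U4) - 1); // 1 (minus the trailing zero)

  // 'e' followed by the combining acute accent U+0301 is displayed as a
  // single visual character but is still made of two code points
  Accented := 'e' + #$0301;
  U4 := UnicodeStringToUCS4String(Accented);
  Writeln('Code points:     ', Length(U4) - 1); // 2
end.
```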

Older, Desktop Only String Types


As mentioned, the desktop versions of the Object Pascal compilers offer support for
some older, traditional string types. These include
• The ShortString type, which corresponds to the original Pascal language string
type. These strings have a limit of 255 characters. Each element of a short string
is of type AnsiChar (a type also available only in desktop compilers).
• The AnsiString type, which corresponds to variable-length strings. These strings
are allocated dynamically, reference counted, and use a copy-on-write technique.
The size of these strings is almost unlimited (they can store up to two billion
characters!). This string type is also based on the AnsiChar type.
• The WideString type is similar to a 2-byte Unicode string in terms of
representation and is based on the Char type, but unlike the standard string type it
doesn't use copy-on-write and it is less efficient in terms of memory allocation. If you
wonder why it was added to the language, the reason was compatibility with
string management in Microsoft's COM architecture.
• UTF8String is a string based on the variable character length UTF-8 format. As I
mentioned, there is little run-time library support for this type.
• RawByteString is an array of characters with no code page set, on which no
character conversion is ever performed by the system (thus logically resembling a
TBytes structure, but allowing some direct string operations that an array of
bytes currently lacks).
• A string construction mechanism allowing you to define a 1-byte string associated
with a specific ISO code page, a remnant of the pre-Unicode past.
Again, all of these string types can be used on desktop compilers, but are available
only for backwards compatibility reasons. The goal is to use Unicode, TEncoding, and
other modern string management techniques whenever possible.
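
As an illustration of the last item in the list, this is a minimal sketch of a 1-byte string type tied to a code page, assuming a Windows desktop compiler. The type name Latin1String, the ToLatin1 function and the program name are hypothetical; 28591 is the Windows code page identifier for ISO 8859-1.

```pascal
program CodePageDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical type: a 1-byte string bound to ISO 8859-1 (code page 28591)
  Latin1String = type AnsiString(28591);

// Hypothetical helper: converts a Unicode string to the code page type;
// characters with no ISO 8859-1 equivalent would be lost in the conversion
function ToLatin1(const S: string): Latin1String;
begin
  Result := Latin1String(S);
end;

var
  L: Latin1String;
begin
  L := ToLatin1(#$00E0);                         // the character à
  Writeln('Length:    ', Length(L));             // one element of one byte
  Writeln('Byte:      ', Ord(L[1]));             // 224 ($E0) in ISO 8859-1
  Writeln('Code page: ', StringCodePage(L));
end.
```

The same character takes two bytes in a UTF8String, which is one reason why code mixing these types depends on the automatic conversions mentioned in the note earlier in this section.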

Marco Cantù, Object Pascal Handbook
