186 - 06: All About Strings: Unicode Tunicodeencoding
...
public
class property ASCII: TEncoding read GetASCII;
class property BigEndianUnicode: TEncoding
read GetBigEndianUnicode;
class property Default: TEncoding read GetDefault;
class property Unicode: TEncoding read GetUnicode;
class property UTF7: TEncoding read GetUTF7;
class property UTF8: TEncoding read GetUTF8;
note The Unicode encoding is based on the TUnicodeEncoding class, which uses the same UTF-16 LE
(Little Endian) format used by the string type. The BigEndianUnicode encoding, instead, uses the
less common Big Endian representation. If you are not familiar with “Endianness”, this is a term
used to indicate the order of the two bytes making up a UTF-16 code unit (or the bytes of any other
multi-byte value). Little Endian stores the least significant byte first, and Big Endian stores the
most significant byte first. For more information, see en.wikipedia.org/wiki/Endianness.
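The byte order described in the note can be observed directly by encoding the same character with both classes. The following is a minimal sketch, not code from the EncodingsTest project; the program name and the BytesToHex helper are hypothetical, introduced here only for illustration:

```pascal
program EndiannessSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of the RTL): formats a byte array
// as space-separated, two-digit hexadecimal values
function BytesToHex(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
  begin
    if Result <> '' then
      Result := Result + ' ';
    Result := Result + IntToHex(B, 2);
  end;
end;

var
  S: string;
begin
  S := Char($0250); // U+0250: its two bytes ($02 and $50) differ
  // UTF-16 LE: the least significant byte ($50) comes first
  Writeln('Unicode:          ',
    BytesToHex(TEncoding.Unicode.GetBytes(S)));
  // UTF-16 BE: the most significant byte ($02) comes first
  Writeln('BigEndianUnicode: ',
    BytesToHex(TEncoding.BigEndianUnicode.GetBytes(S)));
end.
```

GetBytes returns only the encoded characters, without a byte order mark, so each call should produce exactly two bytes here, in opposite order.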
Again, rather than exploring these classes in general, something a little difficult at
this point of the book, let's focus on a couple of practical examples. The TEncoding
class has methods for reading and writing Unicode strings to byte arrays, performing
appropriate conversions.
To demonstrate UTF format conversions via TEncoding classes, while keeping the
example simple and focused and avoiding the file system, in the EncodingsTest
application project I've created a UTF-8 string in memory using some
specific data, and converted it to UTF-16 with a single function call:
var
Utf8string: TBytes;
Utf16string: string;
begin
// process Utf8data
SetLength (Utf8string, 3);
Utf8string[0] := Ord ('a'); // single byte, ASCII char < 128
Utf8string[1] := $c9; // two bytes: Latin small letter turned a (U+0250)
Utf8string[2] := $90;
Utf16string := TEncoding.UTF8.GetString(Utf8string);
Show ('Unicode: ' + Utf16string);
The output should be:
Unicode: aɐ
Now to better understand the conversion and the difference in the representations,
I've added the following code:
Show ('Utf8 bytes:');
for AByte in Utf8String do
Show (AByte.ToString);
// output
Wrong high ANSI:
Wrong double byte: àÉ
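To make the difference between the two representations concrete, here is a minimal sketch (a separate console program, not part of the EncodingsTest project, with hypothetical names) that prints the same text first as UTF-8 bytes and then as UTF-16 code units:

```pascal
program RepresentationSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Utf8Bytes: TBytes;
  Utf16Text: string;
  B: Byte;
  C: Char;
begin
  // Same data as in the listing above: 'a' followed by U+0250 in UTF-8
  Utf8Bytes := TBytes.Create(Ord('a'), $C9, $90);
  Utf16Text := TEncoding.UTF8.GetString(Utf8Bytes);

  // UTF-8: one byte for 'a', two bytes for U+0250
  Write('UTF-8 bytes (', Length(Utf8Bytes), '):');
  for B in Utf8Bytes do
    Write(' ', IntToHex(B, 2));
  Writeln;

  // UTF-16: one 16-bit code unit for each of the two characters
  Write('UTF-16 code units (', Length(Utf16Text), '):');
  for C in Utf16Text do
    Write(' ', IntToHex(Ord(C), 4));
  Writeln;
end.
```

The three UTF-8 bytes (61 C9 90) collapse into two UTF-16 code units (0061 0250), which is why a byte index in one representation cannot be reused as a character index in the other.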
The encoding classes let you convert in both directions, so in this case I'm converting
from UTF-16 to UTF-8, doing some processing of the UTF-8 string (something
to be done with care, given the variable-length nature of this format), and converting
back to UTF-16:
var
Utf8string: TBytes;
Utf16string: string;
I: Integer;
begin
Utf16string := 'This is my nice string with à and Æ';
Show ('Initial: ' + Utf16string);
Utf8string := TEncoding.UTF8.GetBytes(Utf16string);
for I := 0 to High(Utf8string) do
if Utf8string[I] = Ord('i') then
Utf8string[I] := Ord('I');
Utf16string := TEncoding.UTF8.GetString(Utf8string);
Show ('Final: ' + Utf16string);
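The byte-level replacement above is safe because, in UTF-8, every byte of a multi-byte sequence has a value of $80 or higher, so a byte equal to Ord('i') can only be the ASCII letter itself. The final string should therefore read "ThIs Is my nIce strIng wIth à and Æ". Operations that depend on positions or lengths need more care, because the number of bytes differs from the number of code points. As a minimal sketch of this (the program and the CountUtf8CodePoints helper are hypothetical, and the helper assumes well-formed UTF-8 input):

```pascal
program Utf8LengthSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: counts code points in well-formed UTF-8 data by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx
function CountUtf8CodePoints(const Bytes: TBytes): Integer;
var
  B: Byte;
begin
  Result := 0;
  for B in Bytes do
    if (B and $C0) <> $80 then
      Inc(Result);
end;

var
  S: string;
  Utf8Bytes: TBytes;
begin
  // 'a' (1 byte), U+00E0 'à' (2 bytes), U+00C6 'Æ' (2 bytes)
  S := 'a' + Char($00E0) + Char($00C6);
  Utf8Bytes := TEncoding.UTF8.GetBytes(S);
  Writeln('Bytes:       ', Length(Utf8Bytes));
  Writeln('Code points: ', CountUtf8CodePoints(Utf8Bytes));
end.
```

For this input the byte count is 5 while the code point count is 3.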
note While there has been a lot of discussion and criticism about the lack of native types like AnsiString
and UTF8String in the Object Pascal mobile compilers, honestly there is almost no other
programming language out there that has more than one native or intrinsic string type. Multiple
string types are more complex to master, can cause unwanted side effects (like extensive
automatic conversion calls that slow down programs), and cost a lot for the maintenance of multiple
versions of all of the string management and processing functions.
Although this data type can be used in specific situations, it is not particularly suited
for general circumstances. Also, this type doesn't support copy-on-write, nor does it have
any real system functions and procedures for processing it.
note Whilst the UCS4String guarantees one UTF32Char per Unicode code point, it cannot guarantee
one UTF32Char per grapheme, or “visual character”.
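The distinction made in the note can be checked with a short sketch. The program and the CodePointCount helper below are hypothetical; the only RTL call used is UnicodeStringToUCS4String, whose result is assumed to end with a zero terminator that must not be counted:

```pascal
program CodePointSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: number of code points in a string, computed via
// UCS4String. The array returned by UnicodeStringToUCS4String ends with
// a zero terminator, which is excluded from the count.
function CodePointCount(const S: string): Integer;
begin
  Result := Length(UnicodeStringToUCS4String(S)) - 1;
end;

var
  Emoji, Accented: string;
begin
  // U+1F600 needs a surrogate pair in UTF-16 but is a single code point
  Emoji := Char($D83D) + Char($DE00);
  Writeln('Emoji:    ', Length(Emoji), ' UTF-16 code units, ',
    CodePointCount(Emoji), ' code point(s)');

  // 'e' followed by the combining acute accent U+0301 is displayed as
  // one visual character, yet it is still made of two code points
  Accented := 'e' + Char($0301);
  Writeln('Accented: ', Length(Accented), ' UTF-16 code units, ',
    CodePointCount(Accented), ' code point(s)');
end.
```

The first case shows what UCS4String does guarantee (one element per code point, even outside the Basic Multilingual Plane). The second shows what it does not: a single grapheme can still span several elements.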