186 - 06: All About Strings

...
public
class property ASCII: TEncoding read GetASCII;
class property BigEndianUnicode: TEncoding
read GetBigEndianUnicode;
class property Default: TEncoding read GetDefault;
class property Unicode: TEncoding read GetUnicode;
class property UTF7: TEncoding read GetUTF7;
class property UTF8: TEncoding read GetUTF8;

note The Unicode encoding is based on the TUnicodeEncoding class, which uses the same UTF-16 LE
(Little Endian) format used by the string type. BigEndianUnicode, instead, uses the less common
Big Endian representation. If you are not familiar with “Endianness”, this is a term used to
indicate the order of the bytes making up a multi-byte value, such as a 2-byte UTF-16 code unit
(or any other data structure). Little Endian has the least significant byte first, and Big
Endian has the most significant byte first. For more information, see
en.wikipedia.org/wiki/Endianness.
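
As a minimal sketch of the byte order difference described in the note, the following console program encodes the same single character with both encodings and prints the resulting bytes in hexadecimal. It assumes a Delphi compiler with the System.SysUtils unit; the BytesToHex helper is hypothetical and written only for this illustration.

```pascal
program EndianSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: formats a byte array as space-separated hex pairs
function BytesToHex(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToHex(B, 2) + ' ';
  Result := Trim(Result);
end;

var
  S: string;
begin
  S := #$0250; // a single UTF-16 code unit with the value $0250
  // Little Endian: the least significant byte ($50) comes first
  Writeln('Unicode:          ', BytesToHex(TEncoding.Unicode.GetBytes(S)));          // 50 02
  // Big Endian: the most significant byte ($02) comes first
  Writeln('BigEndianUnicode: ', BytesToHex(TEncoding.BigEndianUnicode.GetBytes(S))); // 02 50
end.
```

The two byte arrays hold the same code unit; only the order of the two bytes differs.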

Again, rather than exploring these classes in general, something a little difficult at
this point of the book, let's focus on a couple of practical examples. The TEncoding
class has methods for reading and writing Unicode strings to byte arrays, performing
the appropriate conversions.
To demonstrate UTF format conversions via the TEncoding classes, while keeping the
example simple and focused and avoiding the file system, in the EncodingsTest
application project I've created a UTF-8 string in memory using some specific data,
and converted it to UTF-16 with a single function call:
var
  Utf8string: TBytes;
  Utf16string: string;
begin
  // build the UTF-8 data by hand
  SetLength (Utf8string, 3);
  Utf8string[0] := Ord ('a'); // single byte, ASCII char < 128
  Utf8string[1] := $c9; // two-byte sequence, turned Latin a (U+0250)
  Utf8string[2] := $90;
  Utf16string := TEncoding.UTF8.GetString(Utf8string);
  Show ('Unicode: ' + Utf16string);
The output should be:
Unicode: aɐ
Now to better understand the conversion and the difference in the representations,
I've added the following code:
  Show ('Utf8 bytes:');
  for AByte in Utf8String do
    Show (AByte.ToString);

  Show ('Utf16 bytes:');
  UniBytes := TEncoding.Unicode.GetBytes (Utf16string);
  for AByte in UniBytes do
    Show (AByte.ToString);
This code produces a memory dump, with decimal values, for the two representa-
tions of the string, UTF-8 (a one byte and a two byte code point) and UTF-16 (with
both code points being 2 bytes):
Utf8 bytes:
97
201
144
Utf16 bytes:
97
0
80
2
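
To see how the two dumps relate, here is a small sketch of the arithmetic that turns the two UTF-8 bytes 201 and 144 into the code point stored as 80 and 2 in UTF-16 LE. The DecodeTwoByteUtf8 function is a hypothetical helper, not part of the RTL, and it handles only the two-byte form (110xxxxx 10yyyyyy) without any validation.

```pascal
program Utf8DecodeSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: decodes a two-byte UTF-8 sequence (110xxxxx 10yyyyyy)
// by keeping the 5 payload bits of the lead byte and the 6 of the trail byte
function DecodeTwoByteUtf8(Lead, Trail: Byte): Integer;
begin
  Result := ((Lead and $1F) shl 6) or (Trail and $3F);
end;

var
  CodePoint: Integer;
begin
  CodePoint := DecodeTwoByteUtf8(201, 144); // the bytes $C9 $90 from the dump
  Writeln('Code point: $', IntToHex(CodePoint, 4)); // $0250
  Writeln('Low byte:   ', CodePoint and $FF);       // 80
  Writeln('High byte:  ', CodePoint shr 8);         // 2
end.
```

The low byte comes first in the UTF-16 dump because TEncoding.Unicode is Little Endian.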
Notice that direct character-to-byte conversion, for UTF-8, works only for 7-bit
ASCII characters, that is, values up to 127. For higher ANSI characters there is no
direct mapping and you must perform a conversion, using the specific encoding
(which will, however, fail on multi-byte UTF-8 elements). So both of the following
produce wrong output:
// error: cannot directly use a char > 127
Utf8string[0] := Ord ('à');
Utf16string := TEncoding.UTF8.GetString(Utf8string);
Show ('Wrong high ANSI: ' + Utf16string);
// try different conversion
Utf16string := TEncoding.ANSI.GetString(Utf8string);
Show ('Wrong double byte: ' + Utf16string);

// output
Wrong high ANSI:
Wrong double byte: àÉ
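
For comparison, a working alternative is to let the encoding class produce the bytes rather than assigning Ord of a high character to a single byte. The following is a minimal sketch, assuming a Delphi console application; the variable names are illustrative, and the character is written as #$00E0 (the code for 'à') so the result does not depend on the encoding of the source file.

```pascal
program HighCharSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Source, RoundTrip: string;
  Utf8Bytes: TBytes;
begin
  Source := #$00E0; // 'à', above the 7-bit ASCII range
  // The encoding class generates the proper two-byte UTF-8 sequence
  Utf8Bytes := TEncoding.UTF8.GetBytes(Source);
  Writeln(Length(Utf8Bytes));               // 2
  Writeln(Utf8Bytes[0], ' ', Utf8Bytes[1]); // 195 160
  // Converting back with the same encoding restores the original string
  RoundTrip := TEncoding.UTF8.GetString(Utf8Bytes);
  Writeln(RoundTrip = Source);              // TRUE
end.
```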
The encoding classes let you convert in both directions, so in this case I'm converting
from UTF-16 to UTF-8, doing some processing of the UTF-8 string (something
to be done with care, given the variable length nature of this format), and converting
back to UTF-16:
var
Utf8string: TBytes;
Utf16string: string;
I: Integer;
begin
Utf16string := 'This is my nice string with à and Æ';
Show ('Initial: ' + Utf16string);

Utf8string := TEncoding.UTF8.GetBytes(Utf16string);
for I := 0 to High(Utf8string) do
if Utf8string[I] = Ord('i') then
Utf8string[I] := Ord('I');
Utf16string := TEncoding.UTF8.GetString(Utf8string);
Show ('Final: ' + Utf16string);

The output is:

Initial: This is my nice string with à and Æ
Final: ThIs Is my nIce strIng wIth à and Æ
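
The byte-by-byte replacement above is safe because in UTF-8 every byte of a multi-byte sequence has its high bit set, so the value of an ASCII letter such as 'i' can never appear inside the encoding of another character. The same property makes it possible to count code points by skipping continuation bytes (those of the form 10xxxxxx). The following is a sketch only: CountCodePoints is a hypothetical helper that assumes well-formed UTF-8 input and performs no validation.

```pascal
program Utf8CountSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: counts code points in well-formed UTF-8 data by
// ignoring continuation bytes, which always match the pattern 10xxxxxx
function CountCodePoints(const Utf8: TBytes): Integer;
var
  B: Byte;
begin
  Result := 0;
  for B in Utf8 do
    if (B and $C0) <> $80 then
      Inc(Result);
end;

var
  Utf8Data: TBytes;
begin
  // 'a' followed by the two-byte sequence $C9 $90
  Utf8Data := TBytes.Create(97, $C9, $90);
  Writeln(Length(Utf8Data));          // 3 bytes
  Writeln(CountCodePoints(Utf8Data)); // 2 code points
end.
```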

Other Types for Strings

While the string data type is by far the most common and widely used type for
representing strings, Object Pascal desktop compilers had and still have a variety of
string types. Some of these types can also be used in mobile applications, but the
general recommendation is to do the appropriate conversion or just use TBytes
directly to manipulate strings with a 1-byte representation, as in the application
project described in the last section.
While developers who used Object Pascal in the past might have a lot of code based
on these pre-Unicode types (or directly managing UTF-8), modern applications
really require full Unicode support. Also, while some types, like UTF8String, are
available in the desktop compilers, their support in terms of RTL is severely limited.
Again, you can use an array of bytes to represent a similar type and adapt existing
code to handle it, but the recommendation is to move to plain and standard Unicode
strings.

note While there has been a lot of discussion and criticism about the lack of native types like AnsiString
and UTF8String in the Object Pascal mobile compilers, honestly there is almost no other
programming language out there that has more than one native or intrinsic string type. Multiple
string types are more complex to master, can cause unwanted side effects (like extensive
automatic conversion calls that slow down programs), and cost a lot for the maintenance of multiple
versions of all of the string management and processing functions.

The UCS4String type

An interesting but little used string type is the UCS4String type, available on all
compilers. This is just a UTF-32 representation of a string, and no more than an
array of UTF32Char elements, or 4-byte characters. The reason behind this type, as
mentioned earlier, is that it offers a direct representation of all of the Unicode code
points. The obvious drawback is that such a string takes twice as much memory as a
UTF-16 string (which already takes twice as much as an ANSI string).

Although this data type can be used in specific situations, it is not particularly suited
for general circumstances. Also, this type doesn't support copy-on-write, and there
are hardly any system functions and procedures for processing it.

note Whilst the UCS4String guarantees one UTF32Char per Unicode code point, it cannot guarantee
one UTF32Char per grapheme, or “visual character”.
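
As a rough illustration of the "one element per code point" property, the sketch below converts a string holding a single character outside the Basic Multilingual Plane, using the UnicodeStringToUCS4String function of the System unit. It assumes a Delphi console application; note that the resulting array carries a trailing zero element, which is why the code subtracts one from its length.

```pascal
program Ucs4Sketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  U: UCS4String;
begin
  // U+1F600 needs a surrogate pair (two Char elements) in UTF-16
  S := #$D83D#$DE00;
  U := UnicodeStringToUCS4String(S);
  Writeln(Length(S));                  // 2 UTF-16 code units
  Writeln(Length(U) - 1);              // 1 code point (plus the zero terminator)
  Writeln(IntToHex(Integer(U[0]), 5)); // 1F600
end.
```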

Older, Desktop Only String Types

As mentioned, the desktop versions of the Object Pascal compilers offer support for
some older, traditional string types. These include:
• The ShortString type, which corresponds to the original Pascal language string
type. These strings have a limit of 255 characters. Each element of a short string
is of type ANSIChar (a type also available only in desktop compilers).
• The ANSIString type, which corresponds to variable-length strings. These strings
are allocated dynamically, reference counted, and use a copy-on-write technique.
The size of these strings is almost unlimited (they can store up to two billion
characters!). This string type is also based on the ANSIChar type.
• The WideString type is similar to a 2-byte Unicode string in terms of
representation and is based on the Char type, but unlike the standard string type it
doesn't use copy-on-write and it is less efficient in terms of memory allocation. If
you wonder why it was added to the language, the reason was for compatibility with
string management in Microsoft's COM architecture.
• UTF8String is a string based on the variable character length UTF-8 format. As I
mentioned, there is little run-time library support for this type.
• RawByteString is an array of characters with no code page set, on which no
character conversion is ever performed by the system (thus logically resembling a
TBytes structure, but allowing some direct string operations that an array of
bytes currently lacks).
• A string construction mechanism allowing you to define a 1-byte string associated
with a specific ISO code page, a remnant of the pre-Unicode past.
Again, all of these string types can be used on desktop compilers, but are available
only for backwards compatibility reasons. The goal is to use Unicode, TEncoding, and
other modern string management techniques whenever possible.
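
To make the last item of the list more concrete, here is a hedged sketch of a 1-byte string type tied to a code page, next to a UTF8String. It assumes a desktop compiler where AnsiString and UTF8String are available; the Latin1String name is made up for this example, and 28591 is the Windows code page identifier for ISO 8859-1.

```pascal
program LegacyStringsSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical type name: a 1-byte string bound to ISO 8859-1 (code page 28591)
  Latin1String = type AnsiString(28591);

var
  S: string;
  L: Latin1String;
  U8: UTF8String;
begin
  S := #$00E0;          // 'à' as a UTF-16 string
  L := Latin1String(S); // converted to a single byte in ISO 8859-1
  U8 := UTF8String(S);  // converted to a two-byte UTF-8 sequence
  Writeln(Length(L));   // should print 1
  Writeln(Length(U8));  // should print 2
  // Both convert back to the same Unicode string
  Writeln(string(L) = string(U8)); // should print TRUE
end.
```

Each assignment triggers an automatic code page conversion, which is the kind of hidden cost that leads to the recommendation to standardize on plain Unicode strings.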

Marco Cantù, Object Pascal Handbook
