158 - 06: All About Strings

representing the Latin letter a (#$0061) followed by the code point representing the
grave accent (#$0300), this should be displayed as a single accented character.
In Object Pascal coding terms, if you write the following (part of the CodePoints
application project), the message will have one single accented character, as in Fig-
ure 6.2.
var
  str: String;
begin
  str := #$0061 + #$0300;
  ShowMessage (str);

Figure 6.2: A single grapheme can be the result of multiple code points

In this case we have two characters, representing two code points, but only one
grapheme (or visual element). The fact is that while in the Latin alphabet you can
use a specific Unicode code point to represent the given grapheme (letter a with
grave accent is code point $00E0), in other alphabets combining Unicode code
points is the only way to obtain a given grapheme (and the correct output).
Even if the display is that of an accented character, there is no automatic normaliza-
tion or transformation of the value (only a proper display), so the string internally
remains different from one with the single character à.
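
The difference between the two internal representations can be checked directly. The following console program is only a minimal sketch, not part of the CodePoints project: the program name and the CodeUnitsToHex helper are hypothetical, introduced here for illustration.

```pascal
program CombiningCompare;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: lists the UTF-16 code units of a string as hex values
function CodeUnitsToHex(const S: string): string;
var
  Ch: Char;
begin
  Result := '';
  for Ch in S do
    Result := Result + IntToHex(Integer(Ord(Ch)), 4) + ' ';
  Result := Trim(Result);
end;

var
  Combined, Precomposed: string;
begin
  Combined := #$0061 + #$0300;  // letter a followed by combining grave accent
  Precomposed := #$00E0;        // single code point for the same grapheme

  Writeln(Length(Combined));             // 2
  Writeln(Length(Precomposed));          // 1
  Writeln(Combined = Precomposed);       // FALSE
  Writeln(CodeUnitsToHex(Combined));     // 0061 0300
  Writeln(CodeUnitsToHex(Precomposed));  // 00E0
end.
```

The plain = operator compares the stored code units, so two strings that look identical on screen can still compare as different.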

note The rendering of graphemes from multiple code points might depend on specific support from the
operating system and on text rendering techniques being used, so you might find out that for
some of the graphemes not all operating systems offer the correct output.

From Code Points to Bytes (UTF)

While ASCII used a direct and easy mapping of characters to their numeric
representation, Unicode uses a more complex approach. As I mentioned, every element of
the Unicode alphabet has an associated code point, but the mapping to the actual
representation is often more complicated.

Marco Cantù, Object Pascal Handbook


One of the elements of confusion behind Unicode is that there are multiple ways to
represent the same code point (or Unicode character numerical value) in terms of
actual storage, of physical bytes, in memory or in a file.
The issue stems from the fact that the only way to represent all Unicode code points
in a simple and uniform way is to use four bytes for each code point. This gives
a fixed-length representation (each character always requires the same number
of bytes), but most developers would perceive this as too expensive in memory and
processing terms.

note In Object Pascal the Unicode code points can be represented directly in a 4-byte representation
by using the UCS4Char data type.

That's why the Unicode standard defines other representations, generally requiring
less memory, but in which the number of bytes for each symbol is different,
depending on its code point. The idea is to use a smaller representation for the most common
elements, and a longer one for those infrequently encountered.
The different physical representations of the Unicode code points are called Unicode
Transformation Formats (or UTF). These are algorithmic mappings, part of the Uni-
code standard, that map each code point (the absolute numeric representation of a
character) to a unique sequence of bytes representing the given character. Notice
that the mappings can be used in both directions, converting back and forth between
different representations.
The standard defines three of these formats, depending on how many bits are used
to represent the initial part of the set (the initial 128 characters): 8, 16, or 32. It is
interesting to notice that all three forms of encodings need at most 4 bytes of data
for each code point.
• UTF-8 transforms characters into a variable-length encoding of 1 to 4 bytes.
UTF-8 is popular for HTML and similar protocols, because it is quite compact
when most characters (like tags in HTML) fall within the ASCII subset.
• UTF-16 is popular in many operating systems (including Windows and Mac OS
X) and development environments. It is quite convenient as most characters fit in
two bytes, reasonably compact, and fast to process.
• UTF-32 makes a lot of sense for processing (all code points have the same
length), but it is memory consuming and has limited use in practice.
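
To make the size differences concrete, the following sketch prints how many bytes a few one-character strings need in each of the three formats. It is only an illustration under stated assumptions: the program name and the CodePointCount and ShowSizes helpers are hypothetical, and the UTF-32 size is computed as four bytes per code point rather than obtained from an encoder.

```pascal
program UtfSizes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: counts code points in a UTF-16 string by skipping
// the second element (low surrogate, $DC00..$DFFF) of each surrogate pair
function CodePointCount(const S: string): Integer;
var
  Ch: Char;
begin
  Result := 0;
  for Ch in S do
    if (Ord(Ch) < $DC00) or (Ord(Ch) > $DFFF) then
      Inc(Result);
end;

// Hypothetical helper: prints the storage size of S in the three formats
procedure ShowSizes(const Name, S: string);
begin
  Writeln(Name,
    ': UTF-8 = ', Length(TEncoding.UTF8.GetBytes(S)),
    ', UTF-16 = ', Length(TEncoding.Unicode.GetBytes(S)),
    ', UTF-32 = ', CodePointCount(S) * SizeOf(UCS4Char));
end;

begin
  ShowSizes('A', 'A');                  // UTF-8 = 1, UTF-16 = 2, UTF-32 = 4
  ShowSizes('a grave', #$00E0);         // UTF-8 = 2, UTF-16 = 2, UTF-32 = 4
  ShowSizes('euro sign', #$20AC);       // UTF-8 = 3, UTF-16 = 2, UTF-32 = 4
  ShowSizes('bass clef', #$D834#$DD22); // UTF-8 = 4, UTF-16 = 4, UTF-32 = 4
end.
```

The last line shows a code point outside the first 64K elements: it needs four bytes in all three formats.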
There is a common misconception that UTF-16 can map all code points directly with
two bytes, but since Unicode defines over 100,000 code points you can easily figure
out they won't fit into 64K elements. It is true, however, that at times developers use
only a subset of Unicode, to make it fit in a fixed-length representation of 2 bytes
per character. In the early days, this subset of Unicode was called UCS-2; now you
often see it referenced as the Basic Multilingual Plane (BMP). However, this is only a
subset of Unicode (one of many planes).

note A problem relating to multi-byte representations (UTF-16 and UTF-32) is deciding which of the
bytes comes first. According to the standard, all forms are allowed, so you can have UTF-16 BE
(big-endian) or LE (little-endian), and the same for UTF-32. The big-endian byte serialization has the
most significant byte first, the little-endian byte serialization has the least significant byte first.
The byte serialization is often marked in files, along with the UTF representation, with a header
called Byte Order Mark (BOM).
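
As a minimal sketch of the two byte orders, the program below encodes the same character with the little-endian and big-endian UTF-16 encodings of the TEncoding class. The program name and the BytesToHex helper are hypothetical, added only for this illustration.

```pascal
program ByteOrderDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: formats a byte array as space-separated hex values
function BytesToHex(const Bytes: TBytes): string;
var
  B: Byte;
begin
  Result := '';
  for B in Bytes do
    Result := Result + IntToHex(B, 2) + ' ';
  Result := Trim(Result);
end;

var
  S: string;
begin
  S := #$20AC;  // euro sign, code point $20AC

  // TEncoding.Unicode is UTF-16 little-endian: least significant byte first
  Writeln('UTF-16 LE: ', BytesToHex(TEncoding.Unicode.GetBytes(S)));          // AC 20

  // TEncoding.BigEndianUnicode puts the most significant byte first
  Writeln('UTF-16 BE: ', BytesToHex(TEncoding.BigEndianUnicode.GetBytes(S))); // 20 AC
end.
```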

The Byte Order Mark

When you have a text file storing Unicode characters, there is a way to indicate
which is the UTF format being used for the code points. The information is stored in
a header or marker at the beginning of the file, called Byte Order Mark (BOM). This
is a signature indicating the Unicode format being used and the byte order form (lit-
tle or big endian – LE or BE). The following table provides a summary of the various
BOMs, which can be 2, 3, or 4 bytes long:
Bytes          Encoding
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
We'll see later in this chapter how Object Pascal manages the BOM within its
streaming classes. The BOM appears at the very beginning of a file with the actual
Unicode data immediately following it. So a UTF-8 file with the content AB contains
five hexadecimal values (3 for the BOM, 2 for the letters):
EF BB BF 41 42
If a text file has none of these signatures, it is generally considered as an ASCII text
file, but it might as well contain text with any encoding.
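
The signatures in the table can be checked against the first bytes of a file with a few comparisons. The following is only a sketch of that idea, not the way the Object Pascal streaming classes do it: the program name and the StartsWith and DetectBom functions are hypothetical names introduced for this example.

```pascal
program BomDetect;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: True if Data begins with the bytes listed in Sig
function StartsWith(const Data: TBytes; const Sig: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := Length(Data) >= Length(Sig);
  if Result then
    for I := 0 to High(Sig) do
      if Data[I] <> Sig[I] then
        Exit(False);
end;

// Hypothetical helper: names the format indicated by the leading bytes
function DetectBom(const Data: TBytes): string;
begin
  // The 4-byte signatures are tested first, because the UTF-32
  // little-endian BOM (FF FE 00 00) also begins with FF FE
  if StartsWith(Data, [$00, $00, $FE, $FF]) then
    Result := 'UTF-32, big-endian'
  else if StartsWith(Data, [$FF, $FE, $00, $00]) then
    Result := 'UTF-32, little-endian'
  else if StartsWith(Data, [$FE, $FF]) then
    Result := 'UTF-16, big-endian'
  else if StartsWith(Data, [$FF, $FE]) then
    Result := 'UTF-16, little-endian'
  else if StartsWith(Data, [$EF, $BB, $BF]) then
    Result := 'UTF-8'
  else
    Result := 'no BOM';
end;

begin
  Writeln(DetectBom(TBytes.Create($EF, $BB, $BF, $41, $42)));  // UTF-8
  Writeln(DetectBom(TBytes.Create($41, $42)));                 // no BOM
end.
```

The order of the tests matters. A 'no BOM' result says nothing about the actual encoding, which still has to be known or guessed by other means.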

note On the other hand, when you are receiving data from a web request or through other Internet pro-
tocols, you might have a specific header (part of the protocol) indicating the encoding, rather than
relying on a BOM.


Looking at Unicode
How do we create a table of Unicode characters like those I displayed earlier for
ASCII ones? We can start by displaying code points in the Basic Multilingual Plane
(BMP), excluding what are called surrogate pairs.

note Not all numeric values are true UTF-16 code points, since there are some non-valid numerical val-
ues for characters (called surrogates) used to form a paired code and represent code points above
65535. A good example of a surrogate pair is the symbol used in music scores for the F (or bass)
clef, 𝄢. It is code point 1D122 which is represented in UTF-16 by two values, D834 followed by
DD22.
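
The two values in the note can be derived with a little arithmetic: subtract $10000 from the code point, then add the top 10 bits of the result to $D800 and the bottom 10 bits to $DC00. The sketch below applies this to the bass clef. The program name and the ToSurrogates procedure are hypothetical, written only for this illustration.

```pascal
program SurrogateDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: splits a code point above $FFFF into the two
// UTF-16 values (surrogates) that represent it
procedure ToSurrogates(CodePoint: Cardinal; out HighSur, LowSur: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;                // a 20-bit value
  HighSur := Word($D800 + (Offset shr 10));    // top 10 bits
  LowSur := Word($DC00 + (Offset and $3FF));   // bottom 10 bits
end;

var
  H, L: Word;
begin
  ToSurrogates($1D122, H, L);                     // F (bass) clef
  Writeln(IntToHex(H, 4), ' ', IntToHex(L, 4));   // D834 DD22
end.
```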

Displaying all of the elements of the BMP would require a 256 * 256 grid, hard to
accommodate on screen. This is why the ShowUnicode application project has a tab
control with two pages: the first page has the primary selector with 256 blocks, while the
second page shows a grid with the actual Unicode elements, one section at a time. This
program has a little more of a user interface than most others in the book, and you
can simply skim through its code if you are only interested in its output (and not the
internals).
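
The split into 256 blocks of 256 elements is plain integer arithmetic: block I covers code points I * 256 to I * 256 + 255, and the surrogate values $D800 to $DFFF fall into blocks 216 to 223. The following sketch is not part of the ShowUnicode project; the program name and the BlockOf and IsSurrogateBlock functions are hypothetical, introduced only to show the calculation.

```pascal
program BlockDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: index (0..255) of the block holding a BMP code point
function BlockOf(CodePoint: Integer): Integer;
begin
  Result := CodePoint div 256;
end;

// Hypothetical helper: True for the blocks that hold only surrogate values
function IsSurrogateBlock(Block: Integer): Boolean;
begin
  Result := (Block >= BlockOf($D800)) and (Block <= BlockOf($DFFF));
end;

begin
  // Blocks occupied by the surrogates
  Writeln(BlockOf($D800), '..', BlockOf($DFFF));   // 216..223

  // First and last code point of block 32
  Writeln(IntToHex(32 * 256, 4), '-', IntToHex(32 * 256 + 255, 4));  // 2000-20FF

  // The euro sign ($20AC) is in block 32, not a surrogate block
  Writeln(BlockOf($20AC));                         // 32
  Writeln(IsSurrogateBlock(BlockOf($20AC)));       // FALSE
end.
```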
When the program starts, it fills the ListView control in the first page of the TabCon-
trol with 256 entries, each indicating the first and last character of a group of 256.
Here is the actual code of the OnCreate event handler of the form and a simple func-
tion used to display each element, while the corresponding output is in Figure 6.3:
// helper function
function GetCharDescr (nChar: Integer): string;
begin
  if Char(nChar).IsControl then
    Result := 'Char #' + IntToStr (nChar) + ' [ ]'
  else
    Result := 'Char #' + IntToStr (nChar) +
      ' [' + Char (nChar) + ']';
end;

procedure TForm2.FormCreate(Sender: TObject);
var
  I: Integer;
  ListItem: TListViewItem;
begin
  for I := 0 to 255 do // 256 pages * 256 characters each
  begin
    ListItem := ListView1.Items.Add;
    ListItem.Tag := I;
    if (I < 216) or (I > 223) then
      ListItem.Text :=
        GetCharDescr(I*256) + '/' + GetCharDescr(I*256+255)
    else
