

Unicode and You


I’m a Unicode newbie. But like many newbies, I had an urge to learn once my interest was piqued by an introduction to Unicode (http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html).

Unicode isn’t hard to understand, but it does cover some low-level CS concepts, like byte order (http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/). Reading about Unicode is a nice lesson in design tradeoffs and backwards compatibility.

My thoughts are below. Read them alone, or as a follow-up to Joel’s unicode article above. If you’re like me, you’ll
get an itch to read about the details in the Unicode specs or in Wikipedia. Really, it can be cool, I swear.

Key concepts
Let’s level set on some ideas:

Ideas and data are different. The concept of “A” is something different than marks on paper, the sound
“aaay” or the number 65 stored inside a computer.

One idea has many possible encodings. An encoding is just a method to transform an idea (like the letter
“A”) into raw data (bits and bytes). The idea of “A” can be encoded many different ways. Encodings differ in
efficiency and compatibility.

Know thy encoding. When reading data, you must know the encoding used in order to interpret it properly. This is a simple but important concept. If you see the number 65 in binary, what does it really mean? “A” in ASCII? Your age? Your IQ? Unless there is some context, you’d never know. Imagine if someone came up to you and said “65”. You’d have no idea what they were talking about. Now imagine they came up and said “The following number is an ASCII character: 65”. Weird, yes, but see how much clearer it is?
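
A minimal sketch of this idea in Python (Python is just my choice for illustration here, not something from the original article): the same raw byte is meaningless until you pick an interpretation for it.

    raw = bytes([65])                      # the value 65, stored as a single byte
    print(raw.decode("ascii"))             # read as ASCII text: prints "A"
    print(int.from_bytes(raw, "big"))      # read as a plain integer: prints 65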

Embrace the philosophy that a concept and the data that stores it are different. Let it rustle around in your mind…

Got it? Let’s dive in.

Back to ASCII and Code Pages


You’ve probably heard of the ASCII/ANSI character sets. They map the numeric values 0-127 to various Western characters and control codes (newline, tab, etc.). Note that values 0-127 fit in the lower 7 bits of an 8-bit byte. ASCII does not explicitly define what values 128-255 map to.

Now, ASCII encoding works great for English text (using Western characters), but the world is a big place. What about Arabic, Chinese and Hebrew?

To solve this, computer makers defined “code pages” that used the undefined space from 128-255 in ASCII, mapping it to various characters they needed. Unfortunately, 128 additional characters aren’t enough for the entire world: code pages varied by country (Russian code page, Hebrew code page, etc.).

If people with the same code page exchanged data, all was good. Character #200 on my machine was the same as character #200 on yours. But if code pages mixed (Russian sender, Hebrew receiver), things got strange.

The character mapped to #200 was different in Russian and Hebrew, and you can imagine the confusion that caused for things like email and birthday invitations. It’s a big “if” whether someone will read your message using the same code page you used to author it. If you visit an international website, for example, your browser could try to guess the code page if it was not specified (“Hrm… this text has a lot of character #213 and #218… probably Hebrew”). But clearly this method was error-prone: code pages needed to be rescued.
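
Here is a hedged Python sketch of that mix-up. I picked byte value 228 rather than the #200 of the example above because it is a printable letter in both code pages; the same byte decodes to entirely different characters depending on which code page you assume.

    data = bytes([228])              # one byte, sent with no note about its code page
    print(data.decode("cp1251"))     # Russian code page (Windows-1251): prints "д"
    print(data.decode("cp1255"))     # Hebrew code page (Windows-1255):  prints "ה"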

Unicode to the Rescue


The world had a conundrum: nobody could agree on what numbers mapped to what letters beyond ASCII. The Unicode group went back to the basics: letters are abstract concepts. Unicode labeled each abstract character with a “code point”. For example, “A” mapped to code point U+0041 (the code point is written in hex; it is 65 in decimal).

The Unicode group did the hard work of mapping each character in every language to some code point (not without fierce debate, I am sure). When all was done, the Unicode standard left room for over 1 million code points, enough for all known languages with room to spare for undiscovered civilizations. For fun, you can browse the code points with the charmap utility (Start Menu > Run > charmap) or online at Unicode.org.
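
If charmap isn’t handy, Python can show the same mapping between characters and code points; a small sketch:

    import unicodedata

    print(hex(ord("A")))              # 0x41 -> "A" is code point U+0041
    print(chr(0x41))                  # back from the code point to the character "A"
    print(unicodedata.name("£"))      # POUND SIGN -- the official name of U+00A3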

This brings us to our first design decision: compatibility.

For compatibility with ASCII, code points U+0000 to U+007F (0-127) were the same as ASCII. Purists probably didn’t like this, because the full Latin character sets were defined elsewhere, and now one letter had two code points. Also, this put Western characters “first”, whereas Chinese, Arabic and the “nonstandard” languages were stuck in the less sleek code points that require 2 bytes to store.

However, this design was necessary – ASCII was a standard, and if Unicode was to be adopted by the Western world it needed to be compatible, without question. Now, the majority of common languages fit into the first 65,535 code points, which can be stored as 2 bytes.

Phew. The world was a better place, and everyone agreed on what codepoint mapped to what character.

But the question remained: How do we store a codepoint as data?

Encoding to the Rescue


From above, encoding turns an idea into raw data. In this case, the idea is a codepoint.

For example, let’s look at the ASCII “encoding” scheme to store Unicode code points. The rules are pretty simple:

Code points from U+0000 to U+007F are stored in a single byte

Code points above U+007F are dropped on the floor, never to be seen again

Simple, right?

As you can see, ASCII isn’t great for storing Unicode – in fact, it ignores most Unicode code points altogether. If you have a Unicode document and save it as ASCII -wham- all your special characters are gone. You’ll often see a warning to this effect in text editors when you save Unicode data in a file originally saved as ASCII.

But the example has a purpose. An encoding is a system to convert an idea into data. In this case, the conversion
can be politely called “lossy”.
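
You can watch the loss happen in Python (a sketch; the error-handling modes are Python’s, not part of ASCII itself): strict encoding refuses non-ASCII code points, and the “ignore” mode drops them on the floor, just as described.

    text = "Héllo"                                    # "é" is code point U+00E9, above U+007F

    try:
        text.encode("ascii")                          # strict: refuses rather than guess
    except UnicodeEncodeError as err:
        print("cannot encode:", err)

    print(text.encode("ascii", errors="ignore"))      # b'Hllo' -- the "é" is simply gone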

I did Unicode experiments with Notepad (which can read and write Unicode) and Programmer’s Notepad, which has a hex view. I wanted to see the raw bytes that Notepad was saving. To try the examples for yourself:

Open Notepad and type “Hello”

Save the file separately as ANSI, Unicode, Unicode Big Endian, and UTF-8
Open each file with Programmer’s Notepad and do View > View Hex
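
If you’d rather not open a hex view by hand, here is a rough Python equivalent of the experiment (the codec names are Python’s; Notepad’s “Unicode” is little-endian UTF-16 with a BOM, which the "utf-16" codec approximates on a little-endian machine):

    # Encode "Hello" the way the different Notepad options roughly would,
    # then dump the raw bytes in hex.
    for name in ("ascii", "utf-16", "utf-16-be", "utf-8-sig"):
        data = "Hello".encode(name)
        print(f"{name:10} {data.hex(' ')}")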

All about ASCII


Let’s write “Hello” in Notepad, save it as ANSI (ASCII) and open it in a hex editor. It looks like this:

Byte:   48 65 6C 6C 6F
Letter: H  e  l  l  o

ASCII is important because many tools and communication protocols only accept ASCII characters. It’s a generally accepted minimum bar for text. Because of its universal acceptance, some Unicode encodings transform code points into sequences of ASCII characters so they can be transmitted without issue.

Now, in the example above, we know the data is text because we authored it. If we randomly found the file, we could assume it was ASCII text given its contents, but for all we know it might be an account number or other data that just happens to look like “Hello” in ASCII.

Usually, we can make a good guess about what data is supposed to be, based on headers or “magic numbers” (special character sequences) that appear in well-known places. But you can never be sure, and sometimes you will guess wrong.

Don’t believe me? OK, do the following:

Open Notepad
Write “this program can break”
Save the file as “blah.txt” (or anything else)
Reopen the file in Notepad

Wow… whoa… what happened? I’ll leave this as an exercise for the reader.

UCS-2 / UTF-16
This is the encoding I first thought of when I heard “Unicode” – store every character as 2 bytes (what a waste!). At a base level, this can handle code points 0x0000 to 0xFFFF, or 0-65535 for you humans out there. And 65,535 should be enough characters for anybody (there are ways to store code points above 65535, but read the spec for more details).
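
A quick check of the “2 bytes per character” claim, using Python’s UTF-16 codecs as a stand-in for UCS-2 (they agree for code points up to U+FFFF):

    # Every code point up to U+FFFF takes exactly two bytes.
    for ch in ("A", "é", "€"):                       # U+0041, U+00E9, U+20AC
        print(ch, ch.encode("utf-16-be").hex(" "))   # 00 41, 00 e9, 20 ac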

Storing data in multiple bytes leads to my favorite conundrum: byte order! Some computers store the little byte
first, others the big byte.

To resolve the problem, we can do the following:

Option 1: Choose a convention that says all text data must be big- or little-endian. This won’t happen – computers on the wrong side of the decision would suffer a penalty every time they opened a file, since they would have to convert it to their native byte order.
Option 2: Everyone agrees to a byte order mark (BOM), a header at the top of each file. If you open
a file and the BOM is backwards, it means it was encoded in a different byte order and needs to be
converted.

The solution was the BOM header: UCS-2 encodings could write code point U+FEFF as a file header. If you open a UCS-2 string and see FEFF, the data is in the right byte order and can be used directly. If you see FFFE, the data came from another type of machine and needs to be converted to your architecture. This involves swapping each pair of bytes in the file.
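
A simplified sketch of what a reader might do with those first two bytes (real decoders handle more cases, and the helper name here is my own):

    def decode_ucs2(data: bytes) -> str:
        """Pick a byte order from the BOM, strip it, and decode (simplified)."""
        if data[:2] == b"\xff\xfe":
            return data[2:].decode("utf-16-le")    # bytes already match a little-endian reader
        if data[:2] == b"\xfe\xff":
            return data[2:].decode("utf-16-be")    # written by a big-endian machine
        return data.decode("utf-16-le")            # no BOM: we have to guess

    print(decode_ucs2(b"\xff\xfe\x48\x00\x65\x00\x6c\x00\x6c\x00\x6f\x00"))   # Hello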

But unfortunately, things are not that simple. The BOM is actually a valid Unicode character – what if someone
sent a file without a header, and that character was actually part of the file?

This is an open issue in Unicode. The suggestion is to avoid U+FEFF except for headers, and use alternative
characters instead (there are equivalents).

This opens up design observation #2: Multi-byte data will have byte order issues (http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/)!

ASCII never had to worry about byte order – each character was a single byte, and could not be misinterpreted. But realistically, if you see the bytes FE FF or FF FE at the start of a file, there’s a good chance it’s a BOM in a Unicode text file. It’s probably an indication of byte order. Probably.

(Aside: UCS-2 stores data in flat 16-bit chunks. UTF-16 allows up to 20 bits split between two 16-bit units, known as a surrogate pair. Each half of the surrogate pair is an invalid Unicode character by itself, but together a valid one can be extracted.)
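
To see a surrogate pair, pick a code point above U+FFFF; a small sketch using the musical G clef (U+1D11E) as the example:

    clef = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF, well above U+FFFF
    print(clef.encode("utf-16-be").hex(" "))   # d8 34 dd 1e -> the surrogate pair D834 DD1E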

UCS-2 Example
Type “Hello” in Notepad and save it as Unicode (little-endian UCS-2 is the native format on Windows):

Hello, little-endian:
Byte:   FF FE   48 00  65 00  6C 00  6C 00  6F 00
Letter: header  H      e      l      l      o

Save it again as Unicode Big Endian, and you get:

Hello, big-endian:
Byte:   FE FF   00 48  00 65  00 6C  00 6C  00 6F
Letter: header  H      e      l      l      o

Observations

The header BOM (U+FEFF) shows up as expected: FF FE for little-endian, FE FF for big-endian
Letters use 2 bytes no matter what: “H” is 0x48 in ASCII, and 0x0048 in UCS-2
Encoding is simple. Take the code point in hex and write it out in 2 bytes. No extra processing is required.
The encoding is too simple. It wastes space for plain ASCII text that does not use the high-order byte. And ASCII text is very common.
The encoding inserts null bytes (0x00), which can be a problem. Old-school ASCII programs may think the Unicode string has ended when they hit the null byte. On a little-endian machine, reading one byte at a time, you’d get to H (H = 0x4800) and then hit the null and stop. On a big-endian machine, you’d hit the null first (H is 0x0048) and not even see the H in ASCII. Not good (see the sketch below).
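
Here is a sketch of that failure, simulating an old C-style “stop at the first zero byte” program in Python:

    def c_style_read(data: bytes) -> bytes:
        """Pretend to be an old ASCII program that treats 0x00 as end-of-string."""
        end = data.find(b"\x00")
        return data if end == -1 else data[:end]

    little = "Hello".encode("utf-16-le")    # 48 00 65 00 6c 00 6c 00 6f 00
    big = "Hello".encode("utf-16-be")       # 00 48 00 65 00 6c 00 6c 00 6f

    print(c_style_read(little))             # b'H' -- stops right after the H
    print(c_style_read(big))                # b''  -- stops before seeing anything at all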

Design observation #3: Consider backwards compatibility. How will an old program read new data? Ignoring new data is good. Breaking on new data is bad.

UTF-8
UCS-2 / UTF-16 is nice and simple, but boy does it waste some bits. Not only does it double the size of ASCII text, but the converted ASCII might not even be readable because of the null characters.

Enter UTF-8. Its goal is to encode Unicode characters in a single byte where possible (ASCII), and to avoid breaking ASCII applications with null characters. It is the default encoding for XML.

Read the UTF-8 specs for more detail, but at a high level:

Code points U+0000 – U+007F are stored as regular, single-byte ASCII.

Code points U+0080 and above are converted to binary and stored (encoded) in a series of bytes.
The first “count” byte indicates the number of bytes for the code point, including the count byte. These bytes start with 11…0:

110xxxxx (11 -> 2 bytes in sequence, including “count” byte)

1110xxxx (1110 -> 3 bytes in sequence)

11110xxx (11110 -> 4 bytes in sequence)

Bytes starting with 10… are “data” bytes and contain information for the code point. A 2-byte example looks like this:

110xxxxx 10xxxxxx

This means there are 2 bytes in the sequence. The x’s represent the binary value of the code point, which needs to squeeze into the remaining bits.
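
A small sketch that checks the 2-byte pattern against a real character, the pound sign (U+00A3):

    b1, b2 = "£".encode("utf-8")                   # two bytes: 0xC2 0xA3
    print(f"{b1:08b} {b2:08b}")                    # 11000010 10100011

    # Reassemble the code point from the x-bits: 5 from the count byte, 6 from the data byte.
    codepoint = ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)
    print(hex(codepoint))                          # 0xa3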

Observations about UTF-8

No null bytes. All ASCII characters (0-127) are stored the same way. Non-ASCII characters all start with “1” as the highest bit.
ASCII text is stored identically and efficiently.
Unicode characters start with “1” as the high bit, and can be ignored by ASCII-only programs (however, they may be discarded in some cases! See UTF-7 for more details).
There is a time-space tradeoff. There is processing to be done on every Unicode character, but this is a
reasonable tradeoff.

Design principle #4

UTF-8 addresses the 80% case well (ASCII), while making the other cases possible (Unicode). UCS-2 addresses all cases equally, but is inefficient in the 80% case in order to cover the rest. On the other hand, UCS-2 is less processing-intensive than UTF-8, which requires bit manipulation on all Unicode characters.
Why does XML store data in UTF-8 instead of UCS-2? Is space or processing power more important when reading XML documents?
Why does Windows XP store strings as UCS-2 natively? Is space or processing power more important for the OS internals?

In any case, UTF-8 still needs a header to indicate how the text was encoded. Otherwise, it could be interpreted as straight ASCII with some code page to handle values above 127. It still uses the U+FEFF code point as a BOM, but the BOM itself is encoded in UTF-8 (clever, eh?).

UTF-8 Example
Hello, UTF-8:
Byte:   EF BB BF  48 65 6C 6C 6F
Letter: header    H  e  l  l  o

Again, the ASCII text is not changed in UTF-8. Feel free to use charmap to copy in some Unicode characters and see how they are stored in UTF-8. Or, you can experiment online.
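
In Python terms, the BOM-prefixed UTF-8 that Notepad writes corresponds to the utf-8-sig codec; a quick check:

    print("Hello".encode("utf-8-sig").hex(" "))    # ef bb bf 48 65 6c 6c 6f -- BOM then plain ASCII
    print("Hello".encode("utf-8").hex(" "))        # 48 65 6c 6c 6f -- no BOM, identical to ASCII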

UTF-7
While UTF-8 is great for ASCII, it still stores Unicode data as non-ASCII bytes with the high bit set. Some email protocols do not allow non-ASCII values, so UTF-8 data would not be sent properly. Systems that can handle data with anything in the high bit are “8-bit clean”; systems that require data to have values 0-127 (like SMTP) are not. So how do we send Unicode data through them?

Enter UTF-7. The goal is to encode Unicode data using only 7 bits (0-127), which is compatible with ASCII. UTF-7 works like this:

Code points in the ASCII range are stored as ASCII, except for certain symbols (+, -) that have special meaning
Code points above the ASCII range are converted to binary and stored in base64 encoding (which stores binary information as ASCII characters)

How do you know which ASCII letters are real ASCII, and which are base64 encoded? Easy. ASCII characters between the special symbols “+” and “-” are considered base64 encoded.

“-” acts like an escape suffix character. If it follows a character, that item is interpreted literally. So, “+-“ is
interpreted as “+” without any special encoding. This is how you store an actual “+” symbol in UTF-7.

UTF-7 Example
Wikipedia has some UTF-7 examples, as Notepad can’t save as UTF-7.

“Hello” is the same as ASCII – we are using all ASCII characters and no special symbols:

Byte:   48 65 6C 6C 6F
Letter: H  e  l  l  o

“£1” (1 British pound) becomes:


+AKM-1

The sequence “+AKM-” means AKM should be decoded in base64 and converted to a code point, which maps to 0x00A3, the British pound symbol. The “1” is kept the same, since it is an ASCII character.
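
Python ships a utf_7 codec, so you can reproduce this without Notepad (a sketch; exactly when an encoder emits the closing “-” can vary between implementations, but this matches the example above):

    print("£1".encode("utf-7"))        # b'+AKM-1' -- "£" becomes the base64 run +AKM-
    print("+".encode("utf-7"))         # b'+-'     -- a literal plus sign is escaped
    print(b"+AKM-1".decode("utf-7"))   # £1        -- and back again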

UTF-7 is pretty clever, eh? It’s essentially a Unicode-to-ASCII conversion that avoids emitting any bytes with the highest bit set. Most ASCII characters look the same, except for the special characters (- and +) that need to be escaped.

Wrapping it up – what I’ve learned


I’m still a newbie but have learned a few things about Unicode:

Unicode does not mean 2 bytes. Unicode defines code points that can be stored in many different ways (UCS-2, UTF-8, UTF-7, etc.). Encodings vary in simplicity and efficiency.
Unicode has more than 65,535 (16 bits) worth of characters. Encodings can specify more characters, but the first 65,535 cover most of the common languages.
You need to know the encoding to correctly read a file. You can often guess that a file is Unicode based on the Byte Order Mark (BOM), but confusion can still arise unless you know the exact encoding. Even text that looks like ASCII could actually be encoded with UTF-7; you just don’t know.

Unicode is an interesting study. It opened my eyes to design tradeoffs, and the importance of separating the core
idea from the encoding used to save it.
