What every programmer absolutely, positively needs to know about
encodings and character sets to work with text
If you are dealing with text in a computer, you need to know about encodings. Period. Yes, even if you are just sending emails. Even if you are just receiving emails. You don't need to understand every last detail, but you must at least know what this whole "encoding" thing is about. And the good news first: while the topic can get messy and confusing, the basic idea is really, really simple.
This article is about encodings and character sets. An article by Joel Spolsky entitled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a nice introduction to the topic and I greatly enjoy reading it every once in a while. I hesitate to refer people to it who have trouble understanding encoding problems though since, while entertaining, it is pretty light on actual technical details. I hope this article can shed some more light on what exactly an encoding is and just why all your text screws up when you least need it. This article is aimed at developers (with a focus on PHP), but any computer user should be able to benefit from it.
Getting the basics straight
Everybody is aware of this at some level, but somehow this knowledge seems to suddenly disappear in a
discussion about text, so let's get it out first: A computer cannot store "letters", "numbers", "pictures" or anything
else. The only thing it can store and work with are bits. A bit can only have two values: yes or no, true or false,
1 or 0 or whatever else you want to call these two values. Since a computer works with electricity, an "actual" bit is a blip of electricity that either is or isn't there. For humans, this is usually represented using 1 and 0 and I'll stick with this convention throughout this article.
To use bits to represent anything at all besides bits, we need rules. We need to convert a sequence of bits into
something like letters, numbers and pictures using an encoding scheme, or encoding for short. Like this:
01100010 01101001 01110100 01110011
b        i        t        s
In this encoding, 01100010 stands for the letter "b", 01101001 for the letter "i", 01110100 stands for "t" and 01110011 for "s". A certain sequence of bits stands for a letter and a letter stands for a certain sequence of bits. If
you can keep this in your head for 26 letters or are really fast with looking stuff up in a table, you could read bits
like a book.
The above encoding scheme happens to be ASCII. A string of 1s and 0s is broken down into parts of eight bits each (a byte for short). The ASCII encoding specifies a table translating bytes into human readable letters. Here's a short excerpt of that table:
bits character
01000001 A
01000010 B
01000011 C
01000100 D
01000101 E
01000110 F
There are 95 human readable characters specified in the ASCII table, including the letters A through Z both in upper and lower case, the numbers 0 through 9, a handful of punctuation marks and characters like the dollar symbol, the ampersand and a few others. It also includes 33 values for things like space, line feed, tab, backspace and so on. These are not printable per se, but still visible in some form and useful to humans directly. A number of values are only useful to a computer, like codes to signify the start or end of a text. In total there are 128 characters defined in the ASCII encoding, which is a nice round number (for people dealing with computers), since it uses all possible combinations of 7 bits (0000000, 0000001, 0000010 through 1111111).
And there you have it, the way to represent human-readable text using only 1s and 0s.
01001000 01100101 01101100 01101100 01101111 00100000
01010111 01101111 01110010 01101100 01100100
"Hello World"
Important terms
To encode something in ASCII, follow the table from right to left, substituting letters for bits. To decode a string of bits into human readable characters, follow the table from left to right, substituting bits for letters.
encode |enˈkōd|
verb [ with obj. ]
convert into a coded form
code |kōd|
noun
a system of words, letters, figures, or other symbols substituted for other words, letters, etc.
To encode means to use something to represent something else. An encoding is the set of rules with which to
convert something from one representation to another.
Other terms which deserve clarification in this context:
character set, charset
The set of characters that can be encoded. "The ASCII encoding encompasses a character set of 128 characters." Essentially synonymous to "encoding".
code page
A "page" of codes that map a character to a number or bit sequence. A.k.a. "the table". Essentially synonymous to "encoding".
string
A string is a bunch of items strung together. A bit string is a bunch of bits, like 01010011. A character string is a bunch of characters, like this. Synonymous to "sequence".
Binary, octal, decimal, hex
There are many ways to write numbers. 10011111 in binary is 237 in octal is 159 in decimal is 9F in hexadecimal. They all represent the same value, but hexadecimal is shorter and easier to read than binary. I will stick with binary throughout this article to get the point across better and spare the reader one layer of abstraction. Do not be alarmed to see character codes referred to in other notations elsewhere, it's all the same thing.
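If you want to check this yourself, PHP's printf can write the same value out in all four notations:
printf("%b\n", 159);   // 10011111 (binary)
printf("%o\n", 159);   // 237 (octal)
printf("%d\n", 159);   // 159 (decimal)
printf("%X\n", 159);   // 9F (hexadecimal)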
Excusez-moi?
Now that we know what we're talking about, let's just say it: 95 characters really isn't a lot when it comes to languages. It covers the basics of English, but what about writing a risqué letter in French? A Straßenübergangsänderungsgesetz in German? An invitation to a smörgåsbord in Swedish? Well, you couldn't. Not in ASCII. There's no specification on how to represent any of the letters é, ß, ü, ä, ö or å in ASCII, so you can't use them.
"But look at it," the Europeans said, "in a common computer with 8 bits to the byte, ASCII is wasting an entire
bit which is always set to @! We can use that bit to squeeze a whole ‘nother 128 values into that table!" And so
they did. But even so, there are more than 128 ways to stroke, slice, slash and dot a vowel. Not all variations of
letters and squiggles used in all European languages can be represented in the same table with a maximum of
256 values. So what the world ended up with is a wealth of encoding schemes, standards, de-facto standards and
half-standards that all cover a different subset of characters. Somebody needed to write a document about
Swedish in Czech, found that no encoding covered both languages and invented one. Or so I imagine it went
countless times over.
And not to forget about Russian, Hindi, Arabic, Hebrew, Korean and all the other languages currently in active use on this planet. Not to mention the ones not in use anymore. Once you have solved the problem of how to write mixed language documents in all of these languages, try yourself on Chinese. Or Japanese. Both contain tens of thousands of characters. You have 256 possible values to a byte consisting of 8 bits. Go!
Multi-byte encodings
To create a table that maps characters to letters for a language that uses more than 256 characters, one byte simply isn't enough. Using two bytes (16 bits), it's possible to encode 65,536 distinct values. BIG-5 is such a double-byte encoding. Instead of breaking a string of bits into blocks of eight, it breaks it into blocks of 16 and has a big (I mean, BIG) table that specifies which character each combination of bits maps to. BIG-5 in its basic form covers mostly Traditional Chinese characters. GB18030 is another encoding which essentially does the same thing, but includes both Traditional and Simplified Chinese characters. And before you ask, yes, there are encodings which cover only Simplified Chinese. Can't just have one encoding now, can we?
Here's a small excerpt from the GB18030 table:
bits character
10000001 01000000 丂
10000001 01000001 丄
10000001 01000010 丅
10000001 01000011 丆
10000001 01000100 丏
GB18030 covers quite a range of characters (including a large part of Latin characters), but in the end is yet another specialized encoding format among many.
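As a rough illustration of how the byte count changes from encoding to encoding, here is a small sketch using iconv; it assumes the local iconv build supports GB18030, which is usually the case, and that this file is saved as UTF-8:
$text = "漢字";
echo strlen($text);                                // 6 bytes as UTF-8
echo strlen(iconv('UTF-8', 'GB18030', $text));     // 4 bytes as GB18030 (two bytes per character here)
echo strlen(iconv('UTF-8', 'UTF-32BE', $text));    // 8 bytes as UTF-32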
Unicode to the confusion
[Excerpt of the Unicode character table: the Latin Extended-B and IPA Extensions blocks.]
Finally somebody had enough of the mess and set out to forge a ring to bind them all, er, create one encoding standard to unify all encoding standards. This standard is Unicode. It basically defines a ginormous table of 1,114,112 code points that can be used for all sorts of letters and symbols. That's plenty to encode all European, Middle-Eastern, Far-Eastern, Southern, Northern, Western, pre-historian and future characters mankind knows about.² Using Unicode, you can write a document containing virtually any language using any character you can type into a computer. This was either impossible or very very hard to get right before Unicode came along. There's even an unofficial section for Klingon in Unicode. Indeed, Unicode is big enough to allow for unofficial, private-use areas.
So, how many bits does Unicode use to encode all these characters? None. Because Unicode is not an encoding.
Confused? Many people seem to be. Unicode first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A, 66 stands for B and 9,731 stands for ☃" (seriously, it does). How these code points are actually encoded into bits is a different topic. To represent 1,114,112 different values, two bytes aren't enough. Three bytes are, but three bytes are often awkward to work with, so four bytes would be the comfortable minimum. But, unless you're actually using Chinese or some of the other characters with big numbers that take a lot of bits to encode, you're never going to use a huge chunk of those four bytes. If the letter "A" was always encoded to 00000000 00000000 00000000 01000001, "B" always to 00000000 00000000 00000000 01000010 and so on, any document would bloat to four times the necessary size.
To optimize this, there are several ways to encode Unicode code points into bits. UTF-32 is such an encoding that encodes all Unicode code points using 32 bits. That is, four bytes per character. It's very simple, but often wastes a lot of space. UTF-16 and UTF-8 are variable-length encodings. If a character can be represented using a single byte (because its code point is a very small number), UTF-8 will encode it with a single byte. If it requires two bytes, it will use two bytes and so on. It has elaborate ways to use the highest bits in a byte to signal how many bytes a character consists of. This can save space, but may also waste space if these signal bits need to be used often. UTF-16 is in the middle, using at least two bytes, growing to up to four bytes as necessary.
character encoding bits
A UTF-8 01000001
A UTF-16 00000000 01000001
A UTF-32 00000000 00000000 00000000 01000001
あ UTF-8 11100011 10000001 10000010
あ UTF-16 00110000 01000010
あ UTF-32 00000000 00000000 00110000 01000010
And that's all there is to it. Unicode is a large table mapping characters to numbers and the different UTF
encodings specify how these numbers are encoded as bits. Overall, Unicode is yet another encoding scheme.
There's nothing special about it, it's just trying to cover everything while still being efficient. And that's A Good
Thing.™
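A small sketch to reproduce the table above in PHP, assuming this source file is saved as UTF-8; the bits() helper is made up here purely to print bytes as 1s and 0s:
function bits($bytes) {
    $out = array();
    foreach (str_split($bytes) as $byte) {
        $out[] = sprintf('%08b', ord($byte));
    }
    return implode(' ', $out);
}
$a = "あ";                                     // code point U+3042
echo bits($a);                                 // 11100011 10000001 10000010 (UTF-8)
echo bits(iconv('UTF-8', 'UTF-16BE', $a));     // 00110000 01000010 (UTF-16)
echo bits(iconv('UTF-8', 'UTF-32BE', $a));     // 00000000 00000000 00110000 01000010 (UTF-32)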
Code points
Ḁ
LATIN CAPITAL LETTER A WITH RING BELOW
Unicode: U+1E00
UTF-8: E1 B8 80
Characters are referred to by their "Unicode code point". Unicode code points are written in hexadecimal (to keep the numbers shorter), preceded by a "U+" (that's just what they do, it has no other meaning than "this is a Unicode code point"). The character Ḁ has the Unicode code point U+1E00. In other (decimal) words, it is the 7680th character of the Unicode table. It is officially called "LATIN CAPITAL LETTER A WITH RING BELOW".
TL;DR
A summary of all the above: Any character can be encoded in many different bit sequences and any particular bit sequence can represent many different characters, depending on which encoding is used to read or write them. The reason is simply because different encodings use different numbers of bits per character and different values to represent different characters.
bits encoding characters
11000100 01000010 Windows Latin 1 ÄB
11000100 01000010 Mac Roman ƒB
11000100 01000010 GB18030 (a single Chinese character)
characters encoding bits
Føö Windows Latin 1 01000110 11111000 11110110
Føö Mac Roman 01000110 10111111 10011010
Føö UTF-8 01000110 11000011 10111000 11000011 10110110
Misconceptions, confusions and problems
Having said all that, we come to the actual problems experienced by many users and programmers every day, how those problems relate to all of the above and what their solution is. The biggest problem of all is:
Why in god's name are my characters garbled?!
ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ
If you open a document and it looks like this, there's one and only one reason for it: Your text editor, browser, word processor or whatever else that's trying to read the document is assuming the wrong encoding. That's all. The document is not broken (well, unless it is, see below), there's no magic you need to perform, you simply need to select the right encoding to display the document.
The hypothetical document above contains this sequence of bits:
10000011 01000111 10000011 10010011 10000011 01010010 10000001 01011011
10000011 01100110 10000011 01000010 10000011 10010011 10000011 01001111
10000010 11001101 10010011 11101111 10000010 10110101 10000010 10101101
10000010 11001000 10000010 10100010
Now, quick, what encoding is that? If you just shrugged, you'd be correct. Who knows, right?
Well, let's try to interpret this as ASCII. Hmm, most of these bytes start with a 1 bit. If you remember correctly, ASCII doesn't use that bit. So it's not ASCII. What about UTF-8? Hmm, no, most of these sequences are not valid UTF-8.⁴ So UTF-8 is out, too. Let's try "Mac Roman" (yet another encoding scheme for them Europeans). Hey, all those bytes are valid in Mac Roman. 10000011 maps to "É", 01000111 to "G" and so on. If you read this bit sequence using the Mac Roman encoding, the result is "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". That looks like a valid string, no? Yes? Maybe? Well, how's the computer to know? Maybe somebody meant to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". For all I know that could be a DNA sequence.⁵ Unless you have a better suggestion, let's declare this to be a DNA sequence, say this document was encoded in Mac Roman and call it a day.
hitpskunststube.nevencoding! ete9124122, 349 PM ‘what Every Programmer Absoluiely, Positively Needs to Know About Encodings and Character Sets to Work With Text
[A typical text-encoding selection menu, as offered by a text editor:]
✓ Default
Unicode (UTF-8)
Western (ISO Latin 1)
Western (Mac OS Roman)
Japanese (Shift JIS)
Japanese (ISO 2022-JP)
Japanese (EUC)
Japanese (Shift JIS X0213)
Traditional Chinese (Big 5)
Traditional Chinese (Big 5 HKSCS)
Traditional Chinese (Windows, DOS)
Korean (ISO 2022-KR)
Korean (Mac OS)
Korean (Windows, DOS)
Arabic (ISO 8859-6)
Arabic (Windows)
Hebrew (ISO 8859-8)
Hebrew (Windows)
Greek (ISO 8859-7)
Greek (Windows)
Of course, that unfortunately is complete nonsense. The correct answer is that this text is encoded in the Japanese Shift-JIS encoding and was supposed to read "エンコーディングは難しくない". Well, who'd've thunk?
The primary cause of garbled text is: Somebody is trying to read a byte sequence using the wrong encoding. The computer always needs to be told what encoding some text is in. Otherwise it can't know. There are different ways how different kinds of documents can specify what encoding they're in and these ways should be used. A raw bit sequence is always a mystery box and could mean anything.
Most browsers allow the selection of a different encoding in the View menu under the menu option "Text
Encoding", which causes the browser to reinterpret the current page using the selected encoding. Other programs
may offer something like "Reopen using encoding..." in the File menu, or possibly an "Import..." option which
allows the user to manually select an encoding.
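The flip side of this is declaring the encoding yourself whenever you produce text, so nobody downstream has to guess. For a PHP-generated web page, a minimal sketch looks like this:
header('Content-Type: text/html; charset=utf-8');   // tell the browser what's coming in the HTTP header
echo '<meta charset="utf-8">';                       // and repeat it inside the markup, for good measure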
My document doesn't make sense in any encoding!
If a sequence of bits doesn't make sense (to a human) in any encoding, the document has most likely been converted incorrectly at some point. Say we took the above text "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ" because we didn't know any better and saved it as UTF-8. The text editor assumed it correctly read a Mac Roman encoded text and you now want to save this text in a different encoding. All of these characters are valid Unicode characters after all. That is to say, there's a code point in Unicode that can represent "É", one that can represent "G" and so on. So we can happily save this text as UTF-8:
11000011 10001001 01000111 11000011 10001001 11000011 10101100 11000011
10001001 01010010 11000011 10000101 01011011 11000011 10001001 01100110
11000011 10001001 01000010 11000011 10001001 11000011 10101100 11000011
10001001 01001111 11000011 10000111 11000011 10010101 11000011 10101100
11000011 10010100 11000011 10000111 11000010 10110101 11000011 10000111
11100010 10001001 10100000 11000011 10000111 11000010 10111011 11000011
10000111 11000010 10100010
This is now the UTF-8 bit sequence representing the text "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". This bit sequence has absolutely nothing to do with our original document. Whatever encoding we try to open it in, we won't ever get the text "エンコーディングは難しくない" from it. It is completely lost. It would be possible to recover the original text from it if we knew that a Shift-JIS document was misinterpreted as Mac Roman and then accidentally saved as UTF-8 and reversed this chain of missteps. But that would be a lucky fluke.
Many times certain bit sequences are invalid in a particular encoding. If we tried to open the original document using ASCII, some bytes would be valid in ASCII and map to a real character and others wouldn't. The program you're opening it with may decide to silently discard any bytes that aren't valid in the chosen encoding, or possibly replace them with ?. There's also the "Unicode replacement character" (U+FFFD) which a program may decide to insert for any character it couldn't decode correctly when trying to handle Unicode. If a document is saved with some characters gone or replaced, then those characters are really gone for good with no way to reverse-engineer them.
If a document has been misinterpreted and converted to a different encoding, it's broken. Trying to "repair" it may or may not be successful, usually it isn't. Any manual bit-shifting or other encoding voodoo is mostly that, voodoo. It's trying to fix the symptoms after the patient has already died.
So how to handle encodings correctly?
It's really simple: Know what encoding a certain piece of text, that is, a certain byte sequence, is in, then interpret it with that encoding. That's all you need to do. If you're writing an app that allows the user to input some text, specify what encoding you accept from the user. For any sort of text field, the programmer can usually decide its encoding. For any sort of file a user may upload or import into a program, there needs to be a specification what encoding that file should be in. Alternatively, the user needs some way to tell the program what encoding the file is in. This information may be part of the file format itself, or it may be a selection the user has to make (not that most users would usually know, unless they have read this article).
If you need to convert from one encoding to another, do so cleanly using tools that are specialized for that. Converting between encodings is the tedious task of comparing two code pages and deciding that character 152 in encoding A is the same as character 4122 in encoding B, then changing the bits accordingly. This particular wheel does not need reinventing and any mainstream programming language includes some way of converting text from one encoding to another without needing to think about code points, pages or bits at all.
Say, your app must accept files uploaded in GB18030, but internally you are handling all data in UTF-32. A tool like iconv can cleanly convert the uploaded file with a one-liner like iconv("GB18030", "UTF-32", $string).
That is, it will preserve the characters while changing the underlying bits:
character GB18030 encoding UTF-32 encoding
縧 10110101 01101100 00000000 00000000 01111110 00100111
That's all there is to it. The content of the string, that is, the human readable characters, didn't change, but it's now a valid UTF-32 string. If you keep treating it as UTF-32, there's no problem with garbled characters. As discussed at the very beginning though, not all encoding schemes can represent all characters. It's not possible to encode the character "縧" in any encoding scheme designed for European languages; something Bad™ would happen if you tried to.
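What "something Bad" looks like in practice depends on the tool. With iconv, for example, the conversion either aborts with a notice, drops the character or substitutes a rough approximation; a sketch, assuming a common glibc-style iconv and the 縧 from above saved as UTF-8:
$string = "縧";
var_dump(iconv('UTF-8', 'ISO-8859-1//IGNORE', $string));    // string(0) "" - the character is silently dropped
var_dump(iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string));  // "?" or similar, depending on the iconv build
Either way, information is lost.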
Unicode all the way
Itpsihunststube nevencoaing! ane‘91242, 349 PM ‘What Every Programmer Absolitely, Pstively Needs to Know About Encodings and Cnaractr Sets to Work With Text
Precisely because of that, there's virtually no excuse in this day and age not to be using Unicode all the way. Some specialized encodings may be more efficient than the Unicode encodings for certain languages. But unless you're storing terabytes and terabytes of very specialized text (and that's a lot of text), there's usually no reason to worry about it. Problems stemming from incompatible encoding schemes are much worse than a wasted gigabyte or two these days. And this will become even truer as storage and bandwidth keeps growing larger and cheaper.
If your system needs to work with other encodings, convert them to Unicode upon input and convert them back
to other encodings on output as necessary. Otherwise, be very aware of what encodings you're dealing with at
which point and convert as necessary, if that's possible without losing any information.
Flukes
I have this website talking to a database. My app handles everything as UTF-8 and stores it as such in the database and everything works fine, but when I look at my database admin interface my text is garbled. - Anonymous code monkey
There are situations where encodings are handled incorrectly but things still work. An often-encountered situation is a database that's set to latin-1 and an app that works with UTF-8 (or any other encoding). Pretty much any combination of 1s and 0s is valid in the single-byte latin-1 encoding scheme. If the database receives text from an application that looks like 11100111 10111000 10100111, it'll happily store it, thinking the app meant to store the three latin characters "ç¸§". After all, why not? It then later returns this bit sequence back to the app, which will happily accept it as the UTF-8 sequence for "縧", which it originally stored. The database admin interface automatically figures out that the database is set to latin-1 though and interprets any text as latin-1, so all values look garbled only in the admin interface.
That's a case of fool's luck where things happen to work when they actually aren't. Any sort of operation on the text in the database may or may not work as intended, since the database is not interpreting the text correctly. In a worst case scenario, the database inadvertently destroys all text during some random operation two years after the system went into production because it was operating on text assuming the wrong encoding.⁶
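A sketch of what the admin interface in the quote is effectively doing: it takes bytes that are really UTF-8, believes they are latin-1 and re-encodes them to UTF-8 for display (assuming the mbstring extension and a source file saved as UTF-8):
$stored = "縧";                                            // the bytes 11100111 10111000 10100111, as stored by the app
echo mb_convert_encoding($stored, 'UTF-8', 'ISO-8859-1');  // ç¸§ - each byte was re-encoded as its own character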
UTF-8 and ASCII
The ingenious thing about UTF-8 is that it's binary compatible with ASCII, which is the de-facto baseline for all encodings. All characters available in the ASCII encoding only take up a single byte in UTF-8 and they're the exact same bytes as are used in ASCII. In other words, ASCII maps 1:1 unto UTF-8. Any character not in ASCII takes up two or more bytes in UTF-8. For most programming languages that expect to parse ASCII, this means you can include UTF-8 text directly in your programs:
$string = "漢字";
Saving this as UTF-8 results in this bit sequence:
00100100 01110011 01110100 01110010 01101001 01101110 01100111 00100000
00111101 00100000 00100010 11100110 10111100 10100010 11100101 10101101
10010111 00100010 00111011
Only bytes 12 through 17 (the ones starting with 1) are UTF-8 characters (two characters with three bytes each). All the surrounding characters are perfectly good ASCII. A parser would read this as follows:
$string = "11100110 10111100 10100010 11100101 10101101 10010111";
To the parser, anything following a quotation mark is just a byte sequence which it will take as-is until it encounters another quotation mark. If you simply output this byte sequence, you're outputting UTF-8 text. No need to do anything else. The parser does not need to specifically support UTF-8, it just needs to take strings literally. Naive parsers can support Unicode this way without actually supporting Unicode. Many modern languages are explicitly Unicode-aware though.
Encodings and PHP
This last section deals with issues surrounding Unicode and PHP. Some portions of it are applicable to
programming languages in general while others are PHP specific. Nothing new will be revealed about
encodings, but concepts described above will be rehashed in the light of practical application.
PHP doesn't natively support Unicode. Except it actually supports it quite well. The previous section shows how
UTF-8 characters can be embedded in any program directly without problems, since UTF-8 is backwards
compatible with ASCII, which is all PHP needs. The statement "PHP doesn't natively support Unicode" is true
though and it seems to cause a lot of confusion in the PHP community.
False promises
One specific pet-peeve of mine are the functions utf8_encode and utf8_decode. I often see nonsense along the lines of "To use Unicode in PHP you need to utf8_encode your text on input and utf8_decode on output". These two functions seem to promise some sort of automagic conversion of text to UTF-8 which is "necessary" since "PHP doesn't support Unicode". If you've been following this article at all though, you should know by now that
1. there's nothing special about UTF-8 and
2. you cannot encode text to UTF-8 after the fact
To clarify that second point: All text is already encoded in some encoding. When you type it into the source code, it has some encoding. Specifically, whatever you saved it as in your text editor. If you get it from a database, it's already in some encoding. If you read it from a file, it's already in some encoding.
Text is either encoded in UTF-8 or it's not. If it's not, it's encoded in ASCII, ISO-8859-1, UTF-16 or some other encoding. If it's not encoded in UTF-8 but is supposed to contain "UTF-8 characters",⁷ then you have a case of cognitive dissonance. If it does contain actual characters encoded in UTF-8, then it's actually UTF-8 encoded. Text can't contain Unicode characters without being encoded in one of the Unicode encodings.
So what in the world does utf8_encode do then?
"Encodes an ISO-8859-1 string to UTF-8"⁸
Aha! So what the author actually wanted to say is that it converts the encoding of text from ISO-8859-1 to UTF-8. That's all there is to it. utf8_encode must have been named by some European without any foresight and is a horrible, horrible misnomer. The same goes for utf8_decode. These functions are useless for any purpose other than converting between ISO-8859-1 and UTF-8. If you need to convert a string from any other encoding to any other encoding, look no further than iconv.
utf8_encode is not a magic wand that needs to be swung over any and all text because "PHP doesn't support Unicode". Rather, it seems to cause more encoding problems than it solves thanks to terrible naming and unknowing developers.
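In other words, utf8_encode is nothing more than a hard-coded special case of iconv, as this sketch illustrates:
$latin1 = "\xE9";                                                          // the byte for "é" in ISO-8859-1
var_dump(utf8_encode($latin1) === iconv('ISO-8859-1', 'UTF-8', $latin1));  // bool(true)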
Native-schmative
So what does it mean for a language to natively support or not support Unicode? It basically refers to whether
the language assumes that one character equals one byte or not. For example, PHP allows direct access to the
characters of a string using array notation:
echo $string[0];
If that $string was in a single-byte encoding, this would give us the first character. But only because "character" coincides with "byte" in a single-byte encoding. PHP simply gives us the first byte without thinking about "characters". Strings are byte sequences to PHP, nothing more, nothing less. All this "readable character" stuff is a human thing and PHP doesn't care about it.
01000100 01101111 01101110 00100111 01110100
D        o        n        '        t
01100011 01100001 01110010 01100101 00100001
c        a        r        e        !
The same goes for many standard functions such as substr, strpos, trim and so on. The non-support arises if there's a discrepancy between the length of a byte and a character.
11100110 10111100 10100010 11100101 10101101 10010111
漢                         字
Using $string[0] on the above string will, again, give us the first byte, which is 11100110. In other words, a third of the three-byte character "漢". 11100110 is, by itself, an invalid UTF-8 sequence, so the string is now broken. If you felt like it, you could try to interpret that in some other encoding where 11100110 represents a valid character, which will result in some random character. Have fun, but don't use it in production.
And that's actually all there is to it. "PHP doesn't natively support Unicode" simply means that most PHP functions assume one byte = one character, which may lead to it chopping multi-byte characters in half or calculating the length of strings incorrectly if you're naively using non-multi-byte-aware functions on multi-byte strings. It does not mean that you can't use Unicode in PHP or that every Unicode string needs to be blessed by utf8_encode or other such nonsense.
Luckily, there's the Multibyte String extension, which replicates all important string functions in a multi-byte aware fashion. Using mb_substr($string, 0, 1, 'UTF-8') on the above string correctly returns 11100110 10111100 10100010, which is the whole "漢" character. Because the mb_ functions now have to actually think about what they're doing, they need to know what encoding they're working on. Therefore every mb_ function accepts an $encoding parameter as well. Alternatively, this can be set globally for all mb_ functions using mb_internal_encoding.
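Here is the byte view and the character view side by side, assuming the mbstring extension is loaded and this file is saved as UTF-8:
mb_internal_encoding('UTF-8');       // set once so the $encoding parameter can be omitted below
$string = "漢字";
echo strlen($string);                // 6 - PHP counts bytes
echo mb_strlen($string);             // 2 - mbstring counts characters
echo $string[0];                     // one lone byte, 11100110, broken on its own
echo mb_substr($string, 0, 1);       // 漢 - the whole three-byte character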
Using and abusing PHP's handling of encodings
The whole issue of PHP's (non-)support for Unicode is that it just doesn't care. Strings are byte sequences to PHP. What bytes in particular doesn't matter. PHP doesn't do anything with strings except keeping them stored in memory. PHP simply doesn't have any concept of either characters or encodings. And unless it tries to manipulate strings, it doesn't need to either; it just holds onto bytes that may or may not eventually be interpreted as characters by somebody else. The only requirement PHP has of encodings is that PHP source code needs to be saved in an ASCII compatible encoding. The PHP parser is looking for certain characters that tell it what to do. $ (00100100) signals the start of a variable, = (00111101) an assignment, " (00100010) the start and end
of a string and so on. Anything else that doesn't have any special significance to the parser is just taken as a
literal byte sequence. That includes anything between quotes, as discussed above. This means the following:
1. You can't save PHP source code in an ASCII-incompatible encoding. For example, in UTF-16 a " is encoded as 00000000 00100010. To PHP, which tries to read everything as ASCII, that's a NUL byte followed by a ". PHP will probably get a hiccup if every other character it finds is a NUL byte.
2. You can save PHP source code in any ASCII-compatible encoding. If the first 128 code points of an encoding are identical to ASCII, PHP can parse it. All characters that are in any way significant to PHP are within the 128 code points defined by ASCII. If string literals contain any code points beyond that, PHP doesn't care. You can save PHP source code in ISO-8859-1, Mac Roman, UTF-8 or any other ASCII-compatible encoding. The string literals in your script will have whatever encoding you saved your source code as.
3. Any external file you process with PHP can be in whatever encoding you like. If PHP doesn't need to parse it, there are no requirements to meet to keep the PHP parser happy.
$foo = file_get_contents('bar.txt');
The above will simply read the bits in bar.txt into the variable $foo. PHP doesn't try to interpret, convert, encode or otherwise fiddle with the contents. The file can even contain binary data such as an image, PHP doesn't care.
4. If internal and external encodings have to match, they have to match. A common case is localization, where the source code contains something like echo localize('Foobar') and an external localization file contains something along the lines of this:
msgid "Foobar"
msgstr "フーバー"
Both "Foobar" strings need to have an identical bit representation if you want to find the correct
localization. If the source code was saved in ASCII but the localization file in UTF-16, the strings
wouldn't match, Fither some sort of encoding conversion would be necessary or the use of an encoding-
aware string matching function.
The astute reader might ask at this point whether it's possible to save a, say, UTF-16 byte sequence inside a
string literal of an ASCII encoded source code file, to which the answer would be: absolutely.
echo “UTF-16";
If you can bring your text editor to save the echo " and "; parts in ASCII and only UTF-16 in UTF-16, this will work just fine. The necessary binary representation for that looks like this:
01100101 01100011 01101000 01101111 00100000 00100010
e        c        h        o                 "
11111110 11111111 00000000 01010101 00000000 01010100
(UTF-16 marker)    U                 T
00000000 01000110 00000000 00101101 00000000 00110001
F                 -                 1
00000000 00110110 00100010 00111011
6                 "        ;
The first line and the last two bytes are ASCII. The rest is UTF-16 with two bytes per character. The leading 11111110 11111111 on line 2 is a marker required at the start of UTF-16 encoded text (required by the UTF-16 standard, PHP doesn't give a damn). This PHP script will happily output the string "UTF-16" encoded in UTF-16, because it simply outputs the bytes between the two double quotes, which happens to represent the text
"UTF-16" encoded in UTF-16. The source code file is neither completely valid ASCII nor UTF-16 though, so
working with it in a text editor won't be much fun.
Bottom line
PHP supports Unicode, or in fact any encoding, just fine, as long as certain requirements are met to keep the
parser happy and the programmer knows what he's doing. You really only need to be careful when manipulating
strings, which includes slicing, trimming, counting and other operations that need to happen on a character level
rather than a byte level. If you're not "doing anything" with your strings besides reading and outputting them,
you will hardly have any problems with PHP's support of encodings that you wouldn't have in any other
language as well.
Encoding-aware languages
What does it mean for a language to support Unicode then? Javascript for example supports Unicode. In fact, any string in Javascript is UTF-16 encoded. In fact, it's the only thing Javascript deals with. You cannot have a string in Javascript that is not UTF-16 encoded. Javascript worships Unicode to the extent that there's no facility to deal with any other encoding in the core language. Since Javascript is most often run in a browser that's not a problem, since the browser can handle the mundane logistics of encoding and decoding input and output.
Other languages are simply encoding-aware. Internally they store strings in a particular encoding, often UTF-16. In turn they need to be told or try to detect the encoding of everything that has to do with text. They need to know what encoding the source code is saved in, what encoding a file they're supposed to read is in, what encoding you want to output text in; and they convert encodings on the fly as needed with some manifestation of Unicode as the middleman. They're doing the same thing you can/should/need to do in PHP semi-automatically behind the scenes. That's neither better nor worse than PHP, just different. The nice thing about it is that standard language functions that deal with strings Just Work™, while in PHP one needs to spare some attention to whether a string may contain multi-byte characters or not and choose string manipulation functions accordingly.
The depths of Unicode
Since Unicode deals with many different scripts and many different problems, it has a lot of depth to it. For example, the Unicode standard contains information for such problems as CJK ideograph unification. That means information that two or more Chinese/Japanese/Korean characters actually represent the same character in slightly different writing methods. Or rules about converting from lower case to upper case, vice-versa and round-trip, which is not always as straightforward in all scripts as it is in most Western European Latin-derived scripts. Some characters can also be represented using different code points. The letter "ö" for example can be represented using the code point U+00F6 ("LATIN SMALL LETTER O WITH DIAERESIS") or as the two code points U+006F ("LATIN SMALL LETTER O") and U+0308 ("COMBINING DIAERESIS"), that is the letter "o" combined with "¨". In UTF-8 that's either the double-byte sequence 11000011 10110110 or the three-byte sequence 01101111 11001100 10001000, both representing the same human readable character. As such, there are rules governing Normalization within the Unicode standard, i.e. how either of these forms can be converted into the other. This and a lot more is outside the scope of this article, but one should be aware of it.
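If the intl extension is available, its Normalizer class exposes these normalization forms; a minimal sketch with the two spellings of "ö" from above (the \u{} escape assumes PHP 7+):
$precomposed = "\u{00F6}";                      // ö as one code point
$decomposed  = "o\u{0308}";                     // o followed by a combining diaeresis
var_dump($precomposed === $decomposed);                                              // bool(false) - different bytes
var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $precomposed);   // bool(true) after NFC normalization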
Final TL;DR
+ Text is always a sequence of bits which needs to be translated into human readable text using lookup tables. If the wrong lookup table is used, the wrong character is used.
+ You're never actually directly dealing with "characters" or "text", you're always dealing with bits as seen through several layers of abstractions. Incorrect results are a sign of one of the abstraction layers failing.
+ If two systems are talking to each other, they always need to specify what encoding they want to talk to each other in. The simplest example of this is this website telling your browser that it's encoded in UTF-8.
+ In this day and age, the standard encoding is UTF-8 since it can encode virtually any character of interest, is backwards compatible with the de-facto baseline ASCII and is relatively space efficient for the majority of use cases nonetheless.
  o Other encodings still occasionally have their uses, but you should have a concrete reason for wanting to deal with the headaches associated with character sets that can only encode a subset of Unicode.
+ The days of one byte = one character are over and both programmers and programs need to catch up on this.
Now you should really have no excuse anymore the next time you garble some text.
1. Yes, that means ASCII can be stored and transferred using only 7 bits and it often is. No, this is not within the scope of this article and for the sake of argument we'll assume the highest bit is "wasted" in ASCII.
2. And if it isn't, it will be extended. It already has been several times.
3. Please note that when I'm using the term "starting" together with "byte", I mean it from the human-readable point of view.
4. Peruse the UTF-8 specification if you want to follow this with pen and paper.
5. Hey, I'm a programmer, not a biologist.
6. And of course there'll be no recent backup.
7. A "Unicode character" is a code point in the Unicode table. "あ" is not a Unicode character, it's the Hiragana letter あ. There is a Unicode code point for it, but that doesn't make the letter itself a Unicode character. A "UTF-8 character" is an oxymoron, but may be stretched to mean what's technically called a "UTF-8 sequence", which is a byte sequence of one, two, three or four bytes representing one Unicode character. Both terms are often used in the sense of "any letter that ain't part of my keyboard" though, which means absolutely nothing.
8. http://www.php.net/manual/en/function.utf8-encode.php
About the author
David C. Zentgraf is a web developer working partly in Japan and Europe and is a regular on Stack Overflow. If you have feedback, criticism or additions, please feel free to try @deceze on Twitter, take an educated guess at his email address or look it up using time-honored methods. This article was published on kunststube.net. And no, there is no dirty word in "Kunststube".
What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work
With Text by David C. Zentgraf is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike
3.0 Unported License.
Last updated on Monday, April 27th, 2015.