Working with Unicode
(with emphasis on Python 2)
“❤❤❤” vs “â¤â¤â¤”
Where can Unicode encoding issues happen?
EVERYWHERE
● Python 2 (“str” vs “unicode”)
● Python 2 Libraries (“csv” module doesn’t write Unicode 😠)
● Erroneously encoded files (“Windows-1252” or “ASCII”)
● MySQL connections (“latin1” vs “utf8” vs “utf8mb4”)
○ MySQL Workbench only does “utf8”, which is only a subset of “utf8mb4”
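A rough way to see the difference between “utf8” and “utf8mb4”: MySQL's “utf8” stores at most 3 bytes per character, while “utf8mb4” allows 4. The sketch below is written for Python 3, and fits_in_mysql_utf8 is a made-up helper for illustration, not part of any MySQL client.

```python
def fits_in_mysql_utf8(text):
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper for illustration; it only measures byte widths
    and never talks to a database.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_in_mysql_utf8("hi\u732b"))    # True: the cat character needs 3 bytes
print(fits_in_mysql_utf8("\U0001F604"))  # False: the emoji needs 4 bytes
```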
What the heck is an encoding⸮❔?⁉፧¿
All data is stored in the computer as bits, usually grouped into chunks of 8
(bytes).
For convenience, instead of writing binary, we write in hexadecimal:
01101000 01100101 01101100 01101100 01101111
68 65 6C 6C 6F
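To check the two rows above yourself, here is a small Python 3 sketch that prints the same five bytes in binary and in hexadecimal. The variable names are only for illustration.

```python
# The five bytes behind the word "hello".
data = b"hello"

# Iterating over bytes in Python 3 yields integers, one per byte.
binary_view = " ".join(format(byte, "08b") for byte in data)
hex_view = " ".join(format(byte, "02X") for byte in data)

print(binary_view)  # 01101000 01100101 01101100 01101100 01101111
print(hex_view)     # 68 65 6C 6C 6F
```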
“But if all data is stored as 1s and 0s, how do we see letters????”
Character Encoding 1: ASCII and Windows-1252
ASCII Example:
ASCII is a single-byte character encoding:
1 byte -> 1 character
Hexadecimal (what we store): 68 65 6C 6C 6F
Interpreted as ASCII: h e l l o
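The same mapping as a Python 3 sketch: turn the hex digits into bytes, then decode them as ASCII. Every byte becomes exactly one character.

```python
# bytes.fromhex ignores the spaces between the hex pairs.
raw = bytes.fromhex("68 65 6C 6C 6F")

# Decoding turns stored bytes into text using the named encoding.
text = raw.decode("ascii")

print(text)                 # hello
print(len(raw), len(text))  # 5 5  (1 byte -> 1 character)
```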
Character Encoding 2: UTF-8
Variable-width encoding:
one character can be represented with 1-4 bytes
ASCII characters have the same single-byte representation in UTF-8, so
ASCII-encoded files can be decoded as UTF-8 without any fiasco.
(Backwards compatibility)
Note: UTF-8’s 8 refers to the 8-bit size of its code units, as opposed to UTF-16 which uses 16-bit and UTF-32 which
uses 32-bit
UTF-8 Example
Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84
UTF-8: 你 好 ! 😄
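A Python 3 sketch of the example above: decode the 11 bytes as UTF-8, then count how many bytes each character takes when encoded again. This is only an illustration of the variable width, assuming the bytes really are valid UTF-8.

```python
raw = bytes.fromhex("E4 BD A0 E5 A5 BD 21 F0 9F 98 84")
text = raw.decode("utf-8")

# Re-encode each character on its own to see how many bytes it needs.
widths = [len(ch.encode("utf-8")) for ch in text]

print(text)                 # 你好!😄
print(len(raw), len(text))  # 11 4
print(widths)               # [3, 3, 1, 4]
```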
Wrong encoding?
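When you use a single-byte encoding such as latin1 to interpret data meant to be interpreted with UTF-8
(strict ASCII cannot decode bytes above 7F at all):
Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84
UTF-8: 你 好 ! 😄
Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84
latin1: ä ½ å ¥ ½ ! ð
(A0, 9F, 98 and 84 are non-printing characters in ISO-8859-1, so they do not show up.)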
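The mix-up can be reproduced in Python 3. In this sketch the same bytes are decoded three ways: as UTF-8 (correct), as latin-1 (gibberish, one character per byte), and as strict ASCII (which refuses bytes above 7F).

```python
raw = bytes.fromhex("E4 BD A0 E5 A5 BD 21 F0 9F 98 84")

as_utf8 = raw.decode("utf-8")      # the intended text, 4 characters
as_latin1 = raw.decode("latin-1")  # gibberish: every byte becomes a character

# Strict ASCII cannot decode these bytes at all.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(as_utf8), len(as_latin1))  # 4 11
print(ascii_failed)                  # True
```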
If you see “gibberish” in your data...
See if you can fix the encoding from the source.
If stored in MySQL already, see if this command works:
select convert(binary convert(field_name using latin1) using utf8) from
table_name
https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8
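The same round trip can be sketched in Python 3 for data that is already in your program. fix_mojibake is a made-up helper name, and it only works under one assumption: the text was UTF-8 that got decoded as latin-1, and no bytes were lost along the way.

```python
def fix_mojibake(garbled, wrong_encoding="latin-1"):
    """Undo a wrong decode: get the original bytes back, then decode as UTF-8.

    Hypothetical helper; raises an error if the assumption does not hold.
    """
    return garbled.encode(wrong_encoding).decode("utf-8")


# Simulate the damage: UTF-8 bytes mistakenly read as latin-1.
original = "\u4f60\u597d!\U0001F604"
garbled = original.encode("utf-8").decode("latin-1")

repaired = fix_mojibake(garbled)
print(repaired == original)  # True
```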
Python2
Use Python 3 if you can. Will save you from needing to dive into the seventh
circle of hell.
Python 2’s str is a series of bytes.
chinese.txt: “hi猫”
>>> text = open('chinese.txt').read()
>>> text
'hi\xe7\x8c\xab'
>>> type(text)
<type 'str'>
>>> len(text)
5
Example from http://www.pgbovine.net/unicode-python.htm
Python2’s Unicode
>>> unicode_text = text.decode('utf-8')
>>> type(unicode_text)
<type 'unicode'>
>>> unicode_text
u'hi\u732b'
>>> len(unicode_text)
3
>>> unicode_text[0]
u'h'
>>> unicode_text[1]
u'i'
>>> unicode_text[2]
u'\u732b'
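For comparison, here is roughly the same session as a Python 3 sketch. It assumes the file content is UTF-8. In Python 3 the bytes type plays the role of Python 2's str, and str plays the role of unicode.

```python
# Stands in for reading chinese.txt in binary mode.
raw = "hi\u732b".encode("utf-8")

# Decoding gives text, where one item is one character.
text = raw.decode("utf-8")

print(raw)                  # b'hi\xe7\x8c\xab'
print(len(raw), len(text))  # 5 3
print(text[2] == "\u732b")  # True
```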
Working in Unicode… Now I want to write my data!
>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'
>>> u"äöü"
u'\xe4\xf6\xfc'
>>> str(u"äöü")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2:
ordinal not in range(128)
Unicode to String: Encode!
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
String to Unicode: Decode!
>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf8')
u'\xe4\xf6\xfc'
or
>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf8')
u'\xe4\xf6\xfc'
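The encode/decode pair as a Python 3 sketch. The ASCII attempt at the end is an explicit version of what str(u"äöü") does implicitly in Python 2.

```python
text = "\xe4\xf6\xfc"  # the three characters ä, ö, ü

encoded = text.encode("utf-8")     # text -> bytes
decoded = encoded.decode("utf-8")  # bytes -> text

# ASCII has no way to represent these characters, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(encoded)          # b'\xc3\xa4\xc3\xb6\xc3\xbc'
print(decoded == text)  # True
print(ascii_ok)         # False
```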
Why can’t I print this Unicode???
>>> x = u'\u732b'
>>> type(x)
<type 'unicode'>
>>> print x
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u732b' in
position 0: ordinal not in range(128)
Some terminals have sys.stdout.encoding set to “US-ASCII”, so Python tries to
encode everything written to stdout as ASCII.
Don’t print unicode objects directly, as that can raise an exception. Encode to a str first and print that.
What should I do to print my pretty Unicode text?
>>> x = u'\u732b'
>>> import sys
>>> sys.stdout.encoding
'US-ASCII'
>>> x.encode('utf-8')
'\xe7\x8c\xab'
>>> print x.encode('utf-8')
猫
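The terminal behaviour can be simulated without touching the real stdout. In this Python 3 sketch, io.TextIOWrapper around an in-memory buffer stands in for a terminal with a given encoding. The names ascii_terminal and utf8_terminal are only illustrative.

```python
import io

text = "\u732b"

# A fake terminal that only accepts ASCII, like sys.stdout on some systems.
ascii_terminal = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_terminal.write(text)
    ascii_terminal.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A fake terminal that encodes output as UTF-8.
utf8_buffer = io.BytesIO()
utf8_terminal = io.TextIOWrapper(utf8_buffer, encoding="utf-8")
utf8_terminal.write(text)
utf8_terminal.flush()
written = utf8_buffer.getvalue()

print(ascii_failed)  # True
print(written)       # b'\xe7\x8c\xab'
```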
Takeaways
● Unicode → Str : Encode
● Str → Unicode : Decode
● Working with text in Python2? The moment you read the data in, decode
it into unicode objects.
○ NEVER COMBINE UNICODE WITH STR
● Finished working with your (properly decoded) unicode objects? Encode
them into str objects again.
● Always know what encoding you’re working with.
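One way to follow these takeaways is to let the file object do the decoding and encoding at the edges. The sketch below uses io.open with an explicit encoding. It is written for Python 3, and the temporary file path is just an assumption for the demo.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "chinese.txt")

# Output edge: text is encoded to UTF-8 bytes as it is written.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"hi\u732b")

# Input edge: bytes are decoded to text as they are read.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode shows what is actually stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()

print(len(text), len(raw))  # 3 5
```

Inside the program everything stays text, and the encoding is named in exactly two places.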
Notes: Python2 escape sequences
\x
The next two characters are interpreted as the hex digits of a character
code
\u
The next four characters are interpreted as the hex digits of the Unicode
code point
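A quick Python 3 sketch of the two escapes. It also uses \U with eight hex digits, which is not covered above but is needed for code points that do not fit in four digits, such as the emoji.

```python
# In a text string, \xe4 and \u00e4 name the same character.
same_char = "\xe4" == "\u00e4"

cat = "\u732b"        # four hex digits
emoji = "\U0001F604"  # eight hex digits, for code points above FFFF

# In a bytes literal, each \x escape is one raw byte.
raw = b"\xe7\x8c\xab"

print(same_char)      # True
print(hex(ord(cat)))  # 0x732b
print(len(emoji))     # 1
print(len(raw))       # 3
```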
Links & References
http://kunststube.net/encoding/
http://www.pgbovine.net/unicode-python.htm
https://pythonhosted.org/kitchen/unicode-frustrations.html
https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8
https://docs.python.org/2/howto/unicode.html
https://docs.python.org/2/tutorial/introduction.html#unicode-strings