Working With Unicode

Working with Unicode can be challenging due to encoding issues that can occur everywhere from Python code to files and databases. Unicode represents text as characters rather than bytes, using encodings like UTF-8. In Python 2, strings are byte sequences while Unicode is used for text, requiring explicit encoding and decoding. Always work with Unicode internally and encode to strings only for I/O operations like printing. Knowing the encoding is also crucial, as incorrectly assuming the encoding can lead to "gibberish".

Uploaded by

PhillyPu

Working with Unicode

(with emphasis on Python 2)

“❤❤❤” vs “â¤â¤â¤”
Where can Unicode encoding issues happen?
EVERYWHERE

● Python 2 (“str” vs “unicode”)

● Python 2 Libraries (“csv” module doesn’t write Unicode 😠)
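● Erroneously encoded files (“Windows-1252” or “ASCII”)
● MySQL connections (“latin1” vs “utf8” vs “utf8mb4”)
○ MySQL Workbench only does “utf8”, which is only a subset of “utf8mb4” (MySQL’s “utf8” stores at most 3 bytes per character, so 4-byte characters such as emoji do not fit)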
What the heck is an encoding⸮❔?⁉፧¿
All data is stored in the computer as bits, usually grouped into chunks of 8 (bytes).

For convenience, instead of writing binary, we write in hexadecimal:

01101000 01100101 01101100 01101100 01101111

68 65 6C 6C 6F
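A small sketch of the same idea in code. It is written for Python 3 (an assumption, since the slides target Python 2), and the variable names are made up for illustration. It prints the five bytes above first as binary and then as hexadecimal.

```python
# Python 3 sketch: the same five bytes viewed as binary and as hexadecimal.
data = b"hello"  # a bytes object, i.e. raw 8-bit values

# format(byte, "08b") -> 8 binary digits, format(byte, "02X") -> 2 hex digits
binary_view = " ".join(format(byte, "08b") for byte in data)
hex_view = " ".join(format(byte, "02X") for byte in data)

print(binary_view)
print(hex_view)
```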

“But if all data is stored as 1s and 0s, how do we see letters????”
Character Encoding 1: ASCII and Windows-1252
ASCII Example:
ASCII is a single-byte character encoding:

1 byte -> 1 character

Hexadecimal (what we store): 68 65 6C 6C 6F

Interpreted as ASCII: h e l l o
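The lookup above can be tried directly. This is a Python 3 sketch (an assumption, the slides use Python 2) with illustrative variable names. It also shows that ASCII only covers byte values 00 to 7F, so anything above that fails to decode.

```python
# Python 3 sketch: interpret stored bytes as ASCII, one byte per character.
raw = bytes.fromhex("68 65 6C 6C 6F")  # the stored bytes from the slide
text = raw.decode("ascii")

print(text)                 # the decoded characters
print(len(raw), len(text))  # same length: 1 byte -> 1 character

# ASCII defines only values 0x00-0x7F, so a byte such as 0xE4 is rejected.
high_byte_error = None
try:
    b"\xe4".decode("ascii")
except UnicodeDecodeError as exc:
    high_byte_error = exc
```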
Character Encoding 2: UTF-8
Variable-width encoding:

one character can be represented with 1-4 bytes

ASCII characters have exactly the same byte representation in UTF-8.

ASCII-encoded files can be decoded with UTF-8 without any fiasco.

(Backwards compatibility)

Note: The 8 in UTF-8 refers to the 8-bit size of its code units, as opposed to UTF-16, which uses 16-bit code units, and UTF-32, which uses 32-bit code units.
UTF-8 Example

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

UTF-8: 你好 ! 😄
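To see the variable width, the bytes above can be decoded and each character re-encoded on its own. This is a Python 3 sketch (an assumption, the slides use Python 2), and the names `raw`, `text` and `widths` are illustrative only.

```python
# Python 3 sketch: decode the slide's bytes as UTF-8 and count bytes per character.
raw = bytes.fromhex("E4 BD A0 E5 A5 BD 21 F0 9F 98 84")
text = raw.decode("utf-8")

# Re-encode each character separately to see how many bytes it occupies.
widths = [(char, len(char.encode("utf-8"))) for char in text]

for char, width in widths:
    print(repr(char), width)

# Plain ASCII text produces identical bytes under both encodings.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")
```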
Wrong encoding?
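When you use a single-byte encoding such as latin1 to interpret data that was encoded as UTF-8 (ASCII itself only defines bytes 00 to 7F, so tools usually fall back to latin1 or Windows-1252):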

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

UTF-8: 你好 ! 😄

Hex: E4 BD A0 E5 A5 BD 21 F0 9F 98 84

latin1: ä ½ å ¥ ½ ! ð (bytes A0, 9F, 98 and 84 become a non-breaking space and invisible control characters)
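The mix-up can be reproduced on purpose. This is a Python 3 sketch (an assumption, the slides use Python 2) with illustrative variable names. It encodes text as UTF-8 and then decodes the same bytes as latin1, which yields one wrong character per byte.

```python
# Python 3 sketch: produce "gibberish" by decoding UTF-8 bytes as latin1.
original = "你好!😄"
raw = original.encode("utf-8")   # 11 bytes
garbled = raw.decode("latin-1")  # latin1 maps every byte to one character

print(len(original), len(raw), len(garbled))
print(repr(garbled))

# The hearts from the title slide go wrong the same way: each heart becomes
# "â", an invisible control character, and "¤".
hearts = "❤❤❤".encode("utf-8").decode("latin-1")
print(repr(hearts))
```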
If you see “gibberish” in your data...
See if you can fix the encoding from the source.

If stored in MySQL already, see if this command works:

select convert(binary convert(field_name using latin1) using utf8) from table_name

https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8
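The same round trip can be sketched outside the database. This is Python 3 (an assumption, the slides use Python 2), and `repair_mojibake` is a hypothetical helper name, not something from the slides or from any library. It assumes the text was UTF-8 that got decoded as latin1 exactly once. If that assumption is wrong, the call raises an error or returns different gibberish.

```python
# Python 3 sketch: undo a single wrong decode (UTF-8 bytes read as latin1).
def repair_mojibake(garbled, wrong="latin-1", right="utf-8"):
    """Hypothetical helper: turn the text back into the original bytes using
    the wrongly assumed encoding, then decode those bytes with the right one."""
    return garbled.encode(wrong).decode(right)

# Simulate the damage, then repair it.
damaged = "hi猫".encode("utf-8").decode("latin-1")
repaired = repair_mojibake(damaged)

print(repr(damaged))
print(repaired)
```

Always try this on a copy of the data first, since text that was never garbled will not survive the round trip.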
Python2
Use Python 3 if you can. It will save you from a dive into the seventh circle of hell.

Python 2’s str is a series of bytes.

chinese.txt: “hi猫”

>>> text = open('chinese.txt').read()
>>> text
'hi\xe7\x8c\xab'
>>> type(text)
<type 'str'>
>>> len(text)
5

Example from http://www.pgbovine.net/unicode-python.htm
Python2’s Unicode
>>> unicode_text = text.decode('utf-8')
>>> type(unicode_text)
<type 'unicode'>
>>> unicode_text
u'hi\u732b'
>>> len(unicode_text)
3
>>> unicode_text[0]
u'h'
>>> unicode_text[1]
u'i'
>>> unicode_text[2]
u'\u732b'
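For comparison, here is roughly the same session as a Python 3 script. This is a sketch under that assumption: in Python 3 the `bytes` type plays the role of Python 2's `str`, and `str` plays the role of `unicode`. The temporary file stands in for chinese.txt.

```python
# Python 3 sketch: bytes on disk versus decoded text.
import os
import tempfile

# Create a stand-in for chinese.txt containing "hi" plus the cat character.
path = os.path.join(tempfile.mkdtemp(), "chinese.txt")
with open(path, "wb") as handle:
    handle.write(b"hi\xe7\x8c\xab")

# Binary mode returns the raw bytes, like Python 2's plain open().read().
with open(path, "rb") as handle:
    raw = handle.read()

text = raw.decode("utf-8")  # bytes -> text

print(len(raw), len(text))  # byte count versus character count
print(repr(text[2]))        # indexing now yields whole characters
```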
Working in Unicode… Now I want to write my data!
>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'

>>> u"äöü"
u'\xe4\xf6\xfc'
>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Unicode to String: Encode!
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
String to Unicode: Decode!
>>> '\xc3\xa4\xc3\xb6\xc3\xbc'.decode('utf8')
u'\xe4\xf6\xfc'

>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf8')

u'\xe4\xf6\xfc'
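The encode and decode pair looks much the same in Python 3. The sketch below assumes Python 3, where the text type is `str` and the byte type is `bytes`. It also reproduces the ASCII failure shown earlier by asking for the ASCII codec explicitly.

```python
# Python 3 sketch: encode text to bytes, decode bytes back to text.
text = "äöü"

encoded = text.encode("utf-8")     # text -> bytes
decoded = encoded.decode("utf-8")  # bytes -> text

print(encoded)
print(decoded == text)

# Python 2's str(u"äöü") implicitly used the ASCII codec. Requesting it
# explicitly fails the same way.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
print(ascii_error)
```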
Why can’t I print this Unicode???
>>> x = u'\u732b'
>>> type(x)
<type 'unicode'>
>>> print x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u732b' in position 0: ordinal not in range(128)

On some terminals sys.stdout.encoding is “US-ASCII”, so print tries to encode unicode output as ASCII.

Don’t print unicode objects directly, as that can raise an exception. Encode to str first and print the str.

What should I do to print my pretty Unicode text?
>>> x = u'\u732b'
>>> import sys
>>> sys.stdout.encoding
'US-ASCII'
>>> x.encode('utf-8')
'\xe7\x8c\xab'
>>> print x.encode('utf-8')
猫
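One way to make output robust is to look up the stream's encoding and encode with a replacement policy instead of letting the write fail. This is a Python 3 sketch (an assumption, the slides use Python 2). `encode_for_stream` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for terminals with different encodings.

```python
# Python 3 sketch: encode text for a given output stream without raising.
import sys
from types import SimpleNamespace

def encode_for_stream(text, stream=None, fallback="utf-8"):
    """Hypothetical helper: encode text with the stream's declared encoding,
    replacing characters that encoding cannot represent with '?'."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, errors="replace")

cat = "\u732b"
ascii_terminal = SimpleNamespace(encoding="US-ASCII")  # stand-in terminal
utf8_terminal = SimpleNamespace(encoding="utf-8")      # stand-in terminal

ascii_bytes = encode_for_stream(cat, ascii_terminal)
utf8_bytes = encode_for_stream(cat, utf8_terminal)
print(ascii_bytes, utf8_bytes)
```

Replacing loses information, so this suits log or debug output better than data you need to keep.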
Takeaways
● Unicode → Str : Encode
● Str → Unicode : Decode
● Working with text in Python2? The moment you read the data in, decode it into unicode objects.
○ NEVER COMBINE UNICODE WITH STR
● Finished working with your unicode objects? Encode them into str objects again before writing or printing.
● Always know what encoding you’re working with.
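The takeaways amount to: decode at the input edge, work on text in the middle, encode at the output edge. The sketch below is written for Python 3 (an assumption), but `io.open` with an `encoding` argument also exists in Python 2, where it returns unicode objects. The file names and the uppercase step are illustrative only.

```python
# Python 3 sketch: decode on input, work on text, encode on output.
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "in.txt")   # illustrative file names
target_path = os.path.join(workdir, "out.txt")

# Set up an input file with known UTF-8 bytes.
with io.open(source_path, "wb") as handle:
    handle.write(b"hi\xe7\x8c\xab")

# Input edge: state the encoding, receive text.
with io.open(source_path, "r", encoding="utf-8") as handle:
    text = handle.read()

# Middle: operate on characters, never on raw bytes.
shouted = text.upper()

# Output edge: state the encoding again, bytes are produced on write.
with io.open(target_path, "w", encoding="utf-8") as handle:
    handle.write(shouted)

with io.open(target_path, "rb") as handle:
    written = handle.read()
print(written)
```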
Notes: Python2 escape sequences
\x

Next two characters should be interpreted as hex digits for a character code

\u

Next four characters should be interpreted as hex digits for the ordinal number (code point) of the Unicode character
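Both escapes can be checked with `ord`. This is a Python 3 sketch (an assumption, the slides use Python 2). Python 3 adds one twist worth knowing: inside a text literal `\x` names a character, while inside a bytes literal it names a raw byte.

```python
# Python 3 sketch: \x takes two hex digits, \u takes four.
a_umlaut = "\xe4"  # character with code 0xE4
cat = "\u732b"     # character with code point 0x732B

print(hex(ord(a_umlaut)), hex(ord(cat)))

# In a bytes literal, \xe4 is the single raw byte 0xE4, which is not the
# UTF-8 encoding of the character U+00E4.
raw_byte = b"\xe4"
utf8_bytes = a_umlaut.encode("utf-8")
print(raw_byte, utf8_bytes)
```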
Links & References
http://kunststube.net/encoding/

http://www.pgbovine.net/unicode-python.htm

https://pythonhosted.org/kitchen/unicode-frustrations.html

https://stackoverflow.com/questions/20151835/how-to-convert-wrongly-encoded-data-to-utf-8

https://docs.python.org/2/howto/unicode.html

https://docs.python.org/2/tutorial/introduction.html#unicode-strings
