03 - Unicode Characters and Strings - en
world," and out comes hello world. It'd be nice if that was super simple. In 1970,
it was simple because there was pretty much one character set. Even in 1970, when I
started, we didn't even have lowercase characters. We just had uppercase
characters, and I'll tell you we were happy when we just had uppercase characters.
You kids these days with your lowercase characters, and numbers, and slashes, and
stuff. So, the problem that computers have is that they have to come up with a way to store letters, but computers don't actually understand letters; what computers understand is numbers. So, we had to come up with a mapping between letters and numbers, and there have been many mappings historically. The most common mapping from the 1980s is this mapping called ASCII, the American Standard Code for Information Interchange, and it basically says this number equals this letter. So for example, in Hello World, the number for capital H is 72. Somebody just decided that capital H was going to be 72, lowercase e is 101, and new line is 10. So if you were really and truly going to look at
what was going on inside the computer, it's storing these numbers. But the problem is that there are only 128 of these, which means you can't fit every character into the range 0 through 127. So, in the early days, we just dealt with whatever characters were
possible. Like I said, when I started you could only do uppercase, you couldn't
even do lowercase. So, as long as you're dealing with simple values, there is this function that lets you ask, "Hey, what is the actual value for the letter H?" It's called ord, which stands for ordinal. What's the ordinal, the number corresponding to H? That's 72. What's the number corresponding to lowercase e? It's 101. And what's the number corresponding to new line? That's 10. Remember, new line is a single character. This also explains why the lowercase letters are all greater than the uppercase letters: it's because of their ordinals. There are so many character sets now, but just for the default old-school 128 characters that we could represent with ASCII, the uppercase letters had lower ordinals than the lowercase letters. So, "Hi" is less than "zzz", all lowercase, and that's because all uppercase letters are less than all lowercase letters. Actually, that could have been "aaa"; that's what I should've said there, okay. So, don't worry about that, just know that they are all numbers.
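Here is a small sketch in Python of those ordinals, just to make the idea concrete:

    # ord() returns the numeric ordinal of a single character
    print(ord('H'))      # 72  - capital H
    print(ord('e'))      # 101 - lowercase letters have larger ordinals
    print(ord('\n'))     # 10  - new line is a single character
    print('Hi' < 'zzz')  # True - uppercase ordinals sort before lowercase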
In the early days, life was simple. We would store every character in a byte of memory, otherwise known as 8 bits of memory. It's the same thing as when you say you have a many-gigabyte USB stick: a 16 gigabyte USB stick means there are 16 billion bytes of memory on there, which means we could have put 16 billion characters on it back in the old days. Okay? The thing is, in the old days we just had so few characters that we could put one character in a byte. So, the ord function tells us the numeric value of a simple ASCII character. Like I said, if you take a look at this, the e is 101, the capital H is 72, and then there's the new line, which is also called
line feed; that's 10. Now, we can represent these in hexadecimal, which is base 16, or octal, which is base 8, or actual binary, which is what's really going on and is nothing but zeros and ones. This, 0001010, is just the binary for 10, and so these three are just alternate versions of the same numbers. The numbers go up to 127, and if you look at the binary, you can see that it's actually seven bits of binary, and it starts at all zeros and goes up to all ones. Zeros and ones are what computers always do; if you go all the way back to the hardware, the little wires and stuff, the wires are carrying zeros and ones.
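If you want to see those alternate representations yourself, here is a quick sketch using Python's built-in conversion functions:

    n = ord('\n')            # 10, the ordinal for new line
    print(hex(n))            # 0xa    - hexadecimal, base 16
    print(oct(n))            # 0o12   - octal, base 8
    print(bin(n))            # 0b1010 - binary, nothing but zeros and ones
    print(format(n, '07b'))  # 0001010 - padded to the seven bits ASCII uses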
So, this is what we did in the 60s and 70s: whatever we were capable of squeezing in, we were totally happy with, and we weren't going to have anything tricky. Like I said, early in my undergraduate career I started to see lowercase letters and I'm like, "Oh, that's really beautiful," lowercase letters. Now, the real world is nothing like this. There are all kinds of characters, and they had to come up with a scheme by which we could map these characters, and for a while there were a whole bunch of incompatible ways to represent characters beyond this ASCII, also known as the Latin character set: Arabic character sets, Asian character sets, and so on. These
other character sets just completely invented their own way of representing characters, and so you had these situations where, you know, Japanese computers pretty much couldn't
talk to American computers or European computers at all. I mean the Japanese
computers just had their own way of representing characters and American computers
had their own way of representing characters, and they just couldn't talk. But then they invented this thing called Unicode. Unicode is this universal code for millions of different characters across hundreds of different character sets, so that instead of saying, "Oh sorry, there's no room for your language from some South Sea island," it's okay, we've got space in Unicode for that. So, Unicode has lots and lots of characters, not 128, lots and lots of characters. So, there was a time,
like I said in the 70s or the 80s, where everyone had something different, and even into the early 2000s. What happened was, as the Internet came out, it became an important issue to have a way to exchange data. We had to say, "Oh well, it's not sufficient for Japanese computers to only talk to Japanese computers and American computers to only talk to American computers, when Japanese and American computers need to exchange data." So, they built these character sets, and so
there is Unicode which is this abstraction of all different possible characters and
there are different ways of representing them inside of computers. So, there's a
couple of simple things that you might think are good ideas that turn out to be not
such good ideas, although they're used. So, UTF-16, UTF-32, and UTF-8 are basically ways of representing a larger set of characters. Now,
the gigantic one is 32 bits, which is four bytes. That's four times as much data for a single character, and so that's quite a lot of data. You're dividing the number of characters by four, so if this is a 16 gigabyte stick it can only handle four billion characters, right? Four bytes per character, and so that's not so efficient. Then there's a compromise, UTF-16, at two bytes, but then you have to pick: one of these can do all the characters, one can do lots of character sets. But it turns out that even though you might instinctively think that UTF-32 is better than UTF-16 and UTF-8 is the worst, UTF-8 is actually the best. UTF-8 basically says a character is either going to be one, two, three, or four bytes, and there are little marks that tell it when to go from one to four. The nice thing about it is that UTF-8 overlaps with ASCII. Right? So, if the only characters you're putting in are from the original ASCII or Latin-1 character set, then UTF-8 and ASCII are literally the same thing. Then it uses special bytes that aren't part of ASCII to indicate flipping from one-byte characters to two-byte, three-byte, or four-byte characters. So, it's variable length. You can automatically detect it: you can just be reading through a string and say, "Whoa, I just saw this weird marker character, I must be in UTF-8." Then, if I'm in UTF-8, I can expand this and represent all those character sets, and all the characters in those character sets.
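As a rough sketch of that variable-length idea, you can ask Python how many bytes the UTF-8 encoding of a character takes:

    # ASCII characters take one byte in UTF-8; other characters take two, three, or four
    for ch in ['A', 'é', '€', '😀']:
        print(ch, len(ch.encode('utf-8')))   # prints 1, 2, 3, and 4 respectively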
So, what happened was, they went through all these things, and as you can see from this graph, the graph doesn't really say much other than the fact that UTF-8 is awesome, and getting awesomer, and every other way of representing data is becoming less awesome, right? This is 2012, so that's a long time ago. So, this is like, UTF-8 rocks. That's really because as soon as these ideas came out, it was really clear that UTF-8 is the best practice for encoding data moving between systems, and that's why we're talking about this right now. Finally, with this networking, we're doing sockets, we're moving data between systems. So, your American computer might be talking to a computer in Japan, and you've got to know what character set is coming out, right? You might be getting Japanese characters, even though everything I've shown you is non-Japanese characters, non-Asian characters, or whatever, right? So, UTF-8 turns out to be the best practice. If you're moving a file between systems, or if you're moving network data between two systems, the world recommends UTF-8, okay? So, if you think about your computer, inside your computer, the
strings that are inside your Python, like x equals 'hello world', we don't really care what their internal format is. If there is a file, usually the Python running on the computer and the file have the same character set; it might be UTF-8 inside Python and UTF-8 inside the file. But we don't care: you open a file, and that's why we didn't have to talk about this when we were opening files, even though you might someday encounter a file in a character set different from your normal one; it's rare. So, files are inside the computer, strings are inside the computer, but network connections are not inside the computer, and when we get to databases, we're going to see they're not inside the computer either. So, this is also something that's changed
from Python two to Python three, it was actually a big deal, a big thing. Most
people think it's great, I actually
think it's great. Some people are grumpy about it, but I think those people just
are people that fear change. So, there were two kinds of strings in Python two: there was a normal old string and a Unicode string. You could see that Python two would make a string constant, and that's type str, and it would make a Unicode constant by prefixing a u before the quote. That was a separate type, and then you had to convert back and forth between Unicode and strings. What we've done in Python three is, this is a regular string and this is a Unicode string, but you'll notice they're both strings. It means that inside the world of Python, if you're pulling stuff in you might have to convert it, but inside Python everything is Unicode. You don't have to worry about it; every string is the same, whether it has Asian characters, or Latin characters, or Spanish characters, or French characters, it's just fine. So, this simplifies things.
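A tiny sketch of that in Python three: a plain string constant and a u-prefixed string constant are the same type.

    regular = 'Hello world'
    unicode_string = u'Hello world'
    print(type(regular))              # <class 'str'>
    print(type(unicode_string))       # <class 'str'>
    print(regular == unicode_string)  # True - in Python 3 they are the same thing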
But then there are certain things that we're going to have to be responsible for. So, there is one kind of string that we haven't used yet, but it becomes important, and it's present in both Python two and Python three. Remember how I said in the old days a character and a byte are the same thing? So, there's always been a thing called a byte string, and we denote this by prefixing a b before the quote, and that says, "This is a string of bytes that represent characters." If you look at a byte string in Python two, and then you look at a regular string in Python two, they're both type str; the bytes are the same as the string, and Unicode is different. So, these two are the same in Python two, and these two are different in Python two. I am not doing a very good job of drawing that picture. So, the byte string and the regular string are the same, and the regular string and the Unicode string are different. Then what happened is, in Python three, the regular string and the Unicode string are the same, and now the byte string and the regular string are different, okay? So, bytes turn out to be raw, undecoded data; it might be UTF-8, it might be UTF-16, it might be ASCII. We don't know what it is, we don't know what its encoding is. It turns out that this is the thing we have to manage when dealing with data from the outside.
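Here is a small sketch showing that in Python three a byte string really is a different type from a regular string:

    byte_string = b'Hello world'   # raw bytes with no known encoding
    regular = 'Hello world'        # a Unicode string
    print(type(byte_string))       # <class 'bytes'>
    print(type(regular))           # <class 'str'>
    print(byte_string == regular)  # False - bytes and str are never equal in Python 3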
So, in Python three, all the strings internally are Unicode, not UTF-8, not UTF-16, not UTF-32, and if you just open a file it pretty much usually works; if you talk to a network, now we have to understand this. The key thing is we have to decode this stuff; we have to realize what the character set of the stuff we're pulling in is. Now, the beauty is, because 99 percent or maybe 100 percent of the stuff you're ever going to run across just uses UTF-8, it turns out to be relatively simple. So, there's this little decode operation. If
you look at this code right here, when we talk to an external resource, we get a byte array back; the socket gives us an array of bytes, which are characters, but they need to be decoded. We don't necessarily know whether we have UTF-8, UTF-16, or ASCII. So, there is this function that's part of byte arrays, and data.decode says, "Figure this thing out." The nice thing is you can tell it what character set it is, but by default it assumes UTF-8 or ASCII, because ASCII and UTF-8 are upwards compatible with one another. So, if it's old data you're probably getting ASCII, and if it's newer data you're probably getting UTF-8. It's very rare that you get anything other than those two, so you almost never have to tell it what it is, right? You just say decode it; it might be ASCII, it might be UTF-8, but whatever it is, by the time decode is done with it, it's a string, and it's all Unicode inside. So, this is bytes and this is Unicode: decode goes from bytes to Unicode.
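As a minimal sketch, assuming the bytes came back from something like a socket's recv(), decode looks like this:

    data = b'Hello world'  # pretend this is what recv() handed back
    mystr = data.decode()  # defaults to UTF-8, same as data.decode('utf-8')
    print(type(data))      # <class 'bytes'>
    print(type(mystr))     # <class 'str'> - now it's Unicode inside Python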
You can also see, when we're looking at the sending of the data, that we're going to turn it into bytes: encode takes this string and makes it into bytes. So, this is going to be bytes that are properly encoded in UTF-8. Again, you could have put 'utf-8' here, but it just assumes UTF-8, and this string is all ASCII, so it actually doesn't change anything. But that's okay. Then, we're sending the bytes out as the command.
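Here is a quick sketch of that; the GET line is just an example command, and because the default encoding is UTF-8, passing 'utf-8' explicitly produces exactly the same bytes:

    cmd = 'GET /romeo.txt HTTP/1.0\r\n\r\n'  # an example command string
    data = cmd.encode()                      # assumes UTF-8
    same = cmd.encode('utf-8')               # says it explicitly
    print(data == same)                      # True
    print(type(data))                        # <class 'bytes'> - ready to send on a socket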
So, we have to send the stuff out and then we receive; we decode what we receive, and we encode what we send. Now, out in this world is where the UTF-8 is; in here, we just have Unicode, and so before we do the send, and after we receive, we have to encode and decode this stuff so that it all works out correctly. You can look at the documentation for both encode and decode. Decode is a method in the bytes class, and you can see that it takes an encoding; you can tell it the data is something other than UTF-8, but the default is UTF-8, which is probably all you're ever going to use. The same is true for strings: they can be encoded, using UTF-8, into a byte array, and then we send that byte array out to the outside world. It sounds more complex
than it is. So, after all that, think of it this way. On the way out, we have an internal string; before we send it, we have to encode it, and then we send it. Getting stuff back, we receive it, and it comes back as bytes; we happen to know it's UTF-8, or we're letting it automatically detect UTF-8, and we decode it, and now we have a string.
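Putting it all together, here is a minimal sketch of the whole round trip over a socket; it assumes an example server, data.pr4e.org, is reachable on port 80 and serving romeo.txt:

    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))

    # On the way out: encode the internal string into bytes, then send
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    mysock.send(cmd)

    # On the way in: receive bytes, decode them back into a Unicode string
    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print(data.decode(), end='')

    mysock.close()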
Now, internally inside of Python, we can write files and do all this stuff in and out, and it all works together; it's just that out there it's UTF-8, or question mark, question mark, because that's the outside world. So, you have to look at your program and say, okay, "When am I talking to the outside world?" Well, in this case, it's when I'm talking to a socket, right? I'm talking to a socket, so I have to know enough to encode and decode as I go in and out of the socket. So, it looks weird when you first start seeing these encodes and decodes, but they actually make sense. They're like this barrier between the outside world and our inside world, so that inside, our data is all completely consistent, and we can mix strings from various sources without regard to the character set of those strings. So, what we're going to do now is rewrite that program. It's a short program, but we're going to make it even shorter.