Lecture 1: Encoding Language: LING 1330/2330: Introduction To Computational Linguistics Na-Rae Han
How is language represented on a computer?
The language of computers
At the lowest level, computer language is binary:
Information on a computer is stored in bits
A bit is either: ON (=1, =yes) or OFF (=0, =no)
This language essentially contains two alphabetic characters
Next level up: byte
A byte is made up of a sequence of 8 bits
ex. 01001101
Historically, a byte was the number of bits used to encode a single character of text in a computer
The byte is the basic addressable unit in most computer architectures
Encoding a written language
How to represent a text with 0s and 1s?
Hello world!
010010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001
Each character is mapped to a code point (= character code), i.e., a unique integer.
H → 72 (dec)
e → 101 (dec)
Each code point is represented as a binary number, using a fixed number of bits.
8 bits == 1 byte in the example above
H → 72 (dec) → 01001000 (2^6 + 2^3 = 64 + 8 = 72)
e → 101 (dec) → 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)
One byte can represent 256 (= 2^8) different characters:
00000000 = 0 (dec) … 11111111 = 255 (dec)
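A minimal sketch of this character-to-bits mapping in Python (my own illustration, not part of the slides):

```python
# Each character -> code point -> 8-bit binary string.
text = "Hello world!"

for ch in text:
    print(ch, ord(ch), format(ord(ch), '08b'))   # e.g. H 72 01001000

# The whole text as one continuous bit string (as shown above):
print(''.join(format(ord(ch), '08b') for ch in text))
```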
ASCII encoding for English
How many bits are needed to encode English?
26 lowercase letters: a, b, c, d, e, …
26 uppercase letters: A, B, C, D, E, …
10 Arabic digits: 0, 1, 2, 3, 4, …
Punctuation: . , : ; ? ! ' "
Symbols: ( ) < > & % * $ + -
We are already up to 80
6 bits (2^6 = 64) is not enough; we will need at least 7 (2^7 = 128)
ASCII (the American Standard Code for Information Interchange) did just that
Uses a 7-bit code (= 128 characters) for storing English text
Range: 0 to 127
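A quick sanity check of that bit count (my own illustration, not from the slides): every printable ASCII character indeed fits within 7 bits.

```python
import string

# Every printable ASCII character has a code point below 128,
# so 7 bits are enough (and 6 bits, i.e. 64 slots, are not).
print(len(string.printable))                      # 100 characters
print(max(ord(ch) for ch in string.printable))    # 126 < 128
```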
The ASCII chart
https://en.wikipedia.org/wiki/ASCII
http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf
Decimal   Binary (7-bit)   Character
0         000 0000         (NULL)
…         …                …
35        010 0011         #
36        010 0100         $
…         …                …
48        011 0000         0
49        011 0001         1
50        011 0010         2
…         …                …
65        100 0001         A
66        100 0010         B
67        100 0011         C
…         …                …
97        110 0001         a
98        110 0010         b
99        110 0011         c
…         …                …
127       111 1111         (DEL)
ASCII (the American Standard Code for Information Interchange)
Practice
What is this English text?
01001000 01101001 00100001
Note: byte (=8-bit) ASCII representation instead of 7-bit
Space provided for your convenience only!
Answer: Hi!
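A sketch of how the decoding could be checked in Python (my own illustration):

```python
# Split an 8-bit ASCII bit string into bytes and map each back to a character.
bits = "01001000 01101001 00100001".replace(" ", "")

chars = [chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8)]
print(''.join(chars))     # -> Hi!
```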
Extending ASCII: ISO-8859, etc.
ASCII (7-bit, 128 characters) was sufficient for encoding English. But what about characters used in other languages?
Solution: Extend ASCII to 8 bits (= 256 characters) and use the additional 128 slots for non-English characters
ISO-8859 has 16 different variants!
ISO-8859-1 aka Latin-1: French, German, Spanish, etc.
ISO-8859-7 Greek alphabet
ISO-8859-8 Hebrew alphabet
JIS X 0208: Japanese characters
Problem: overlapping character code space.
224 (dec) means à in Latin-1 but א in ISO-8859-8!
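The clash is easy to reproduce (my own illustration of the slide's point): the same byte value, 224, decodes to different characters depending on which encoding is assumed.

```python
# One byte, two readings.
b = bytes([224])                  # the single byte 0xE0 (224 decimal)

print(b.decode('latin-1'))        # à  (ISO-8859-1)
print(b.decode('iso8859-8'))      # א  (ISO-8859-8, Hebrew aleph)
```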
The problem with multiple encoding systems
Problem: Multiple coding systems map different characters to the same character code
Solution 1: Provide meta-information on coding system
Ex. MIME (Multipurpose Internet Mail Extensions)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
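One way a program might act on this meta-information (a minimal sketch; the header text is from the slide, the parsing is my own simplistic assumption):

```python
# Use the charset declared in a MIME Content-Type header to decode raw bytes.
header = "Content-Type: text/plain; charset=US-ASCII"
raw = b"Hello world!"

charset = header.split("charset=")[1].strip()   # -> 'US-ASCII' (naive parsing)
print(raw.decode(charset))                      # -> Hello world!
```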
How big is Unicode?
Version 9.0 (2016) has codes for 128,237 characters
The full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!
In reality, only 21 bits are needed
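A quick check of the 21-bit claim (illustration only): the largest Unicode code point is U+10FFFF.

```python
import sys

# Unicode code points run from 0 to U+10FFFF.
print(hex(sys.maxunicode))              # 0x10ffff
print(sys.maxunicode.bit_length())      # 21 -- only 21 bits are really needed
```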
8-bit, 16-bit, 32-bit
UTF-32 (32 bits/4 bytes): direct representation
UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
Wait! But how do you represent all 2^32 (≈ 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
You don't.
In reality, only 21 bits (2^21 ≈ 2 million slots) are ever utilized, for roughly 128K characters.
UTF-8 and UTF-16 use a variable-width encoding.
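A small sketch of what "variable width" means in practice (my own illustration): the same characters cost different numbers of bytes in UTF-8, UTF-16, and UTF-32.

```python
# Bytes per character under each Unicode encoding form.
for ch in ['H', 'é', '한']:                 # ASCII, accented Latin, Korean Hangul
    print(ch,
          len(ch.encode('utf-8')),         # 1, 2, 3 bytes (variable width)
          len(ch.encode('utf-16-be')),     # 2, 2, 2 bytes
          len(ch.encode('utf-32-be')))     # 4, 4, 4 bytes
```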
Variable-width encoding
'H' as 1 byte (8 bits): 01001000
cf. 'H' as 2 bytes (16 bits): 0000000001001000
UTF-8 as a variable-width encoding:
ASCII characters get encoded with just 1 byte
ASCII is originally 7-bit, so the highest bit is always 0 in an 8-bit encoding
All other characters are encoded with multiple bytes
How to tell them apart? The highest bit of each byte is used as a flag.
Highest bit 0: a single-byte (ASCII) character
Highest bit 1: part of a multi-byte character
01001000 11001001 10001000 01101001 01101001
('H', then one two-byte character, then 'i', 'i')
Advantage for English: ASCII text stored one byte per character is already valid UTF-8!
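A sketch of those flag bits in Python (my own illustration; 'É' is just an example non-ASCII character, not necessarily the one from the slide):

```python
# Encode a mixed string as UTF-8 and inspect each byte's highest bit.
data = "HÉii".encode('utf-8')              # -> b'H\xc3\x89ii'

for byte in data:
    flag = "part of a multi-byte character" if byte & 0b10000000 else "single ASCII byte"
    print(format(byte, '08b'), flag)
```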
A look at Unicode chart
How to find your Unicode character:
http://www.unicode.org/standard/where/
http://www.unicode.org/charts/
The Unicode chart gives the code point for 'M' as "004D".
But what kind of number is "004D"?
Another representation: hexadecimal
Hexadecimal (hex) = base-16
Utilizes 16 characters: 0123456789ABCDEF
Designed for human readability & easy byte conversion
2^4 = 16: 1 hexadecimal digit is equivalent to 4 bits
1 byte (= 8 bits) is encoded with just 2 hex characters!
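A few hex round-trips in Python (illustration only), connecting "004D" back to the character M:

```python
# Hexadecimal <-> decimal <-> character, using 004D from the Unicode chart.
print(int('004D', 16))     # 77  (the decimal code point)
print(chr(0x004D))         # M
print(hex(ord('M')))       # 0x4d
print('\u004D')            # M again, via a Unicode escape
```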