Lecture 1: Encoding Language: LING 1330/2330: Introduction To Computational Linguistics Na-Rae Han

1) The document discusses different encoding systems for representing language on computers, from the basic binary representation to ASCII, ISO-8859, and Unicode. 2) ASCII uses 7-bit encoding for English text but other languages required extended 8-bit encodings like ISO-8859. However, these created conflicts due to overlapping character codes. 3) Unicode provides a single universal character encoding that assigns a unique code to every character across all languages, with various encoding forms like UTF-8, UTF-16, and UTF-32.

Lecture 1: Encoding Language

LING 1330/2330: Introduction to Computational Linguistics


Na-Rae Han
Objectives
 Understand the fundamentals of how language is
encoded on a computer
 Text encoding systems
 ASCII
 ISO-8859
 Unicode

1/12/2017 2
How is language represented on a
computer?

 Natural ("Human") languages:
 Spoken form
 Written form
*Also: sign languages
 The language of computers:

The language of computers
 At the lowest level, computer language is binary:
Information on a computer is stored in bits
 A bit is either: ON (=1, =yes) or OFF (=0, =no)
 This language essentially has an alphabet of just two characters: 0 and 1
 Next level up: byte
 A byte is made up of a sequence of 8 bits
 ex. 01001101
 Historically, a byte was the number of bits used to
encode a single character of text in a computer
 The byte is the basic addressable unit in most computer architectures

Encoding a written language
 How to represent a text with 0s and 1s?
 Hello world!
 01001000011001010110110001101100011011110010000001
1101110110111101110010011011000110010000100001
 Each character is mapped to a code point (= character code), i.e., a unique integer.
 H → 72 (decimal)
 e → 101 (decimal)
 Each code point is represented as a binary number, using a fixed number of bits.
 8 bits == 1 byte in the example above
 H → 72 (decimal) → 01001000 (2^6 + 2^3 = 64 + 8 = 72)
 e → 101 (decimal) → 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)
 One byte can represent 256 (= 2^8) different characters
 00000000 → 0 (decimal) … 11111111 → 255 (decimal)
ASCII encoding for English
 How many bits are needed to encode English?
 26 lowercase letters: a, b, c, d, e, …
 26 uppercase letters: A, B, C, D, E, …
 10 Arabic digits: 0, 1, 2, 3, 4, …
 Punctuation: . , : ; ? ! ' "
 Symbols: ( ) < > & % * $ + -
 We are already up to 80
 6 bits (2^6 = 64) is not enough; we will need at least 7 (2^7 = 128)
 ASCII (the American Standard Code for Information
Interchange) did just that
 Uses 7-bit code (= 128 characters) for storing English text
 Range 0 to 127
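The bit-count reasoning above can be checked with a short Python sketch. The function name `bits_needed` is an illustrative choice, and the inventory counts are the ones tallied on this slide.

```python
def bits_needed(n_symbols: int) -> int:
    """Smallest number of bits b such that 2**b >= n_symbols."""
    # The largest code we must store is n_symbols - 1 (codes start at 0),
    # and int.bit_length() tells how many bits that integer occupies.
    return (n_symbols - 1).bit_length()


# lowercase + uppercase + digits + punctuation + symbols, as counted above
inventory = 26 + 26 + 10 + 8 + 10
print(inventory)               # 80
print(bits_needed(inventory))  # 7
print(2 ** 6, 2 ** 7)          # 64 128
```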

The ASCII chart
 https://en.wikipedia.org/wiki/ASCII
 http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf
Decimal   Binary (7-bit)   Character
0         000 0000         (NULL)
…         …                …
35        010 0011         #
36        010 0100         $
…         …                …
48        011 0000         0
49        011 0001         1
50        011 0010         2
…         …                …
65        100 0001         A
66        100 0010         B
67        100 0011         C
…         …                …
97        110 0001         a
98        110 0010         b
99        110 0011         c
…         …                …
127       111 1111         (DEL)
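Rows of the ASCII chart can be regenerated in Python as a quick check. This is a minimal sketch; `ascii_row` is a hypothetical helper, and the column layout is just one possible rendering of the chart.

```python
def ascii_row(code: int) -> str:
    """Format one chart row: decimal code, 7-bit binary, and the character."""
    bits = format(code, "07b")            # 7-bit binary, zero-padded
    # Show the 7 bits as a 3-bit group and a 4-bit group, as in the chart.
    # repr() keeps non-printing characters visible as escapes.
    return f"{code:>3}  {bits[:3]} {bits[3:]}  {chr(code)!r}"


for code in (35, 36, 48, 65, 97):
    print(ascii_row(code))
```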
ASCII (the American Standard Code for Information
Interchange)

 The ASCII encoding scheme


 First published in 1963
 Uses 7-bit code (= 128 characters) for storing English text,
ranging from 0 to 127
 In an 8-bit (1 byte) representation, the highest bit is always 0
 Printable characters
 Upper and lower case roman alphabet
 Digits
 Punctuation marks, symbols, and space
 Includes 33 non-printing characters (codes 0–31 and 127)
 Control characters: BELL, ACKNOWLEDGE, BACKSPACE, DELETE, etc. → originally for typewriters, many obsolete now
 WHITESPACE characters: TAB, LINE FEED, CARRIAGE RETURN

Practice
 What is this English text?
 Note: byte (= 8-bit) ASCII representation is used instead of 7-bit
 Spaces are provided for your convenience only!

01001000 01101001 00100001

 Answer:
Hi!
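The practice answer can be verified in Python. This is a small sketch; `bits_to_text` is a name invented here, and it assumes the bytes are separated by spaces as on the slide.

```python
def bits_to_text(bits: str) -> str:
    """Decode space-separated 8-bit binary groups as ASCII text."""
    # int(group, 2) reads a string of 0s and 1s as a base-2 number.
    codes = bytes(int(group, 2) for group in bits.split())
    return codes.decode("ascii")


print(bits_to_text("01001000 01101001 00100001"))  # Hi!
```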

Extending ASCII: ISO-8859, etc.
 ASCII (=7 bit, 128 characters) was sufficient for encoding
English. But what about characters used in other
languages?
 Solution: Extend ASCII into 8-bit (=256 characters) and use
the additional 128 slots for non-English characters
 ISO-8859: comes in 15 different parts (numbered 1 to 16; part 12 was abandoned)!
 ISO-8859-1 aka Latin-1: French, German, Spanish, etc.
 ISO-8859-7 Greek alphabet
 ISO-8859-8 Hebrew alphabet
 JIS X 0208: Japanese characters
 Problem: overlapping character code space. 224 (decimal) means à in Latin-1 but א in ISO-8859-8!
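The overlap is easy to reproduce: the same byte decodes to different characters depending on which encoding is assumed. The sketch below uses Python's built-in codecs `latin-1` and `iso8859_8`; the variable name `raw` is arbitrary.

```python
raw = bytes([224])                 # one byte with decimal value 224

print(raw.decode("latin-1"))       # à
print(raw.decode("iso8859_8"))     # א

# The byte itself carries no hint about which reading is intended.
print(format(raw[0], "08b"))       # 11100000
```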

The problem with multiple encoding
systems
Problem: Multiple coding systems map different characters
to the same character code
 Solution 1: Provide meta-information on coding system
 Ex. MIME (Multipurpose Internet Mail Extensions)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

 But what if your message contains characters from multiple coding systems?

 Solution 2: Have a single universal code system for all writing systems → Unicode
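As an illustration of Solution 1, a program can read the declared charset from the MIME headers before decoding the body. This sketch uses Python's standard `email` package; the message text is a made-up example built around the three header lines shown above.

```python
from email import message_from_string

# A hypothetical message: the three headers from the slide plus a short body.
raw_message = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=US-ASCII\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Hello world!\n"
)

msg = message_from_string(raw_message)
charset = msg.get_content_charset()   # charset parameter, lower-cased
print(msg.get_content_type())         # text/plain
print(charset)                        # us-ascii
```

A single `charset` value covers the whole body, which is why this approach struggles when one message mixes characters from several coding systems.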
Unicode
 A character encoding standard developed by the Unicode
Consortium
 Provides a single representation for all the world's writing systems

 "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." (http://www.unicode.org)

How big is Unicode?
 Version 9.0 (2016) has codes for 128,237 characters
 Full Unicode standard uses 32 bits (4 bytes): it can represent 2^32 = 4,294,967,296 characters!
 In reality, only 21 bits are needed: code points run from U+0000 to U+10FFFF

 Unicode has three encoding forms
 UTF-32 (32 bits/4 bytes): direct representation
 UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
 UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities
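The "only 21 bits are needed" remark can be checked directly. The sketch relies on the fact that the highest Unicode code point is U+10FFFF; the constant name `MAX_CODE_POINT` is chosen for this example.

```python
import sys

MAX_CODE_POINT = 0x10FFFF             # highest code point Unicode allows

print(MAX_CODE_POINT.bit_length())    # 21
print(2 ** 21)                        # 2097152
print(2 ** 32)                        # 4294967296

# Python exposes the same upper limit for its own strings.
print(sys.maxunicode == MAX_CODE_POINT)   # True
```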

8-bit, 16-bit, 32-bit
 UTF-32 (32 bits/4 bytes): direct representation
 UTF-16 (16 bits/2 bytes): 2^16 = 65,536 possibilities
 UTF-8 (8 bits/1 byte): 2^8 = 256 possibilities

 Wait! But how do you represent all of 2^32 (= 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
 You don't.
 In reality, only 21 bits (2^21 = 2,097,152 slots) are ever utilized for 128K characters.
 UTF-8 and UTF-16 use a variable-width encoding.

 Why UTF-16 and UTF-8?
 They are more compact (more so for certain languages, e.g., English)
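The compactness claim can be made concrete by counting bytes. In this sketch `encoded_sizes` is a hypothetical helper; the `-le` codec variants are used so that Python does not prepend a byte-order mark, which would inflate the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Number of bytes the text occupies in each Unicode encoding form."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("Hello"))         # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("\ud55c\uae00"))  # two Korean syllables: 6, 4 and 8 bytes
print(encoded_sizes("\U0001D122"))    # U+1D122: 4 bytes in all three forms
```

English text is smallest in UTF-8, while the Korean example is smaller in UTF-16, so which form is most compact depends on the language.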

Variable-width encoding
 'H' as 1 byte (8 bits): 01001000
cf. 'H' as 2 bytes (16 bits): 0000000001001000
 UTF-8 as a variable-width encoding
 ASCII characters get encoded with just 1 byte
 ASCII was originally 7-bit, so the highest bit is always 0 in an 8-bit encoding
 All other characters are encoded with multiple bytes
 How to tell? The highest bit is used as a flag.
 Highest bit 0: single-byte character
 Highest bit 1: part of a multi-byte character
 ex. 01001000 11001001 10001000 01101001 01101001 (bytes 2 and 3 together form one two-byte character)
 Advantage for English: 8-bit ASCII is already valid UTF-8!
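The flag-bit idea can be tried on the five example bytes. This is a sketch: `byte_role` is an invented helper, and its distinction between lead bytes (starting with 11) and continuation bytes (starting with 10) is a UTF-8 detail that goes one step beyond the slide.

```python
data = bytes([0b01001000, 0b11001001, 0b10001000, 0b01101001, 0b01101001])


def byte_role(b: int) -> str:
    """Classify one UTF-8 byte by its highest bits."""
    if (b & 0b10000000) == 0:
        return "single"        # highest bit 0: a one-byte (ASCII) character
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx: a later byte of a multi-byte character
    return "lead"              # 11xxxxxx: first byte of a multi-byte character


print([byte_role(b) for b in data])
# ['single', 'lead', 'continuation', 'single', 'single']

text = data.decode("utf-8")
print(len(data), len(text))    # 5 4  (five bytes, four characters)

# Plain ASCII text is byte-for-byte identical in UTF-8.
print("Hi!".encode("ascii") == "Hi!".encode("utf-8"))   # True
```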
A look at Unicode chart
 How to find your Unicode character:
 http://www.unicode.org/standard/where/
 http://www.unicode.org/charts/

 Basic Latin (ASCII)
 http://www.unicode.org/charts/PDF/U0000.pdf

Code point for M. But "004D"?

Another representation: hexadecimal
Hexadecimal (hex) = base-16
 Utilizes 16 characters: 0123456789ABCDEF
 Designed for human readability & easy byte conversion
 24=16: 1 hexadecimal digit is equivalent to 4 bits
 1 byte (=8 bits) is encoded with just 2 hex chars!

Letter   Base-10 (decimal)   Base-2 (binary)        Base-16 (hex)
M        77                  0000 0000 0100 1101    004D

 Unicode characters are usually referenced by their hexadecimal code
 Lower-number characters go by their 4-digit hex codes (2 bytes), e.g. U+004D ("M"; U+ designates Unicode)
 Higher-number characters go by 5- or 6-digit hex codes, e.g. U+1D122 (http://www.unicode.org/charts/PDF/U1D100.pdf)
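The decimal, binary, and hex views of a code point can be converted back and forth in Python. This is a minimal sketch; `u_plus` is a hypothetical helper that writes a character's code point in the U+ notation described above.

```python
def u_plus(ch: str) -> str:
    """Write a character's code point in U+ notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


print(ord("M"))                  # 77
print(format(ord("M"), "016b"))  # 0000000001001101
print(u_plus("M"))               # U+004D
print(int("004D", 16))           # 77
print(u_plus("\U0001D122"))      # U+1D122
```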
