Lecture 1: Encoding Language

LING 1330/2330: Introduction to Computational Linguistics

Na-Rae Han
Objectives
• Understand the fundamentals of how language is encoded on a computer
• Text encoding systems
  - ASCII
  - ISO-8859
  - Unicode

1/12/2017 2
How is language represented on a computer?

• Natural ("human") languages:
  - Spoken form
  - Written form
  *Also: sign languages
• The language of computers:

The language of computers
• At the lowest level, computer language is binary: information on a computer is stored in bits
  - A bit is either ON (=1, =yes) or OFF (=0, =no)
  - This language essentially contains two alphabetic characters
• Next level up: byte
  - A byte is made up of a sequence of 8 bits
  - ex. 01001101
  - Historically, a byte was the number of bits used to encode a single character of text in a computer
  - The byte is the basic addressable unit in most computer architectures
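A small sketch of the bit/byte idea in Python, using the example byte from this slide. The variable names are only illustrative; `int(..., 2)` and `format(..., "08b")` are standard built-ins.

```python
# A byte written out as a string of eight bits (the example from the slide).
bits = "01001101"

# Read the bit string as a base-2 number.
value = int(bits, 2)

# Go the other way: format the integer as 8 binary digits, zero-padded.
round_trip = format(value, "08b")

# Each extra bit doubles the number of patterns, so 8 bits give 2**8 of them.
patterns_per_byte = 2 ** len(bits)

print(value, round_trip, patterns_per_byte)
```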

Encoding a written language
• How to represent a text with 0s and 1s?
  - Hello world!
  - 010010000110010101101100011011000110111100100000011101110110111101110010011011000110010000100001
• Each character is mapped to a code point (= character code), i.e., a unique integer.
  - H → 72 (decimal)
  - e → 101 (decimal)
• Each code point is represented as a binary number, using a fixed number of bits.
  - 8 bits == 1 byte in the example above
  - H → 72 (decimal) → 01001000 (2^6 + 2^3 = 64 + 8 = 72)
  - e → 101 (decimal) → 01100101 (2^6 + 2^5 + 2^2 + 2^0 = 64 + 32 + 4 + 1 = 101)
• One byte can represent 256 (= 2^8) different characters
  - 00000000 → 0 (decimal)    11111111 → 255 (decimal)
ASCII encoding for English
• How many bits are needed to encode English?
  - 26 lowercase letters: a, b, c, d, e, …
  - 26 uppercase letters: A, B, C, D, E, …
  - 10 Arabic digits: 0, 1, 2, 3, 4, …
  - Punctuation: . , : ; ? ! ' "
  - Symbols: ( ) < > & % * $ + -
  - We are already up to 80
  - 6 bits (2^6 = 64) is not enough; we will need at least 7 (2^7 = 128)
• ASCII (the American Standard Code for Information Interchange) did just that
  - Uses 7-bit code (= 128 characters) for storing English text
  - Range 0 to 127
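The counting argument above can be checked with a few lines of Python. The category counts are read off the slide's lists; the dictionary and variable names are only illustrative.

```python
# Character counts taken from the lists on this slide.
counts = {
    "lowercase": 26,
    "uppercase": 26,
    "digits": 10,
    "punctuation": 8,   # . , : ; ? ! ' "
    "symbols": 10,      # ( ) < > & % * $ + -
}
total = sum(counts.values())

# Smallest n with 2**n >= total: the bit length of (total - 1).
bits_needed = (total - 1).bit_length()

print(total, bits_needed, 2 ** bits_needed)
```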

The ASCII chart
• https://en.wikipedia.org/wiki/ASCII
• http://web.alfredstate.edu/weimandn/miscellaneous/ascii/ASCII%20Conversion%20Chart.pdf

Decimal  Binary (7-bit)  Character
0        000 0000        (NULL)
…        …               …
35       010 0011        #
36       010 0100        $
…        …               …
48       011 0000        0
49       011 0001        1
50       011 0010        2
…        …               …
65       100 0001        A
66       100 0010        B
67       100 0011        C
…        …               …
97       110 0001        a
98       110 0010        b
99       110 0011        c
…        …               …
127      111 1111        (DEL)
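Rows of the chart can be regenerated in Python as a sanity check. This is a minimal sketch; `chart_row` is a hypothetical helper, and the column widths are arbitrary.

```python
def chart_row(code: int) -> str:
    """Format one chart row: decimal, 7-bit binary (3+4 grouping), character."""
    bits = format(code, "07b")          # 7 binary digits, zero-padded
    return f"{code:<8}{bits[:3]} {bits[3:]}   {chr(code)!r}"


for code in (35, 36, 48, 65, 97):
    print(chart_row(code))
```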
ASCII (the American Standard Code for Information Interchange)

• The ASCII encoding scheme
  - First published in 1963
  - Uses 7-bit code (= 128 characters) for storing English text, ranging from 0 to 127
  - In an 8-bit (1 byte) representation, the highest bit is always 0
• Printable characters
  - Upper and lower case Roman alphabet
  - Digits
  - Punctuation marks, symbols, and space
• Includes 33 non-printing characters (codes 0 to 31, plus 127)
  - Control characters: BELL, ACKNOWLEDGE, BACKSPACE, DELETE, etc. → originally for teletypewriters, many obsolete now
  - WHITESPACE characters: TAB, LINE FEED, CARRIAGE RETURN

Practice
• What is this English text?
  - Note: byte (= 8-bit) ASCII representation instead of 7-bit
  - Spaces provided for your convenience only!

01001000 01101001 00100001

• Answer:
Hi!
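The practice exercise can also be checked mechanically. A minimal Python sketch, with `decode_ascii_bits` as an illustrative helper name:

```python
def decode_ascii_bits(bit_text: str) -> str:
    """Turn space-separated 8-bit groups back into characters."""
    # int(group, 2) reads each group as a binary number; chr() maps it
    # back to the character with that code point.
    return "".join(chr(int(group, 2)) for group in bit_text.split())


answer = decode_ascii_bits("01001000 01101001 00100001")
print(answer)
```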

Extending ASCII: ISO-8859, etc.
• ASCII (= 7 bits, 128 characters) was sufficient for encoding English. But what about characters used in other languages?
• Solution: Extend ASCII to 8 bits (= 256 characters) and use the additional 128 slots for non-English characters
  - ISO-8859: has 15 different parts, numbered up to ISO-8859-16!
    ISO-8859-1 aka Latin-1: French, German, Spanish, etc.
    ISO-8859-7: Greek alphabet
    ISO-8859-8: Hebrew alphabet
  - JIS X 0208: Japanese characters
• Problem: overlapping character code space.
  224 (decimal) means à in Latin-1 but א in ISO-8859-8!
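The overlap is easy to reproduce: decode the same single byte under two of these encodings. A short Python sketch using the standard `iso-8859-1` and `iso-8859-8` codecs (variable names are illustrative):

```python
raw = bytes([224])                     # one byte with value 224 (hex E0)

as_latin1 = raw.decode("iso-8859-1")   # Western European reading
as_hebrew = raw.decode("iso-8859-8")   # Hebrew reading

# Same byte, two different characters, depending on the assumed encoding.
print(as_latin1, as_hebrew)
print(hex(ord(as_latin1)), hex(ord(as_hebrew)))
```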

The problem with multiple encoding systems
Problem: Multiple coding systems map different characters to the same character code
• Solution 1: Provide meta-information on the coding system
  - Ex. MIME (Multipurpose Internet Mail Extensions)
      Mime-Version: 1.0
      Content-Type: text/plain; charset=US-ASCII
      Content-Transfer-Encoding: 7bit
  - But what if your message contains characters from multiple coding systems?
• Solution 2: Have a single universal code system for all writing systems → UNICODE
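For Solution 1, here is a sketch of how a program might read the declared charset from the MIME headers shown above, using Python's standard `email` package. The message body "Hello world!" is an assumption added so that there is something to parse.

```python
from email import message_from_string

# Headers copied from the slide; the body line is a made-up placeholder.
raw_message = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=US-ASCII\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Hello world!\n"
)

msg = message_from_string(raw_message)
charset = msg.get_content_charset()   # charset parameter, lower-cased
body = msg.get_payload()

print(msg.get_content_type(), charset)
print(body)
```

A message carries one such declaration per body part, which is why text mixing several legacy coding systems is awkward under this approach.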
Unicode
• A character encoding standard developed by the Unicode Consortium
• Provides a single representation for all the world's writing systems
• "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
  (http://www.unicode.org)

How big is Unicode?
• Version 9.0 (2016) has codes for 128,237 characters
• A full 32-bit (4-byte) representation could encode 2^32 = 4,294,967,296 characters!
  - In reality, only 21 bits are needed (the code space ends at U+10FFFF)
• Unicode has three encoding forms
  - UTF-32 (32-bit/4-byte units): direct representation
  - UTF-16 (16-bit/2-byte units): 2^16 = 65,536 possibilities per unit
  - UTF-8 (8-bit/1-byte units): 2^8 = 256 possibilities per unit
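The "32 bits available, 21 bits needed" arithmetic can be verified in Python. `sys.maxunicode` is the highest code point Python accepts (0x10FFFF in Python 3.3 and later); the other names are illustrative.

```python
import sys

max_code_point = sys.maxunicode                 # highest valid code point
bits_required = max_code_point.bit_length()     # bits needed to write it
values_in_32_bits = 2 ** 32                     # what 4 bytes could hold

print(hex(max_code_point), bits_required, values_in_32_bits)
```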

8-bit, 16-bit, 32-bit
• UTF-32 (32-bit/4-byte units): direct representation
• UTF-16 (16-bit/2-byte units): 2^16 = 65,536 possibilities per unit
• UTF-8 (8-bit/1-byte units): 2^8 = 256 possibilities per unit
• Wait! But how do you represent all 2^32 (= 4 billion) code points with only one byte (UTF-8: 2^8 = 256 slots)?
  - You don't.
  - In reality, only 21 bits (2^21 = 2,097,152 values) are ever used, for about 128K characters.
  - UTF-8 and UTF-16 use a variable-width encoding.
• Why UTF-16 and UTF-8?
  - They are more compact (more so for certain languages, e.g., English)
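To see the compactness difference, one can compare how many bytes the same text takes in each encoding form. A minimal Python sketch; `encoded_sizes` is an illustrative helper, and the `-le` codec variants are chosen so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Bytes needed for the same text in each Unicode encoding form."""
    # The "-le" (little-endian) codecs omit the byte-order mark, so only
    # the characters themselves are counted.
    names = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in names}


print(encoded_sizes("Hello"))        # English text
print(encoded_sizes("\u05d0"))       # Hebrew alef
print(encoded_sizes("\U0001D122"))   # a character beyond U+FFFF
```

English text is smallest in UTF-8, while UTF-32 always spends 4 bytes per character.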

Variable-width encoding
• 'H' as 1 byte (8 bits): 01001000
  cf. 'H' as 2 bytes (16 bits): 0000000001001000
• UTF-8 as a variable-width encoding
  - ASCII characters get encoded with just 1 byte
    ASCII is originally 7 bits, so the highest bit is always 0 in an 8-bit encoding
  - All other characters are encoded with multiple bytes
  - How to tell? The highest bit is used as a flag.
    Highest bit 0: single-byte character
    Highest bit 1: part of a multi-byte character
    01001000 11000011 10001001 01101001 01101001 (H, then É as two bytes, then i, i)
• Advantage for English: 8-bit ASCII is already valid UTF-8!
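The highest-bit flag can be observed directly by encoding a short string as UTF-8 and inspecting each byte. A Python sketch, with `describe_utf8_bytes` as a hypothetical helper and "HÉi" as a made-up sample string:

```python
def describe_utf8_bytes(text: str) -> list:
    """Pair each UTF-8 byte (as 8 bits) with what its highest bit signals."""
    rows = []
    for byte in text.encode("utf-8"):
        # Values below 0x80 have highest bit 0: a one-byte (ASCII) character.
        kind = "single-byte" if byte < 0x80 else "multi-byte"
        rows.append((format(byte, "08b"), kind))
    return rows


for bits, kind in describe_utf8_bytes("H\u00c9i"):
    print(bits, kind)

# Pure ASCII text produces identical bytes under both encodings.
same_bytes = "Hi!".encode("ascii") == "Hi!".encode("utf-8")
print(same_bytes)
```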
A look at the Unicode chart
• How to find your Unicode character:
  - http://www.unicode.org/standard/where/
  - http://www.unicode.org/charts/
• Basic Latin (ASCII)
  - http://www.unicode.org/charts/PDF/U0000.pdf

Code point for M. But "004D"?

Another representation: hexadecimal
Hexadecimal (hex) = base-16
• Utilizes 16 characters: 0123456789ABCDEF
• Designed for human readability & easy byte conversion
  - 2^4 = 16: 1 hexadecimal digit is equivalent to 4 bits
  - 1 byte (= 8 bits) is encoded with just 2 hex chars!

Letter  Base-10 (decimal)  Base-2 (binary)       Base-16 (hex)
M       77                 0000 0000 0100 1101   004D

• Unicode characters are usually referenced by their hexadecimal code
  - Lower-number characters go by their 4-char hex codes (2 bytes), e.g. U+004D ("M"; U+ designates Unicode)
  - Higher-number characters go by 5 or 6 hex digits, e.g. U+1D122 (http://www.unicode.org/charts/PDF/U1D100.pdf)
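The decimal, binary, and hex views of a code point, and the U+ notation, can be produced in Python as below. `u_plus` is an illustrative helper name, not a standard function.

```python
def u_plus(ch: str) -> str:
    """Format a character's code point in U+ notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


m_decimal = ord("M")
m_binary = format(m_decimal, "016b")   # 16 binary digits
m_hex = format(m_decimal, "04X")       # 4 hex digits, one per 4 bits

print(m_decimal, m_binary, m_hex)
print(u_plus("M"))
print(u_plus(chr(0x1D122)))            # a code point that needs 5 hex digits
```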
