Text Encoding
ASCII
• ASCII stands for American Standard Code for Information Interchange. It is a character encoding that lets
computers and devices process letters, numbers and other characters.
• Each character is represented by a 7-bit binary number, that is, a sequence of seven 0s and 1s.
• Seven bits give 2^7 = 128 possible codes, so there are 128 characters that can be represented using ASCII.
• The markup syntax of HTML (HyperText Markup Language) is built from ASCII characters.
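To make the 7-bit representation concrete, here is a small Python sketch that prints the ASCII code and 7-bit pattern of a few characters. The helper name `ascii_bits` is made up for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary pattern of a single ASCII character."""
    code = ord(ch)  # numeric code of the character
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")  # zero-padded to 7 binary digits


for ch in ("A", "a", "0", " "):
    print(repr(ch), ord(ch), ascii_bits(ch))
# 'A' 65 1000001
# 'a' 97 1100001
# '0' 48 0110000
# ' ' 32 0100000
```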
E-ASCII
• There are also non-standard extensions to ASCII, sometimes referred to as extended ASCII.
• These are schemes where the additional codes that arose from an 8-bit system were
allocated to represent additional characters.
• However, such schemes varied from country to country, so they were not very useful for global
communications. In modern coding schemes only the first 128 codes are retained, allowing
compatibility with the original ASCII coding scheme.
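The incompatibility between extended ASCII schemes can be seen with a short Python sketch. It decodes the same two bytes with three 8-bit code pages, chosen here only as examples and named by their Python codec names (Western European, Cyrillic, and Greek).

```python
raw = bytes([0x41, 0xE9])  # 0x41 is below 128; 0xE9 is above it

# Decode the same bytes under three different 8-bit code pages.
codecs_to_try = ("latin-1", "cp1251", "iso8859_7")
decoded = {codec: raw.decode(codec) for codec in codecs_to_try}

for codec, text in decoded.items():
    print(codec, text)
```

The first byte decodes to A every time because it lies in the shared 128-code ASCII range. The second byte becomes é, й, or ι depending on which code page the reader assumes.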
Problem with ASCII
• The problem with ASCII is that it only allows you to represent a small number of characters (128 for standard 7-bit
ASCII).
• This might be enough to represent the characters in the English alphabet, but it is not sufficient to represent all of the
languages and scripts in the world, and all of the possible numbers and symbols.
• For example, ASCII cannot represent the many thousands of characters in the scripts below with just 7 bits, and the 8 bits of extended ASCII are still far too few.
• Chinese characters 汉字
• Japanese characters 漢字
• Cyrillic Кириллица
• Gujarati ગુજરાતી
• Urdu اردو
• Greek ελληνικά
• Nepali नेपाली
• Moreover, the widespread use of the World Wide Web made it more important to have a universal international
coding system, as the range of platforms and programs has increased dramatically, with more developers from
around the world using a much wider range of characters.
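The limitation can be checked directly. In this Python sketch, the hypothetical helper `fits_in_ascii` tries to encode a string as ASCII and reports whether that succeeded; the sample strings are taken from the list of scripts above.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character has a code above 127.
        return False


for sample in ("hello", "汉字", "ελληνικά", "اردو"):
    print(sample, fits_in_ascii(sample))
```

Only the English sample can be encoded; the others raise `UnicodeEncodeError` internally and are reported as `False`.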
Unicode
• The character set that is most commonly used instead is Unicode.
• Unicode also includes emoji, so every emoji can be used as a Unicode character. There are currently 2623 emojis
contained within Unicode.
• Each Unicode character can be encoded on a computer with three different encoding standards, which differ based
on the minimum number of bits used:
Name Description
UTF-8
The most common Unicode encoding.
Characters can use as few as 8 bits, maximizing
compatibility with ASCII. However, UTF-8 is a
variable-width encoding, expanding to
16, 24, or 32 bits for characters outside
the ASCII range.
UTF-16
Like UTF-8, UTF-16 is a variable-width
encoding: characters use 16 bits and can expand to 32 bits.
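As a rough illustration of variable-width encoding, the following Python sketch measures how many bits a few sample characters occupy in UTF-8 and UTF-16. The helper name `encoded_bits` and the sample characters are chosen only for this example.

```python
def encoded_bits(ch: str) -> tuple:
    """Return (UTF-8 bits, UTF-16 bits) needed for one character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # explicit byte order, so no BOM is added
    return len(utf8) * 8, len(utf16) * 8


for ch in ("A", "é", "汉", "😀"):
    print(f"U+{ord(ch):04X}", encoded_bits(ch))
# U+0041 (8, 16)
# U+00E9 (16, 16)
# U+6C49 (24, 16)
# U+1F600 (32, 32)
```

The ASCII letter needs only 8 bits in UTF-8, while the emoji needs 32 bits in both encodings.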