Understanding Character Encoding
Last Updated :
01 May, 2024
Ever imagined how a computer is able to understand and display what you have written? Ever wondered what a UTF-8 or UTF-16 meant when you were going through some configurations? Just think about how "HeLLo WorlD" should be interpreted by a computer. We all know that a computer stores data in bits and bytes. So, to display a character on screen or map the character as a byte in memory of the computer needs to have a standard. Read the following :
\x48\x65\x4C\x4C\x6F\x20\x57\x6F\x72\x6C\x44
This is something a memory would show you. How do you know what character each memory byte specifies?
Here comes character encoding into the picture:
If you have already not guessed it - Its "HeLLo WorlD" in UTF-8 for you. And yes, we will go ahead and read about UTF-8. But let's start with ASCII. Most of you who have done programming or worked with strings must have known what ASCII is. If you haven't then let’s define what ASCII is.
ASCII:
ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as 'a' or '@' or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose.
Just look at the following -
Hexadecimal | Decimal | Character |
---|
\x48 | 72 | H |
\x65 | 101 | e |
\x4c | 76 | L |
And so on. You can look at the ASCII table and mapping at http://www.asciitable.com/. If you have not already looked at the table, I will recommend that you do it now! You will observe that these are a simple set of English words and punctuations.
Now Suppose I want to write the below characters:
A B?@
This will be interpreted by my decoder as
0x410x0a0x200x420x3f0x40
in hex and
065010032066063064
in decimal, where even a space (0x20) and a next line (0x0a) has a byte value or a memory space.
Different countries, languages but the need that brought them together
Today internet has made the world come close together. And the people all over the world do not speak just English, right? There came a need to expand this space. If you have created an application and you see that people in France want to use it as you see a high potential there. Wouldn't it be nice to just have a change in language but having the same functionality?
Why not create a Universal Code in short Unicode for everyone ??
So, here came the Unicode with a really good idea. It assigned every character, including different languages, a unique number called Code Point. One advantage of Unicode over other possible sets is that its first 256 code points are identical to ASCII. So for a software/browser it is easier to encode and decode characters of majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded. Unicode also is a character set (not an encoding). It uses the same characters like the ASCII standard, but it extends the list with additional characters, which gives each character a Code point. It has the ambition to contain all characters (and popular icons) used in the entire world.
Before knowing these let us get a few terminologies straight:
- A character is a minimal unit of text that has semantic value.
- A character set is a collection of characters that might be used by multiple languages. Example: The Latin character set is used by English and most European languages, though the Greek character set is used only by the Greek language.
- A coded character set is a character set in which each character corresponds to a unique number.
- A code point of a coded character set is any legal value in the character set.
- A code unit is a bit sequence used to encode each character of a repertoire within a given encoding form.
Ever wondered what is UTF-8 or UTF-16??
UTF-8:
UTF-8 has truly been the dominant character encoding for the World Wide Web since 2009, and as of June 2017 accounts for 89.4% of all Web pages. UTF-8 encodes each of the 1,112,064 valid code points in Unicode using one to four 8-bit bytes. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. So how many bytes give access to what characters in these encodings?
UTF-8:
1 byte: Standard ASCII 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian) 3 bytes: BMP 4 bytes: All Unicode characters
UTF-16:
2 bytes: BMP 4 bytes: All Unicode characters
So I did make a mention about BMP. What is it exactly?
Basic Multilingual Plane (BMP) contains characters for almost all modern languages, and a large number of symbols. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing. UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way on how to encode them. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit to indicate that a 2nd byte will follow. UTF-16 uses 16-bit by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values. UTF-32 is opposite, it uses the most memory (each character is a fixed 4 bytes wide), which makes it quite bloated but now in this scenario every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.This is how it eases to accommodate the entire character set for different languages and help people spread their applications or information to the world just coding/writing in their language rest all is taken care by the Decoder. As this being just the beginning into the world of Character Encoding. I hope this helps you understand Character encoding at a higher level.
Similar Reads
Python Web Scraping Tutorial Web scraping is the process of extracting data from websites automatically. Python is widely used for web scraping because of its easy syntax and powerful libraries like BeautifulSoup, Scrapy, and Selenium. In this tutorial, you'll learn how to use these Python tools to scrape data from websites and
10 min read
Introduction to Web Scraping
Basics of Web Scraping
HTML BasicsHTML (HyperText Markup Language) is the standard markup language used to create and structure web pages. It defines the layout of a webpage using elements and tags, allowing for the display of text, images, links, and multimedia content. As the foundation of nearly all websites, HTML is used in over
7 min read
Tags vs Elements vs Attributes in HTMLIn HTML, tags represent the structural components of a document, such as <h1> for headings. Elements are formed by tags and encompass both the opening and closing tags along with the content. Attributes provide additional information or properties to elements, enhancing their functionality or
2 min read
CSS IntroductionCSS (Cascading Style Sheets) is a language designed to simplify the process of making web pages presentable.It allows you to apply styles to HTML documents by prescribing colors, fonts, spacing, and positioning.The main advantages are the separation of content (in HTML) and styling (in CSS) and the
4 min read
CSS SyntaxCSS is written as a rule set, which consists of a selector and a declaration block. The basic syntax of CSS is as follows:The selector is a targeted HTML element or elements to which we have to apply styling.The Declaration Block or " { } " is a block in which we write our CSS.HTML<html> <h
2 min read
JavaScript Cheat Sheet - A Basic Guide to JavaScriptJavaScript is a lightweight, open, and cross-platform programming language. It is omnipresent in modern development and is used by programmers across the world to create dynamic and interactive web content like applications and browsersJavaScript (JS) is a versatile, high-level programming language
15+ min read
Setting Up the Environment
Extracting Data from Web Pages
Fetching Web Pages
HTTP Request Methods
Searching and Extract for specific tags Beautifulsoup
Scrapy Basics
Scrapy - Command Line ToolsPrerequisite: Implementing Web Scraping in Python with Scrapy Scrapy is a python library that is used for web scraping and searching the contents throughout the web. It uses Spiders which crawls throughout the page to find out the content specified in the selectors. Hence, it is a very handy tool to
5 min read
Scrapy - Item LoadersIn this article, we are going to discuss Item Loaders in Scrapy. Scrapy is used for extracting data, using spiders, that crawl through the website. The obtained data can also be processed, in the form, of Scrapy Items. The Item Loaders play a significant role, in parsing the data, before populating
15+ min read
Scrapy - Item PipelineScrapy is a web scraping library that is used to scrape, parse and collect web data. For all these functions we are having a pipelines.py file which is used to handle scraped data through various components (known as class) which are executed sequentially. In this article, we will be learning throug
10 min read
Scrapy - SelectorsScrapy Selectors as the name suggest are used to select some things. If we talk of CSS, then there are also selectors present that are used to select and apply CSS effects to HTML tags and text. In Scrapy we are using selectors to mention the part of the website which is to be scraped by our spiders
7 min read
Scrapy - ShellScrapy is a well-organized framework, used for large-scale web scraping. Using selectors, like XPath or CSS expressions, one can scrape data seamlessly. It allows systematic crawling, and scraping the data, and storing the content in different file formats. Scrapy comes equipped with a shell, that h
9 min read
Scrapy - SpidersScrapy is a free and open-source web-crawling framework which is written purely in python. Thus, scrapy can be installed and imported like any other python package. The name of the package is self-explanatory. It is derived from the word 'scraping' which literally means extracting desired substance
11 min read
Scrapy - Feed exportsScrapy is a fast high-level web crawling and scraping framework written in Python used to crawl websites and extract structured data from their pages. It can be used for many purposes, from data mining to monitoring and automated testing. This article is divided into 2 sections:Creating a Simple web
5 min read
Scrapy - Link ExtractorsIn this article, we are going to learn about Link Extractors in scrapy. "LinkExtractor" is a class provided by scrapy to extract links from the response we get while fetching a website. They are very easy to use which we'll see in the below post. Scrapy - Link Extractors Basically using the "LinkEx
5 min read
Scrapy - SettingsScrapy is an open-source tool built with Python Framework. It presents us with a strong and robust web crawling framework that can easily extract the info from the online page with the assistance of selectors supported by XPath. We can define the behavior of Scrapy components with the help of Scrapy
7 min read
Scrapy - Sending an E-mailPrerequisites: Scrapy Scrapy provides its own facility for sending e-mails which is extremely easy to use, and itâs implemented using Twisted non-blocking IO, to avoid interfering with the non-blocking IO of the crawler. This article discusses how mail can be sent using scrapy. For this MailSender
2 min read
Scrapy - ExceptionsPython-based Scrapy is a robust and adaptable web scraping platform. It provides a variety of tools for systematic, effective data extraction from websites. It helps us to automate data extraction from numerous websites. Scrapy Python Scrapy describes the spider that browses websites and gathers dat
7 min read
Selenium Python Basics