
Determine Unicode Code Point at Given Index in Python
A Unicode code point is a unique number that identifies a character in the Unicode character set. Unicode is a character encoding standard that aims to assign a unique code to every character used in the world's writing systems. It defines more than 140,000 characters, including letters, symbols, and emojis. We can determine the Unicode code point at a specific index using the ord() function, the codecs module, the unicodedata module, and the array module in Python. In this article, we will discuss how to determine the Unicode code point at a given index using each of these methods.
Unicode Code Point
In Unicode, every character is identified by a unique number called its code point. A code point is conventionally written in hexadecimal notation with a "U+" prefix followed by four to six hexadecimal digits, for example U+0065 for the letter 'e'. Valid code points range from U+0000 to U+10FFFF.
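To make the notation concrete, here is a minimal sketch that formats an integer code point in "U+" notation. The helper name format_code_point is a hypothetical name chosen for illustration; it is not part of Python's standard library.

```python
def format_code_point(code_point: int) -> str:
    """Return the conventional U+XXXX notation for an integer code point."""
    # Valid Unicode code points lie in the range 0 to 0x10FFFF.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"{code_point!r} is outside the Unicode range")
    # Pad to at least four hex digits; larger values use five or six digits.
    return f"U+{code_point:04X}"


print(format_code_point(ord("e")))   # U+0065
print(format_code_point(ord("€")))   # U+20AC
print(format_code_point(0x1F600))    # U+1F600
print(chr(0x1F600) == "😀")          # True
```

The built-in chr() function is the inverse of ord(): it maps an integer code point back to the corresponding character.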
Python Program to Determine Unicode Code Point
Method 1: Using the ord() function.
We can use the built-in ord() function in Python to get the Unicode code point of the character at a given index. The ord() function takes a single character as an argument and returns the Unicode code point for that character as an integer.
Syntax
code_point = ord(string[index])
Here, string[index] selects the single character at the given index, and ord() returns that character's Unicode code point as an integer. Passing a string that contains more than one character to ord() raises a TypeError.
Example
In the below example, we first get the character at a specific index in the string and then pass that character to the ord() function in Python to get the Unicode code point of that character.
# Get the Unicode code point at a given index
def get_unicode_code_point(string, index):
    char = string[index]
    code_point = ord(char)
    return code_point

# Test the function
string = "Hello, World!"
index = 1
code_point = get_unicode_code_point(string, index)
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")
Output
The Unicode code point of the character 'e' at index 1 is U+0065.
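Because Python 3 strings are indexed by code point, the same approach also works for characters outside the Basic Multilingual Plane, such as emojis. The sketch below illustrates this and adds an explicit range check. The helper name code_point_at is hypothetical and used only for illustration.

```python
def code_point_at(string: str, index: int) -> int:
    """Return the code point at `index`, with a clearer error for bad indexes."""
    if not -len(string) <= index < len(string):
        raise IndexError(
            f"index {index} is out of range for a string of length {len(string)}"
        )
    return ord(string[index])


text = "hi 😀"
print(f"U+{code_point_at(text, 3):04X}")          # U+1F600

# "é" written as "e" plus a combining acute accent is two code points.
decomposed = "e\u0301"
print(len(decomposed))                            # 2
print([f"U+{ord(c):04X}" for c in decomposed])    # ['U+0065', 'U+0301']
```

Note that what a reader sees as one character may span several code points, so an index always refers to a single code point rather than to a visible character.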
Method 2: Using the codecs module
The codecs module provides a function called codecs.encode() that encodes a string in a specified encoding. UTF-8 is not suitable for this purpose, because its bytes only match the code point for ASCII characters. Instead, we can encode the single character at the given index as big-endian UTF-32, which stores every code point as one 4-byte integer, and then convert those bytes to an integer using int.from_bytes().
Syntax
import codecs
code_point = int.from_bytes(codecs.encode(string[index], 'utf-32-be'), byteorder='big')
Here, we use the codecs.encode() function to encode the character at the given index in the UTF-32BE encoding, which returns a bytes object of length four. We convert these bytes to an integer using the int.from_bytes() method with a byte order of 'big' to get the Unicode code point of the character.
Example
In the example below, we first get the character at index 1 of the string "Hello, World!" and encode it in the UTF-32BE encoding using codecs.encode(), storing the resulting bytes in the byte_string variable. We then use the int.from_bytes() method to convert the bytes to an integer and print the Unicode code point in hexadecimal notation with a "U+" prefix using a formatted string literal.
import codecs

string = "Hello, World!"
index = 1
char = string[index]
# UTF-32BE stores each code point as a single 4-byte big-endian integer
byte_string = codecs.encode(char, 'utf-32-be')
code_point = int.from_bytes(byte_string, byteorder='big')
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")
Output
The Unicode code point of the character 'e' at index 1 is U+0065.
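One caveat is worth illustrating: the bytes of a UTF-8 encoding equal the code point only for ASCII characters. The sketch below compares the UTF-8 bytes, read as an integer, with the code point recovered from a UTF-32 encoding. The helper name code_point_via_utf32 is hypothetical, and the choice of UTF-32BE is an assumption made for this illustration.

```python
import codecs


def code_point_via_utf32(char: str) -> int:
    """Return the code point of a single character via its UTF-32 encoding."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    # UTF-32 (big-endian, no BOM) stores each code point as one 4-byte integer.
    return int.from_bytes(codecs.encode(char, "utf-32-be"), byteorder="big")


for char in ("e", "é", "😀"):
    # Reading the UTF-8 bytes as an integer only matches for ASCII characters.
    utf8_value = int.from_bytes(char.encode("utf-8"), byteorder="big")
    code_point = code_point_via_utf32(char)
    print(f"{char}: UTF-8 bytes as int = {utf8_value:#x}, code point = U+{code_point:04X}")
```

For 'e' both values are 0x65, but for 'é' the UTF-8 bytes read as 0xc3a9 while the code point is U+00E9.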
Method 3: Using the unicodedata module
The unicodedata module provides a function called unicodedata.name() that returns the official name of a Unicode character, and a function called unicodedata.lookup() that returns the character for a given name. The module has no function that returns a code point directly, so we look the character up by its name and then pass the result to ord().
Syntax
import unicodedata
name = unicodedata.name(string[index])
code_point = ord(unicodedata.lookup(name))
Here, we first get the name of the character at the specified index of the string using unicodedata.name(). We then pass that name to unicodedata.lookup(), which returns the matching character, and use the built-in ord() function to get its Unicode code point. Note that unicodedata.name() raises a ValueError for characters that have no name, such as control characters, unless a default value is supplied as the second argument. No extra logic is needed for combining characters (characters that modify the appearance of the preceding character, such as an accent mark): in Python 3, each combining character is a separate element of the string with its own code point.
Example
In the example below, we use the unicodedata module to get the name of the character 'e' at index 1 of the string "Hello, World!" using the unicodedata.name() function. We then look the character up by that name using unicodedata.lookup(), get its code point using ord(), and use a formatted string literal (f-string) to print the code point in hexadecimal notation with a "U+" prefix.
import unicodedata

string = "Hello, World!"
index = 1
char = string[index]
name = unicodedata.name(char)  # 'LATIN SMALL LETTER E'
code_point = ord(unicodedata.lookup(name))
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")
Output
The Unicode code point of the character 'e' at index 1 is U+0065.
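The unicodedata functions are most useful for pairing a code point with its official name. The following is a minimal sketch of that idea; the helper name describe and the "<unnamed>" placeholder are hypothetical choices made for illustration.

```python
import unicodedata


def describe(char: str) -> str:
    """Return 'U+XXXX NAME' for a character, tolerating unnamed characters."""
    # unicodedata.name() raises ValueError for unnamed characters (for example
    # control characters) unless a default is supplied as the second argument.
    name = unicodedata.name(char, "<unnamed>")
    return f"U+{ord(char):04X} {name}"


print(describe("e"))    # U+0065 LATIN SMALL LETTER E
print(describe("€"))    # U+20AC EURO SIGN
print(describe("\n"))   # U+000A <unnamed>

# lookup() is the inverse of name(): it maps a name back to the character.
print(ord(unicodedata.lookup("LATIN SMALL LETTER E")) == ord("e"))  # True
```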
Method 4: Using array Module
The array module provides a class called array.array() that can be used to create arrays of a specified type. We can create an array of unsigned integers and append the Unicode code point of each character in the string to the array. We can then access the Unicode code point of the character at a given index by indexing into the array.
Syntax
import array
code_points = array.array('I', [ord(char) for char in string])
code_point = code_points[index]
Here, we use a list comprehension to compute the Unicode code point of every character in the string with ord() and store the results in an array of unsigned integers (type code 'I'). Indexing into the array at the given index then returns the Unicode code point of the character at that position.
Example
In the example below, we use the array module to create an array of unsigned integers using array.array(). A list comprehension supplies the Unicode code point of each character in the string "Hello, World!" as the initial contents of the array. We then index into the array to get the Unicode code point of the character at index 1 and use a formatted string literal (f-string) to print the code point in hexadecimal notation with a "U+" prefix.
import array

string = "Hello, World!"
index = 1
# Build an array of unsigned integers holding one code point per character
code_points = array.array('I', [ord(char) for char in string])
code_point = code_points[index]
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")
Output
The Unicode code point of the character 'e' at index 1 is U+0065.
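An array of code points is mainly worthwhile when many lookups are made against the same string. The sketch below wraps the idea in a function and checks the element width first, since code points up to U+10FFFF need more than two bytes. The helper name code_point_array and the fallback to type code 'L' are assumptions made for this illustration.

```python
import array


def code_point_array(string: str) -> array.array:
    """Store the code points of `string` in a compact unsigned-integer array."""
    # Assumption: prefer 'I' when it is at least 4 bytes wide, otherwise fall
    # back to 'L', which the array module documents as at least 4 bytes.
    typecode = "I" if array.array("I").itemsize >= 4 else "L"
    return array.array(typecode, (ord(char) for char in string))


code_points = code_point_array("Hi 😀")
print(code_points[1] == ord("i"))     # True
print(f"U+{code_points[3]:04X}")      # U+1F600
print(len(code_points))               # 4
```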
Conclusion
In this article, we discussed how to determine the Unicode code point at a given index in a string. The built-in ord() function returns the code point of a character directly, and the codecs, unicodedata, and array modules offer alternative ways to obtain or store the same value. A Unicode code point is the unique number assigned to each character in the Unicode standard.