Determine a Character's Unicode Block in Java



In this article, we will learn how to determine the Unicode block that contains a given character in Java. Unicode provides a standardized way to represent characters from writing systems across the world. Every character falls within a Unicode block, and blocks categorize characters by script, symbol set, or other purpose.

Understanding Unicode Blocks

A Unicode block is a contiguous, named range of code points, usually grouping characters that belong to the same script or serve a similar purpose.
For example:

  • Basic Latin (U+0000 to U+007F) contains English letters and symbols.
  • CJK Unified Ideographs (U+4E00 to U+9FFF) contains Chinese, Japanese, and Korean characters.
  • Arabic (U+0600 to U+06FF) contains Arabic script characters.
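
As a quick check of the ranges listed above, the following sketch looks up the first and last code point of each block. The class name BlockRangeCheck and the helper blockOf are illustrative names chosen for this example, not part of the Java API.

```java
class BlockRangeCheck {
    // Hypothetical helper: a thin wrapper around the standard lookup.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // First and last code point of each block listed above
        int[][] ranges = {
            {0x0000, 0x007F}, // Basic Latin
            {0x4E00, 0x9FFF}, // CJK Unified Ideographs
            {0x0600, 0x06FF}, // Arabic
        };
        for (int[] r : ranges) {
            System.out.printf("U+%04X -> %s, U+%04X -> %s%n",
                    r[0], blockOf(r[0]), r[1], blockOf(r[1]));
        }
    }
}
```

Both ends of each range should report the same block, which confirms that a block is a single unbroken range of code points.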

Different Approaches

The following are two approaches to determining the Unicode block that contains a given character in Java:

Using Character.UnicodeBlock.of(char ch)

To determine a character's Unicode block, use the Character.UnicodeBlock.of() method. It has two overloads, of(char ch) and of(int codePoint). Each returns the Character.UnicodeBlock object representing the block that contains the given character, or null if the character is not a member of a defined block.
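
Because the lookup can return null, and the int overload throws IllegalArgumentException for values that are not valid code points, callers may want to guard against both cases. The sketch below shows one way to do that. The class SafeBlockLookup, the helper describe, and the placeholder strings it returns are assumptions made for this example. The code point U+2FE0 is used on the assumption that it falls in a gap between blocks, and the result may vary with the Unicode version your JDK supports.

```java
class SafeBlockLookup {
    // Hypothetical helper: never returns null and never throws for bad input.
    static String describe(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            // Character.UnicodeBlock.of(int) would throw IllegalArgumentException here
            return "INVALID_CODE_POINT";
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? "NO_BLOCK" : block.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));      // a code point inside a defined block
        System.out.println(describe(0x2FE0));   // assumed to lie in a gap between blocks
        System.out.println(describe(0x110000)); // beyond Character.MAX_CODE_POINT
    }
}
```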

The example below uses the Character.UnicodeBlock.of() method as follows:

  • The program looks up the Unicode block of '\u5639' (a CJK ideograph) and prints CJK_UNIFIED_IDEOGRAPHS.
  • It prints the Unicode blocks of other characters: a space (BASIC_LATIN), an arrow (ARROWS), and a code point in the Arabic block (ARABIC).
  • Character.UnicodeBlock.of() returns the block object, and printing that object shows the block's name.

Character.UnicodeBlock block = Character.UnicodeBlock.of(ch);

Example

Below is an example that shows how to find the Unicode block containing a given character using the Character.UnicodeBlock.of() method:

public class Demo {
   public static void main(String[] args) {
      // U+5639 lies in the CJK Unified Ideographs block
      char ch = '\u5639';
      System.out.println(ch);
      Character.UnicodeBlock block = Character.UnicodeBlock.of(ch);
      System.out.println(block);
      // A space belongs to Basic Latin, U+21AC to Arrows
      System.out.println(Character.UnicodeBlock.of(' '));
      System.out.println(Character.UnicodeBlock.of('\u21ac'));
      // The int overload takes a code point: 1565 is U+061D, in the Arabic block
      System.out.println(Character.UnicodeBlock.of(1565));
   }
}

Time Complexity: O(1), since each lookup is made against a fixed set of blocks and does not depend on input size.
Space Complexity: O(1), since only a few variables are used.

Output

?
CJK_UNIFIED_IDEOGRAPHS
BASIC_LATIN
ARROWS
ARABIC

Using Unicode Code Points

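The char type covers only the Basic Multilingual Plane (BMP), U+0000 to U+FFFF. Characters outside the BMP (supplementary characters) are stored in a String as a surrogate pair of two char values, so we work with int code points instead of char. The String.codePointAt() method returns the full code point at a given index, combining the surrogate pair when one is present.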
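
To illustrate why char is not enough here, the sketch below compares a lookup on the first char of a supplementary character with a lookup on its full code point. The class and method names are illustrative, and U+1D11E (the musical G clef) is simply a sample character chosen for this example.

```java
class SurrogateDemo {
    // Hypothetical helper: block of the first char unit only.
    static Character.UnicodeBlock blockOfFirstChar(String s) {
        return Character.UnicodeBlock.of(s.charAt(0));
    }

    // Hypothetical helper: block of the first full code point.
    static Character.UnicodeBlock blockOfFirstCodePoint(String s) {
        return Character.UnicodeBlock.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // U+1D11E written as its surrogate pair
        String clef = "\uD834\uDD1E";

        System.out.println(clef.length());               // 2 char units for one character
        System.out.println(blockOfFirstChar(clef));      // HIGH_SURROGATES
        System.out.println(blockOfFirstCodePoint(clef)); // MUSICAL_SYMBOLS
    }
}
```

The char-based lookup sees only the high surrogate and reports the surrogate block, while the code point lookup reports the block of the actual character.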

The example below uses Unicode code points as follows:

  • The program determines the Unicode block of a musical symbol that lies outside the BMP and is therefore stored as a surrogate pair.
  • text.codePointAt(0) retrieves the full Unicode code point.
  • Character.UnicodeBlock.of(codePoint) then identifies the correct block.

int codePoint = text.codePointAt(0); // Get Unicode code point

Example

Below is an example that finds the Unicode block containing a given character using Unicode code points:

public class UnicodeBlockFinder {
    public static void main(String[] args) {
        String text = "\uD834\uDD1E"; // A musical symbol, here U+1D11E (G clef) as an example, written as a surrogate pair
        int codePoint = text.codePointAt(0); // Get Unicode code point
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        
        System.out.println("Character: " + text);
        System.out.println("Unicode Block: " + block);
    }
}

Output

Character: ?
Unicode Block: MUSICAL_SYMBOLS

Time Complexity: O(1), since retrieving the code point and looking up its block are both constant-time operations.
Space Complexity: O(1), since the program stores a single short string and an integer.
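
The same idea extends to whole strings. As a minimal sketch, assuming we want one block per character in order of appearance, String.codePoints() can be used to walk the text by code point rather than by char. The class BlocksInString and the helper blocksOf are illustrative names, not part of the article's examples.

```java
import java.util.ArrayList;
import java.util.List;

class BlocksInString {
    // Hypothetical helper: one block per code point, in order of appearance.
    static List<Character.UnicodeBlock> blocksOf(String text) {
        List<Character.UnicodeBlock> blocks = new ArrayList<>();
        // codePoints() yields full code points, so surrogate pairs count once
        text.codePoints().forEach(cp -> blocks.add(Character.UnicodeBlock.of(cp)));
        return blocks;
    }

    public static void main(String[] args) {
        // 'A', an arrow (U+21AC), and U+1D11E written as a surrogate pair
        String text = "A\u21AC\uD834\uDD1E";
        System.out.println(blocksOf(text)); // [BASIC_LATIN, ARROWS, MUSICAL_SYMBOLS]
    }
}
```

Unlike the single lookups above, this takes time proportional to the number of code points in the string.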

Conclusion

Unicode blocks help categorize characters based on their linguistic, symbolic, or script-based properties. We covered two approaches. The first, Character.UnicodeBlock.of(char ch), retrieves the Unicode block of a given character and works well for characters within the Basic Multilingual Plane (BMP). The second uses Unicode code points: for characters outside the BMP, we call codePointAt() to read the full code point from its surrogate pair, then pass it to Character.UnicodeBlock.of(int codePoint) to determine the block.

Alshifa Hasnain

Updated on: 2025-02-14T18:59:17+05:30
