Python - Filtering text using Enchant

Last Updated : 26 May, 2020

Enchant is a Python module used to check the spelling of a word, suggest corrections, and test whether a word exists in a dictionary. Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing means splitting individual words out of the body of a text. At times, however, not every word should be tokenized: when spell checking, for example, it is customary to ignore email addresses and URLs. This can be achieved by modifying the tokenization process with filters. The currently implemented filters are:

- EmailFilter
- URLFilter
- WikiWordFilter

Example 1 : EmailFilter

```python
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter

# the text to be tokenized
text = "The email is abc@gmail.com"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print the tokens without filtering
print("Printing tokens without filtering:")
token_list = list(tokenizer(text))
print(token_list)

# get a tokenizer with the EmailFilter applied
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])

# print the tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = list(tokenizer_filter(text))
print(token_list_filter)
```

Output :

```
Printing tokens without filtering:
[('The', 0), ('email', 4), ('is', 10), ('abc', 13), ('gmail', 17), ('com', 23)]

Printing tokens after filtering:
[('The', 0), ('email', 4), ('is', 10)]
```

Example 2 : URLFilter

```python
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter

# the text to be tokenized
text = "This is an URL: https://www.geeksforgeeks.org/"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print the tokens without filtering
print("Printing tokens without filtering:")
token_list = list(tokenizer(text))
print(token_list)

# get a tokenizer with the URLFilter applied
tokenizer_filter = get_tokenizer("en_US", [URLFilter])

# print the tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = list(tokenizer_filter(text))
print(token_list_filter)
```

Output :

```
Printing tokens without filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11), ('https', 16), ('www', 24), ('geeksforgeeks', 28), ('org', 42)]

Printing tokens after filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11)]
```

Example 3 : WikiWordFilter

A WikiWord is a word made up of two or more words run together, each starting with a capital letter (for example, VersionFiveDotThree).

```python
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter

# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print the tokens without filtering
print("Printing tokens without filtering:")
token_list = list(tokenizer(text))
print(token_list)

# get a tokenizer with the WikiWordFilter applied
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])

# print the tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = list(tokenizer_filter(text))
print(token_list_filter)
```

Output :

```
Printing tokens without filtering:
[('VersionFiveDotThree', 0), ('is', 20), ('an', 23), ('example', 26), ('of', 34), ('WikiWord', 37)]

Printing tokens after filtering:
[('is', 20), ('an', 23), ('example', 26), ('of', 34)]
```
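To make the WikiWord idea concrete, here is a minimal, self-contained sketch of the kind of pattern such a filter can match. The regex and the `is_wikiword` helper are illustrative assumptions for this article, not PyEnchant's actual internal implementation:

```python
import re

# Illustrative WikiWord pattern: an initial capital, one or more word
# characters, then at least one more capital followed by word characters.
# (Assumption: PyEnchant's internal pattern may differ in detail.)
WIKIWORD_RE = re.compile(r"^[A-Z]\w+[A-Z]+\w+$")

def is_wikiword(token):
    """Return True if the token looks like a WikiWord."""
    return bool(WIKIWORD_RE.match(token))

print(is_wikiword("VersionFiveDotThree"))  # True
print(is_wikiword("example"))              # False
```

A token that matches this pattern is skipped by the filter, which is why 'VersionFiveDotThree' and 'WikiWord' disappear from the filtered token list above.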
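Filters can also be stacked by passing several of them in the list, e.g. get_tokenizer("en_US", [EmailFilter, URLFilter, WikiWordFilter]). Since running that requires an Enchant backend, here is a self-contained sketch of the underlying idea: blank out the filtered regions first, then tokenize what remains so word offsets are preserved. The tokenizer, the regexes, and the function name are illustrative assumptions, not Enchant's API:

```python
import re

# Hypothetical stand-ins for the regions an EmailFilter/URLFilter would skip.
EMAIL = re.compile(r"\S+@\S+\.\S+")
URL = re.compile(r"https?://\S+")

def tokenize(text, filters=()):
    # Blank out each filtered region with spaces of the same length,
    # so the offsets of the remaining words are unchanged.
    for pat in filters:
        text = pat.sub(lambda m: " " * len(m.group()), text)
    # Yield (word, offset) pairs, mimicking the shape of enchant's tokens.
    return [(m.group(), m.start()) for m in re.finditer(r"[A-Za-z]+", text)]

print(tokenize("Mail me@example.com or visit https://example.org today",
               filters=[EMAIL, URL]))
# -> [('Mail', 0), ('or', 20), ('visit', 23), ('today', 49)]
```

The key design point, which the real filters share, is that filtering happens before word splitting: the email and URL are removed as whole regions, so no stray fragments like 'gmail' or 'org' survive into the token list.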