Python - Filtering text using Enchant

Last Updated : 26 May, 2020

Enchant is a Python module used to check the spelling of a word, suggest corrections, and test whether a word exists in a dictionary. Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing means splitting individual words out of the body of a text. At times, however, not every word should be tokenized: when spell checking, for example, it is customary to ignore email addresses and URLs. This can be achieved by modifying the tokenization process with filters. The currently implemented filters are:

- EmailFilter
- URLFilter
- WikiWordFilter

Example 1 : EmailFilter

```python
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter

# the text to be tokenized
text = "The email is abc@gmail.com"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print the tokens without filtering
print("Printing tokens without filtering:")
token_list = list(tokenizer(text))
print(token_list)

# get a tokenizer with the EmailFilter applied
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])

# print the tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = list(tokenizer_filter(text))
print(token_list_filter)
```

Output :

```
Printing tokens without filtering:
[('The', 0), ('email', 4), ('is', 10), ('abc', 13), ('gmail', 17), ('com', 23)]

Printing tokens after filtering:
[('The', 0), ('email', 4), ('is', 10)]
```

Example 2 : URLFilter

```python
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter

# the text to be tokenized
text = "This is an URL: https://www.geeksforgeeks.org/"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print the tokens without filtering
print("Printing tokens without filtering:")
token_list = list(tokenizer(text))
print(token_list)

# get a tokenizer with the URLFilter applied
tokenizer_filter = get_tokenizer("en_US", [URLFilter])

# print the tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = list(tokenizer_filter(text))
print(token_list_filter)
```

Output :

```
Printing tokens without filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11), ('https', 16), ('www', 24), ('geeksforgeeks', 28), ('org', 42)]

Printing tokens after filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11)]
```

Example 3 : WikiWordFilter

A WikiWord is a word made up of two or more words run together, each starting with a capital letter (for example, VersionFiveDotThree).

```python
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter

# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"

# get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# print the tokens without filtering
print("Printing tokens without filtering:")
token_list = list(tokenizer(text))
print(token_list)

# get a tokenizer with the WikiWordFilter applied
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])

# print the tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = list(tokenizer_filter(text))
print(token_list_filter)
```

Output :

```
Printing tokens without filtering:
[('VersionFiveDotThree', 0), ('is', 20), ('an', 23), ('example', 26), ('of', 34), ('WikiWord', 37)]

Printing tokens after filtering:
[('is', 20), ('an', 23), ('example', 26), ('of', 34)]
```
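To make the WikiWord idea concrete, here is a minimal, self-contained sketch of the kind of pattern such a filter can match. The regex and the `is_wikiword` helper are illustrative assumptions for this article, not PyEnchant's actual internal implementation:

```python
import re

# Illustrative WikiWord pattern: an initial capital, one or more word
# characters, then at least one more capital followed by word characters.
# (Assumption: PyEnchant's internal pattern may differ in detail.)
WIKIWORD_RE = re.compile(r"^[A-Z]\w+[A-Z]+\w+$")

def is_wikiword(token):
    """Return True if the token looks like a WikiWord."""
    return bool(WIKIWORD_RE.match(token))

print(is_wikiword("VersionFiveDotThree"))  # True
print(is_wikiword("example"))              # False
```

A token that matches this pattern is skipped by the filter, which is why 'VersionFiveDotThree' and 'WikiWord' disappear from the filtered token list above.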
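Filters can also be stacked by passing several of them in the list, e.g. get_tokenizer("en_US", [EmailFilter, URLFilter, WikiWordFilter]). Since running that requires an Enchant backend, here is a self-contained sketch of the underlying idea: blank out the filtered regions first, then tokenize what remains so word offsets are preserved. The tokenizer, the regexes, and the function name are illustrative assumptions, not Enchant's API:

```python
import re

# Hypothetical stand-ins for the regions an EmailFilter/URLFilter would skip.
EMAIL = re.compile(r"\S+@\S+\.\S+")
URL = re.compile(r"https?://\S+")

def tokenize(text, filters=()):
    # Blank out each filtered region with spaces of the same length,
    # so the offsets of the remaining words are unchanged.
    for pat in filters:
        text = pat.sub(lambda m: " " * len(m.group()), text)
    # Yield (word, offset) pairs, mimicking the shape of enchant's tokens.
    return [(m.group(), m.start()) for m in re.finditer(r"[A-Za-z]+", text)]

print(tokenize("Mail me@example.com or visit https://example.org today",
               filters=[EMAIL, URL]))
# -> [('Mail', 0), ('or', 20), ('visit', 23), ('today', 49)]
```

The key design point, which the real filters share, is that filtering happens before word splitting: the email and URL are removed as whole regions, so no stray fragments like 'gmail' or 'org' survive into the token list.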