Python - Tokenize text using Enchant
Last Updated : 26 May, 2020

Enchant is a Python module used to check the spelling of words and to suggest corrections for misspelled words. It checks whether a word exists in a dictionary or not. Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing means splitting the individual words out of a body of text.

Some terms that will be used frequently:

Corpus – A body of text, singular. Corpora is the plural.
Lexicon – Words and their meanings.
Token – Each "entity" produced when something is split up according to rules. For example, each word is a token when a sentence is "tokenized" into words.

We will use get_tokenizer() to tokenize the text. It takes a language code as input and returns the appropriate tokenizer class. We then instantiate this class with some text, and it returns an iterator that yields the words contained in that text. The items produced by the tokenizer are tuples of the form (WORD, POS), where WORD is the tokenized word and POS is the position in the string at which that word starts.

Python3

# import the tokenizer factory
from enchant.tokenize import get_tokenizer

# the text to be tokenized
text = ("Natural language processing (NLP) is a field "
        + "of computer science, artificial intelligence "
        + "and computational linguistics concerned with "
        + "the interactions between computers and human "
        + "(natural) languages, and, in particular, "
        + "concerned with programming computers to "
        + "fruitfully process large natural language "
        + "corpora. Challenges in natural language "
        + "processing frequently involve natural "
        + "language understanding, natural language"
        + "generation frequently from formal, machine"
        + "-readable logical forms), connecting language "
        + "and machine perception, managing human-"
        + "computer dialog systems, or some combination "
        + "thereof.")

# get the tokenizer class for US English
tokenizer = get_tokenizer("en_US")

# collect the (word, position) tuples
token_list = []
for words in tokenizer(text):
    token_list.append(words)

# print the words with their positions
print(token_list)

Output :

[('Natural', 0), ('language', 8), ('processing', 17), ('NLP', 29), ('is', 34), ('a', 37), ('field', 39), ('of', 45), ('computer', 48), ('science', 57), ('artificial', 66), ('intelligence', 77), ('and', 90), ('computational', 94), ('linguistics', 108), ('concerned', 120), ('with', 130), ('the', 135), ('interactions', 139), ('between', 152), ('computers', 160), ('and', 170), ('human', 174), ('natural', 181), ('languages', 190), ('and', 201), ('in', 206), ('particular', 209), ('concerned', 221), ('with', 231), ('programming', 236), ('computers', 248), ('to', 258), ('fruitfully', 261), ('process', 272), ('large', 280), ('natural', 286), ('language', 294), ('corpora', 303), ('Challenges', 312), ('in', 323), ('natural', 326), ('language', 334), ('processing', 343), ('frequently', 354), ('involve', 365), ('natural', 373), ('language', 381), ('understanding', 390), ('natural', 405), ('languagegeneration', 413), ('frequently', 432), ('from', 443), ('formal', 448), ('machine', 456), ('readable', 464), ('logical', 473), ('forms', 481), ('connecting', 489), ('language', 500), ('and', 509), ('machine', 513), ('perception', 521), ('managing', 533), ('human', 542), ('computer', 548), ('dialog', 557), ('systems', 564), ('or', 573), ('some', 576), ('combination', 581), ('thereof', 593)]

To print only the words, without their positions:

Python3

# keep only the first element of each (word, position) tuple
word_list = []
for tokens in token_list:
    word_list.append(tokens[0])

print(word_list)

Output :

['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'languagegeneration', 'frequently', 'from', 'formal', 'machine', 'readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human', 'computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']
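To see why the POS value is useful, note that it is simply the index at which the word starts in the original string, so it can be used to slice the word back out of the text. The sketch below illustrates this with a minimal stand-in tokenizer built on Python's re module (so it runs even without pyenchant installed); it mimics only the (word, position) tuple shape of Enchant's tokenizers, not their full behavior.

```python
import re

def simple_tokenizer(text):
    # Yield (word, position) tuples with the same shape as the items
    # produced by enchant.tokenize tokenizers: the word itself and the
    # index in the original string where the word starts.
    for match in re.finditer(r"[A-Za-z']+", text):
        yield (match.group(), match.start())

text = "Natural language processing"
tokens = list(simple_tokenizer(text))
print(tokens)  # [('Natural', 0), ('language', 8), ('processing', 17)]

# The position lets us recover each word by slicing the original text
for word, pos in tokens:
    assert text[pos:pos + len(word)] == word
```

The positions printed here match the first three tuples in the article's output above, which is why storing the full (word, position) tuples can be preferable to keeping the words alone: the tuples retain enough information to locate (and, for example, highlight or replace) each word in the source text.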