Given a character sequence and a defined document unit, tokenization is the task of chopping the sequence up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. In the context of NLTK and Python, it is simply the process of putting the tokens into a list so that, instead of iterating over the text one character at a time, we can iterate over it one token at a time.
For example, given the input string:
Hi man, how have you been?
We should get the following output:
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
We can tokenize this text using the word_tokenize() function from NLTK's nltk.tokenize module, as shown below.
Example
from nltk.tokenize import word_tokenize

# word_tokenize requires the Punkt tokenizer models; if they are
# missing, download them once with: nltk.download('punkt')
my_sent = "Hi man, how have you been?"
tokens = word_tokenize(my_sent)
print(tokens)
Output
This will give the following output:
['Hi', 'man', ',', 'how', 'have', 'you', 'been', '?']
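Since the tokens are now just items in a Python list, we can iterate over them directly, for example to throw away the punctuation tokens mentioned earlier. The following is a minimal sketch of that idea; it reuses the sentence from the example above and assumes that Python's built-in string.punctuation is a good enough test for what counts as punctuation.

import string
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Hi man, how have you been?")

# Keep only the tokens that are not pure punctuation characters
words = [token for token in tokens if token not in string.punctuation]
print(words)

This prints ['Hi', 'man', 'how', 'have', 'you', 'been'], the same token list as before but with ',' and '?' filtered out.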