Chapter 3
Tokenization may be defined as the process of breaking a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
NLP is used to build applications such as sentiment analysis, question-answering systems, language translation, smart chatbots, and voice systems. To build them, it is vital to understand the patterns in the text, and the tokens mentioned above are very useful for finding and understanding those patterns. Tokenization can therefore be considered the base step for other recipes such as stemming and lemmatization, as the sketch below illustrates.
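As a quick illustrative sketch (not part of the original example set), the tokens produced by word_tokenize can be fed directly into a stemmer such as NLTK's PorterStemmer; the sample sentence here is only illustrative.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Tokenize first, then stem each token; tokenization is the base step.
tokens = word_tokenize("The runners were running quickly.")
stemmer = PorterStemmer()
print([stemmer.stem(token) for token in tokens])
# Expected output (approximately): ['the', 'runner', 'were', 'run', 'quickli', '.']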
NLTK package
nltk.tokenize is the package provided by the NLTK module to carry out tokenization.
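One practical note (an assumption about a typical setup, not stated in the original text): word_tokenize and sent_tokenize rely on NLTK's pre-trained Punkt models, which may need to be downloaded once.
import nltk

# Download the Punkt tokenizer models once, if they are not already installed.
# Depending on the NLTK version, the resource is named 'punkt' or 'punkt_tab'.
nltk.download('punkt')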
word_tokenize module
The word_tokenize module is used for basic word tokenization. The following example uses this module to split a sentence into words.
Example
import nltk
from nltk.tokenize import word_tokenize
print(word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.'))
Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']
TreebankWordTokenizer Class
The word_tokenize module used above is basically a wrapper function that calls the tokenize() method on an instance of the TreebankWordTokenizer class. It gives the same output as we get when using the word_tokenize() module to split sentences into words. Let us see the same example implemented with that class −
Example
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −
from nltk.tokenize import TreebankWordTokenizer
Next, create an instance of the TreebankWordTokenizer class as follows −
Tokenizer_wrd = TreebankWordTokenizer()
Now, input the sentence you want to convert to tokens −
Tokenizer_wrd.tokenize(
'Tutorialspoint.com provides high quality technical tutorials for free.'
)
Output
[
'Tutorialspoint.com', 'provides', 'high', 'quality',
'technical', 'tutorials', 'for', 'free', '.'
]
One of the most important conventions of a word tokenizer is how it handles contractions. For example, if we apply word_tokenize() to the contraction won't, it is split into two tokens −
Example
import nltk
from nltk.tokenize import word_tokenize
print(word_tokenize("won't"))
Output
['wo', "n't"]
WordPunctTokenizer Class
WordPunctTokenizer is an alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with the following simple example −
Example
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize(" I can't allow you to go home early"))
Output
['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
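To make the contrast explicit, the sketch below (not from the original text) runs word_tokenize and WordPunctTokenizer on the same sentence; word_tokenize keeps the contraction suffix together as n't, while WordPunctTokenizer splits the apostrophe out as its own token.
from nltk.tokenize import word_tokenize, WordPunctTokenizer

sentence = "I can't allow you to go home early"
# word_tokenize (Treebank-style) keeps the contraction suffix together.
print(word_tokenize(sentence))
# ['I', 'ca', "n't", 'allow', 'you', 'to', 'go', 'home', 'early']
# WordPunctTokenizer separates every punctuation character.
print(WordPunctTokenizer().tokenize(sentence))
# ['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']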
Sentence tokenization − why is it needed?
An obvious question that may come to mind is: when we already have a word tokenizer, why do we need a sentence tokenizer, i.e., why do we need to tokenize text into sentences? Suppose we need to count the average number of words per sentence. How can we do this? To accomplish this task, we need both sentence tokenization and word tokenization, as shown in the sketch below.
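As a hedged sketch of the task just described (the sample text and numbers are only illustrative), the average number of words per sentence can be computed by combining sent_tokenize and word_tokenize:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization is useful. It is the base step for many NLP recipes."
sentences = sent_tokenize(text)
# Count the word tokens in each sentence, then average over the sentences.
words_per_sentence = [len(word_tokenize(s)) for s in sentences]
print(sum(words_per_sentence) / len(sentences))
# For this sample text: (4 + 10) / 2 = 7.0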
Let us understand the difference between the sentence and word tokenizers with the help of the following simple example −
Example
import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. It is
going to be a simple example."
print(sent_tokenize(text))
Output
['Let us understand the difference between sentence & word tokenizer.', 'It is going to be a simple example.']
Tokenization can also be done with regular expressions using the RegexpTokenizer class. In the first example below, the pattern matches alphanumeric characters and single quotes, so contractions such as won't are kept together as single tokens −
Example 1
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("[\w']+")
print(tokenizer.tokenize("won't is a contraction."))
print(tokenizer.tokenize("can't is a contraction."))
Output
["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']
In the following example, we will use a regular expression to tokenize on whitespace.
Example 2
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = True)
print(tokenizer.tokenize("won't is a contraction."))
Output
["won't", 'is', 'a', 'contraction.']
From the above output, we can see that the punctuation remains in the tokens. The parameter gaps = True means the pattern is used to identify the gaps to tokenize on. On the other hand, if we use the gaps = False parameter, the pattern is used to identify the tokens themselves, as can be seen in the following example −
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = False)
print(tokenizer.tokenize("won't is a contraction."))
Output
[' ', ' ', ' ']
Here the tokenizer returns only the whitespace matches themselves, because with gaps = False the pattern identifies the tokens rather than the separators.
Reference
https://www.tutorialspoint.com/natural_language_toolkit/natural_language_toolkit_tokenizing_text.htm