Chapter 3

NLP - Tokenizing

What is Tokenizing?

Tokenizing may be defined as the process of breaking up a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
NLP is used to build applications such as sentiment analysis, QA systems, language translation, smart chatbots and voice systems, and in order to build them it is vital to understand the patterns in the text. The tokens mentioned above are very useful in finding and understanding these patterns. We can consider tokenization as the base step for other recipes such as stemming and lemmatization.
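
For instance, here is a minimal sketch of tokenization serving as the base step for stemming; the sample sentence is just illustrative −
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stemming operates on individual tokens, so the text must be
# tokenized into words before the stemmer can be applied.
tokens = word_tokenize('The strikers are striking.')
print([stemmer.stem(token) for token in tokens])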

NLTK package
nltk.tokenize is the package provided by NLTK to achieve the process of tokenization.
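
Note that NLTK's pre-trained tokenizers are shipped as separate data files, so a one-time download is needed before the examples below will run; a minimal setup, assuming an internet connection, looks like this −
import nltk

# word_tokenize and sent_tokenize rely on the pre-trained Punkt models,
# which are distributed separately from the nltk package itself.
nltk.download('punkt')
On recent NLTK versions, the resource may be named 'punkt_tab' instead of 'punkt'.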

Tokenizing sentences into words


Splitting a sentence into words, or creating a list of words from a string, is an essential part of every text processing activity. Let us understand it with the help of the various functions/modules provided by the nltk.tokenize package.

word_tokenize module
The word_tokenize module is used for basic word tokenization. The following example uses this module to split a sentence into words.

Example
import nltk
from nltk.tokenize import word_tokenize
print(word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.'))

Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']

TreebankWordTokenizer Class
The word_tokenize module used above is basically a wrapper function that calls the tokenize() method on an instance of the TreebankWordTokenizer class. It gives the same output as the word_tokenize() module when splitting sentences into words. Let us see the same example implemented above −

Example
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −
from nltk.tokenize import TreebankWordTokenizer
Next, create an instance of the TreebankWordTokenizer class as follows −
tokenizer_wrd = TreebankWordTokenizer()
Now, input the sentence you want to convert to tokens −
tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output
[
'Tutorialspoint.com', 'provides', 'high', 'quality',
'technical', 'tutorials', 'for', 'free', '.'
]

Complete implementation example


Let us see the complete implementation example below −
import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
print(tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.'))

Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']

The most significant convention of this tokenizer is to separate contractions. For example, if we use the word_tokenize() module for this purpose, it gives the following output −

Example
import nltk
from nltk.tokenize import word_tokenize
print(word_tokenize("won't"))
Output
['wo', "n't"]

Such splitting of contractions by TreebankWordTokenizer may be unacceptable for some applications. That is why we have two alternative word tokenizers, namely PunktWordTokenizer and WordPunctTokenizer.

WordPunctTokenizer Class
An alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with
the following simple example −

Example
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize(" I can't allow you to go home early"))
Output
['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']

Tokenizing text into sentences


In this section, we are going to split a text/paragraph into sentences. NLTK provides the sent_tokenize module for this purpose.

Why is it needed?
An obvious question that came in our mind is that when we have word tokenizer then why do
we need sentence tokenizer or why do we need to tokenize text into sentences. Suppose we need to
count average words in sentences, how we can do this? For accomplishing this task, we need both
sentence tokenization and word tokenization.
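
For instance, here is a minimal sketch of such a computation; the sample text is reused from the example below −
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."

# Split the text into sentences, split each sentence into words,
# and average the word counts over the sentences.
sentences = sent_tokenize(text)
avg_words = sum(len(word_tokenize(s)) for s in sentences) / len(sentences)
print(avg_words)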
Let us understand the difference between the sentence and word tokenizer with the help of the following simple example −

Example
import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
print(sent_tokenize(text))
Output
['Let us understand the difference between sentence & word tokenizer.', 'It is going to be a simple example.']

Tokenization using regular expressions


If you feel that the output of the word tokenizer is unacceptable and you want complete control over how the text is tokenized, you can use regular expressions while tokenizing. NLTK provides the RegexpTokenizer class to achieve this.
Let us understand the concept with the help of the two examples below.
In the first example, we will use a regular expression that matches alphanumeric tokens plus single quotes, so that we don't split contractions like "won't".

Example 1
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
print(tokenizer.tokenize("won't is a contraction."))
print(tokenizer.tokenize("can't is a contraction."))
Output
["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']

In the second example, we will use a regular expression to tokenize on whitespace.

Example 2
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = True)
print(tokenizer.tokenize("won't is a contraction."))
Output
["won't", 'is', 'a', 'contraction.']

From the above output, we can see that the punctuation remains in the tokens. The parameter gaps = True means the pattern identifies the gaps to tokenize on. On the other hand, if we use the gaps = False parameter, the pattern is used to identify the tokens themselves, which can be seen in the following example −
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = False)
print(tokenizer.tokenize("won't is a contraction."))
Output
[' ', ' ', ' ']

Here the tokens are just the whitespace runs matched by the pattern, which is clearly not what we want. This shows that gaps = True is required when the pattern describes the separators rather than the tokens.

Reference
https://www.tutorialspoint.com/natural_language_toolkit/natural_language_toolkit_tokenizing_text.htm
