Chapter 3

NLP - Tokenizing

What is Tokenizing?

Tokenizing may be defined as the process of breaking up a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
NLP is used to build applications such as sentiment analysis, QA systems, language translation, smart chatbots and voice systems, and in order to build them it is vital to understand the patterns in the text. The tokens mentioned above are very useful in finding and understanding these patterns. We can consider tokenization as the base step for other recipes such as stemming and lemmatization.
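
For instance, here is a minimal sketch of tokenization serving as the base step for stemming; the sample sentence is just illustrative −
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stemming operates on individual tokens, so the text must be
# tokenized into words before the stemmer can be applied.
tokens = word_tokenize('The strikers are striking.')
print([stemmer.stem(token) for token in tokens])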

NLTK package
nltk.tokenize is the package provided by NLTK to achieve the process of tokenization.
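
Note that NLTK's pre-trained tokenizers are shipped as separate data files, so a one-time download is needed before the examples below will run; a minimal setup, assuming an internet connection, looks like this −
import nltk

# word_tokenize and sent_tokenize rely on the pre-trained Punkt models,
# which are distributed separately from the nltk package itself.
nltk.download('punkt')
On recent NLTK versions, the resource may be named 'punkt_tab' instead of 'punkt'.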

Tokenizing sentences into words


Splitting a sentence into words, or creating a list of words from a string, is an essential part of every text processing activity. Let us understand it with the help of the various functions/modules provided by the nltk.tokenize package.

word_tokenize module
The word_tokenize module is used for basic word tokenization. The following example uses this module to split a sentence into words.

Example
import nltk
from nltk.tokenize import word_tokenize
print(word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.'))

Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']

TreebankWordTokenizer Class
The word_tokenize module used above is basically a wrapper function that calls the tokenize() method on an instance of the TreebankWordTokenizer class. It gives the same output as the word_tokenize() module when splitting sentences into words. Let us see the same example implemented above −

Example
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −
from nltk.tokenize import TreebankWordTokenizer
Next, create an instance of the TreebankWordTokenizer class as follows −
tokenizer_wrd = TreebankWordTokenizer()
Now, input the sentence you want to convert to tokens −
tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output
[
'Tutorialspoint.com', 'provides', 'high', 'quality',
'technical', 'tutorials', 'for', 'free', '.'
]

Complete implementation example


Let us see the complete implementation example below −
import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
print(tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.'))

Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']

The most significant convention of this tokenizer is to separate contractions. For example, if we use the word_tokenize() module for this purpose, it gives the following output −

Example
import nltk
from nltk.tokenize import word_tokenize
print(word_tokenize("won't"))
Output
['wo', "n't"]

Such splitting of contractions by TreebankWordTokenizer may be unacceptable for some applications. That is why we have two alternative word tokenizers, namely PunktWordTokenizer and WordPunctTokenizer.

WordPunctTokenizer Class
An alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with
the following simple example −

Example
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize(" I can't allow you to go home early"))
Output
['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']

Tokenizing text into sentences


In this section, we are going to split a text/paragraph into sentences. NLTK provides the sent_tokenize module for this purpose.

Why is it needed?
An obvious question that came in our mind is that when we have word tokenizer then why do
we need sentence tokenizer or why do we need to tokenize text into sentences. Suppose we need to
count average words in sentences, how we can do this? For accomplishing this task, we need both
sentence tokenization and word tokenization.
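
For instance, here is a minimal sketch of such a computation; the sample text is reused from the example below −
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."

# Split the text into sentences, split each sentence into words,
# and average the word counts over the sentences.
sentences = sent_tokenize(text)
avg_words = sum(len(word_tokenize(s)) for s in sentences) / len(sentences)
print(avg_words)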
Let us understand the difference between the sentence and word tokenizer with the help of the following simple example −

Example
import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
print(sent_tokenize(text))
Output
['Let us understand the difference between sentence & word tokenizer.', 'It is going to be a simple example.']

Tokenization using regular expressions


If you feel that the output of the word tokenizer is unacceptable and you want complete control over how the text is tokenized, you can use regular expressions while tokenizing. NLTK provides the RegexpTokenizer class to achieve this.
Let us understand the concept with the help of the two examples below.
In the first example, we will use a regular expression that matches alphanumeric tokens plus single quotes, so that we don't split contractions like "won't".

Example 1
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
print(tokenizer.tokenize("won't is a contraction."))
print(tokenizer.tokenize("can't is a contraction."))
Output
["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']

In the second example, we will use a regular expression to tokenize on whitespace.

Example 2
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = True)
print(tokenizer.tokenize("won't is a contraction."))
Output
["won't", 'is', 'a', 'contraction.']

From the above output, we can see that the punctuation remains in the tokens. The parameter gaps = True means the pattern identifies the gaps to tokenize on. On the other hand, if we use the gaps = False parameter, the pattern is used to identify the tokens themselves, which can be seen in the following example −
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = False)
print(tokenizer.tokenize("won't is a contraction."))
Output
[' ', ' ', ' ']

Here the tokens are just the whitespace runs matched by the pattern, which is clearly not what we want. This shows that gaps = True is required when the pattern describes the separators rather than the tokens.

Reference
https://www.tutorialspoint.com/natural_language_toolkit/natural_language_toolkit_tokenizing_text.htm
