NLTK

The Natural Language Toolkit (NLTK) is a Python library for processing human language data, enabling tasks such as tokenization, stemming, and part-of-speech tagging. It can be installed via pip and supports a range of operations on text, including breaking paragraphs into sentences and sentences into words, as well as reducing words to their base forms through stemming and lemmatization.


What is the Natural Language Toolkit (NLTK)?

As discussed earlier, NLTK is Python’s library for performing an array
of tasks on human language. It can perform a variety of operations on textual
data, such as classification, tokenization, stemming, tagging,
parsing, semantic reasoning, etc.
Installation:
NLTK can be installed simply using pip by running the following command (prefix it with ! when working inside a notebook).
pip install nltk
Accessing Additional Resources:
To incorporate additional resources, such as resources for
languages other than English, you can run the following in a Python
script. This has to be done only once, the first time you run NLTK on
your system.
import nltk
nltk.download('all')
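If you would rather not download everything, you can fetch only the resources a given task needs. A lighter alternative is sketched below; these are the standard NLTK data package names used by the examples in this tutorial.
import nltk
# tokenizer models used by word_tokenize and sent_tokenize
nltk.download('punkt')
# WordNet data used by WordNetLemmatizer
nltk.download('wordnet')
# tagger model used by pos_tag
nltk.download('averaged_perceptron_tagger')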
Now, having installed NLTK successfully in our system, let’s perform
some basic operations on text data using NLTK.
Tokenization
Tokenization refers to breaking text down into smaller units. It entails
splitting paragraphs into sentences and sentences into words. It is one of
the initial steps of any NLP pipeline. Let us have a look at the two major
kinds of tokenization that NLTK provides:
Word Tokenization
It involves breaking down the text into words.
"I study Machine Learning on Nielit." will be word-tokenized as
['I', 'study', 'Machine', 'Learning', 'on', 'Nielit', '.'].
Sentence Tokenization
It involves breaking down the text into individual sentences.
Example:
"I study Machine Learning on Nielit. Currently, I'm studying NLP"
will be sentence-tokenized as
['I study Machine Learning on Nielit.', 'Currently, I'm studying
NLP.']
In Python, both these tokenizations can be implemented in NLTK as
follows:
# Tokenization using NLTK
from nltk import word_tokenize, sent_tokenize

sent = "Nielit is a great learning platform. \
It is one of the best for Computer Science students."
print(word_tokenize(sent))
print(sent_tokenize(sent))
Output:
['Nielit', 'is', 'a', 'great', 'learning', 'platform', '.',
'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Computer',
'Science', 'students', '.']
['Nielit is a great learning platform.',
'It is one of the best for Computer Science students.']
Stemming and Lemmatization

When working with Natural Language, we are not much interested in the
form of words – rather, we are concerned with the meaning that the words
intend to convey. Thus, we try to map every word of the language to its
root/base form. This process is called canonicalization.
E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same
action – hence, we can map them all to their base form i.e. ‘play’.
Now, there are two widely used canonicalization
techniques: Stemming and Lemmatization.
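Before examining each technique in detail, here is a small preview sketch contrasting the two on the same words. Both classes are covered in depth below; the word list is purely illustrative.
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "studying", "communication"]:
    # rule-based affix stripping vs. dictionary lookup of the lemma
    print(word, "->", porter.stem(word), "|", lemmatizer.lemmatize(word, 'v'))
Output:
studies -> studi | study
studying -> studi | study
communication -> commun | communication
Notice that the stemmer can produce non-words (‘studi’, ‘commun’), while the lemmatizer always returns a dictionary form.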

Stemming

Stemming generates the base word from the inflected word by removing
the affixes of the word. It has a set of pre-defined rules that govern the
dropping of these affixes. It must be noted that stemmers might not
always result in semantically meaningful base words. Stemmers are
faster and computationally less expensive than lemmatizers.
In the following code, we will be stemming words using Porter Stemmer –
one of the most widely used stemmers:
from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))
Output:
play
play
play
play
We can see that all the variations of the word ‘play’ have been reduced to
the same word, ‘play’. In this case, the output is a meaningful word.
However, this is not always the case. Let us take an example.
from nltk.stem import PorterStemmer
# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("Communication"))
Output:
commun
The stemmer reduces the word ‘communication’ to the base word ‘commun’,
which is meaningless in itself.

Lemmatization

Lemmatization involves grouping together the inflected forms of the same
word. This way, we can reach the base form of any word, and that base form
will be meaningful in nature. The base form here is called the lemma.
Note that these groups of inflected forms are stored in the lemmatizer’s
dictionary; there is no rule-based removal of affixes as in the case of a
stemmer. Lemmatizers are slower and computationally more expensive than
stemmers.
Example:
'play', 'plays', 'played', and 'playing' have 'play' as the lemma.
In Python, lemmatization can be implemented in NLTK as
follows:
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
# (uses the WordNet data downloaded earlier via nltk.download)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))
Output:
play
play
play
play
Please note that with lemmatizers, we need to pass the part of speech of the
word along with the word itself as a function argument.
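To see why this argument matters, consider the small sketch below: when it is omitted, the lemmatizer falls back to its default part of speech (noun), so verb forms pass through unchanged.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# no POS given: 'playing' is treated as a noun and left as-is
print(lemmatizer.lemmatize("playing"))
# POS 'v' given: 'playing' is reduced to the verb lemma
print(lemmatizer.lemmatize("playing", 'v'))
Output:
playing
play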
Also, lemmatizers always result in meaningful base words. Let us take the
same example as we took in the case of stemmers.
from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("Communication", 'v'))
Output:
Communication
Part of Speech Tagging
Part of Speech (POS) tagging refers to assigning each word of a sentence
to its part of speech. It is significant as it helps to give a better syntactic
overview of a sentence.
Example:
"GeeksforGeeks is a Computer Science platform."
Let's see how NLTK's POS tagger will tag this sentence.
In Python, POS tagging can be implemented in NLTK as
follows:
from nltk import pos_tag
from nltk import word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)
print(tags)
Output:
[('GeeksforGeeks', 'NNP'),
('is', 'VBZ'),
('a', 'DT'),
('Computer', 'NNP'),
('Science', 'NNP'),
('platform', 'NN'),
('.', '.')]
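A natural next step, sketched below, is to feed these tags into the lemmatizer from the previous section. Note that penn_to_wordnet is an illustrative helper written for this example, not an NLTK function; it maps Penn Treebank tags onto the POS codes that WordNetLemmatizer expects.
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS code.
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default: noun

lemmatizer = WordNetLemmatizer()
text = "GeeksforGeeks is a Computer Science platform."
for word, tag in pos_tag(word_tokenize(text)):
    print(word, tag, "->", lemmatizer.lemmatize(word, penn_to_wordnet(tag)))
For instance, ‘is’ is tagged VBZ and is therefore lemmatized to ‘be’, which the default noun setting would have missed.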
