NLP Practicals All
Practical No. 01
Roll No : 9070
Tokenization
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into
smaller units, such as individual words or terms. Each of these smaller units is called a token.
For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’.
Tokenization can be done at either the word or the sentence level. If the text is split into words using
some separation technique it is called word tokenization, and the same separation done for sentences
is called sentence tokenization.
In the process of tokenization, some characters like punctuation marks may be discarded.
Before processing a natural language, we need to identify the words that constitute a string of
characters. That’s why tokenization is the most basic step to proceed with NLP (text data). This is
important because the meaning of the text could easily be interpreted by analyzing the words
present in the text.
“This is a cat.”
What do you think will happen after we perform tokenization on this string? We get [‘This’, ‘is’, ‘a’,
‘cat’].
There are numerous uses of doing this; the tokenized form feeds nearly every subsequent step of text processing.
White space tokenization is the simplest tokenization technique. Given a sentence or paragraph, it
tokenizes the input into words by splitting it whenever white space is encountered. This is the
fastest tokenization technique, but it only works well for languages in which white space breaks the
sentence apart into meaningful words, e.g. English and Hindi.
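As a quick illustration (not part of the original notebook), whitespace tokenization needs nothing more than Python's built-in split():

text = "It is raining"
print(text.split())   # ['It', 'is', 'raining']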
Tokenisation with NLTK: NLTK is a standard Python library with prebuilt functions and utilities for
ease of use and implementation. It is one of the most used libraries for natural language
processing and computational linguistics.
The Natural Language Toolkit has a very important module, tokenize, which comprises the sub-modules
1. word tokenize (word_tokenize)
2. sentence tokenize (sent_tokenize)
import nltk
nltk.download('all')
'''
Tokenization of words
We use the method word_tokenize() to split a sentence into words. '''
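Only the last line of the original tokenizer-comparison cell survived this export. A minimal sketch, assuming a sample sentence reconstructed from the printed output (the multi-word expressions registered with MWETokenizer in the original cell are not preserved, so the pair used here is only illustrative):

from nltk.tokenize import word_tokenize, TreebankWordTokenizer, WordPunctTokenizer, MWETokenizer

text = "Hello All, This is first practical session in NLP Lab"

print("word_tokenize", word_tokenize(text))
print("TreebankWordTokenizer", TreebankWordTokenizer().tokenize(text))
print("WordPunctTokenizer", WordPunctTokenizer().tokenize(text))

# multi-word expression tokenizer; the MWE pair below is an illustrative assumption
tokenizer = MWETokenizer([('NLP', 'Lab')], separator='_')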
print("MWETokenizer",tokenizer.tokenize(text.split()))
word_tokenize ['Hello', 'All', ',', 'This', 'is', 'first', 'practical', 'session', 'in',
TreebankWordTokenizer ['Hello', 'All', ',', 'This', 'is', 'first', 'practical', 'session
WordPunctTokenizer ['Hello', 'All', ',', 'This', 'is', 'first', 'practical', 'session',
MWETokenizer ['Hello', 'All,', 'This', 'is', 'first', 'practical', 'session', 'in', 'NLP
'''
Tokenization of sentences
We use the method sent_tokenize() to split a paragraph into sentences. '''
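The code cell itself is not preserved in this export; a minimal sketch, assuming the two sample paragraphs are the ones reconstructed from the printed output below, is:

from nltk.tokenize import sent_tokenize

para_hi = "एनएलपी बढ़िया है! मैंने एक मुक्त कोर्टेरा कपोन जीता ! चलो एनएलपी का अध्ययन शुरू करते हैं |"
para_en = "NLP is Great. I won a free Coursera cupon. Lets start studying NLP."

print(sent_tokenize(para_hi))
print(sent_tokenize(para_en))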
['एनएलपी बढ़िया है!', 'मैंने एक मुक्त कोर्टेरा कपोन जीता !', 'चलो एनएलपी का अध्ययन शुरू करते हैं
['NLP is Great.', 'I won a free Coursera cupon.', 'Lets start studying NLP.']
'\nThe sent_tokenize function uses an instance of PunktSentenceTokenizer from the
nltk.tokenize.punkt module, \nwhich has already been trained and thus knows the punctuation that marks the start and end of sentences.'
Student Task:
Write a code to demonstrate Tokenization at word and sentence
level in Hindi Language
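The student-task cell did not survive the export; a minimal sketch that would produce word tokens like the ones shown below (assuming the Hindi sentence visible in the output) is:

from nltk.tokenize import word_tokenize, sent_tokenize

hindi_text = "मेरा नाम रितेश है, मैं पिल्लई कॉलेज में पढ़ता हूँ"
print(word_tokenize(hindi_text))
print(sent_tokenize(hindi_text))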
['मेरा',
'नाम',
'रितेश',
'है',
',',
'मैं',
'पिल्लई',
'कॉलेज',
'में',
'पढ़ता',
'हूँ']
Filtration
One of the key steps in processing language data is to remove noise so that the machine can more
easily detect the patterns in the data. Text data contains a lot of noise in the form of
special characters such as hashtags, punctuation and numbers, all of which are difficult for
computers to understand if they are present in the data. We therefore need to process the data to
remove these elements. Additionally, it is important to pay some attention to the casing of
words: if we include both upper-case and lower-case versions of the same words, the computer
will treat them as different entities, even though they may be the same word.
def filter_text(inText, lowerFlag=False, upperFlag=False, numberFlag=False, htmlFlag=False, urlFlag=False,
                punctFlag=False, spaceFlag=False, hashtagFlag=False, emojiFlag=False):
    if lowerFlag:
        inText = inText.lower()
    if upperFlag:
        inText = inText.upper()
    if numberFlag:
        import re
        inText = re.sub(r"\d+", '', inText)            # drop digits
    if htmlFlag:
        import re
        inText = re.sub(r'<[^>]*>', '', inText)        # drop HTML tags
    if urlFlag:
        import re
        inText = re.sub(r'(https?|ftp|www)\S+', '', inText)   # drop URLs
    if punctFlag:
        import re
        import string
        exclist = string.punctuation  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        # remove punctuation characters from the text
        inText = inText.translate(str.maketrans('', '', exclist))
    if spaceFlag:
        import re
        inText = re.sub(' +', " ", inText).strip()     # collapse multiple spaces
    if hashtagFlag:
        pass
        # Students Task
    if emojiFlag:
        pass
        # Students Task
    return inText
usrText = input()
filter_text(usrText, urlFlag=True)
filter_text(usrText, punctFlag=True,spaceFlag=True)
Student Task:
Modify the above code to demonstrate filtration of hashtag words and certain emojis, and keep
certain punctuation marks if they join two words.
def filter_text(inText, lowerFlag=False, upperFlag=False, numberFlag=False, htmlFlag=False, urlFlag=False,
                punctFlag=False, spaceFlag=False, hashtagFlag=False, emojiFlag=False):
    if hashtagFlag:
        import re
        inText = re.sub(r"#\w+", '', inText)  # drop hashtag words such as #nlp
    return inText
usrText = input()
usrText = input()
Practical No. 02
Write a program to perform Script Validation and identify Stop Words of English and
Hindi Text
Script Validation
In script validation, foreign words (words which do not belong to the required input language) are detected and removed. In the sentence
“विदेशी को हटाना hoga आज” the word “hoga” is a Hindi word written using English (Latin) characters. During script validation, depending on the NLP
application's requirements, the word hoga will either be removed or transliterated into the Devanagari script as “होगा”.
import nltk
nltk.download('all')
"""
For Script validation, we use Unicodes of the characters """
def detectLang(inText, charFlag=False, wordFlag=False, sentenceFlag=False, lang="EN"):
    if charFlag:
        # single character: check whether its code point lies in the expected Unicode block
        if len(inText) == 1 and lang == "EN":
            if ord(inText) in list(range(65, 123)):      # basic Latin letters
                return "EN"
        if len(inText) == 1 and lang == "HI":
            if ord(inText) in list(range(2304, 2432)):   # Devanagari block
                return "HI"
    if wordFlag:
        # word: every character must lie in the expected Unicode block
        if len(inText) > 1 and lang == "EN":
            for x in inText:
                if ord(x) not in list(range(65, 123)):
                    return "Not Found"
            return "EN"
        if len(inText) > 1 and lang == "HI":
            for x in inText:
                if ord(x) not in list(range(2304, 2432)):
                    return "Not Found"
            return "HI"
    if sentenceFlag:
        pass
#https://en.wikipedia.org/wiki/List_of_Unicode_characters
#https://jrgraphix.net/r/Unicode/0020-007F
'Not Found'
detectLang("हेलो",wordFlag=True,lang="EN")
'Not Found'
detectLang("ह",charFlag=True,lang="HI")
'HI'
detectLang("H",charFlag=True,lang="EN")
'EN'
Student Task:
Modify the above code to demonstrate Sentence level script validation for Hindi and English language
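One possible approach for this task (a sketch, not the original solution) is to reuse the word-level detectLang defined above: split the sentence on white space, check each word, and keep only the words whose script matches the requested language:

def validateSentence(inText, lang="EN"):
    # keep only the words whose characters fall in the expected Unicode range
    valid_words = []
    for word in inText.split():
        if detectLang(word, wordFlag=True, lang=lang) == lang:
            valid_words.append(word)
    return " ".join(valid_words)

print(validateSentence("विदेशी को हटाना hoga आज", lang="HI"))   # drops the foreign word "hoga"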
Stopwords
The process of converting data into something a computer can understand is referred to as pre-processing. One of the major forms of pre-
processing is to filter out useless data. In natural language processing, such useless words (data) are referred to as stop words.
Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both
when indexing entries for searching and when retrieving them as the result of a search query.
We would not want these words to take up space in our database or to take up valuable processing time. We can remove them easily
by storing a list of words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has lists of stopwords stored for 16
different languages. You can find them in the nltk_data directory; home/pratima/nltk_data/corpora/stopwords is the directory address (do not
forget to change your home directory name).
The following program tokenizes the sentence, identifies and removes stop words from a piece of text.
import nltk
nltk.download('all')
"""
Stop Word Identification """
txt = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the inter
#print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))
#print(stop_words)
word_tokens = word_tokenize(txt)
for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print("After removing the recognized stopwords, the Tokens of sentence is:", filtered_sentence)
Sentence is: Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned wit
Tokens in the above sentence: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'c
StopWords recognized in the given sentence: ['and', 'then', 'in', 'a', 'of', 'them', 'with', 'how', 'is', 'between', 'can', 'the', 'to',
After removing the recognized stopwords, the Tokens of sentence is: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield',
Student Task:
Identify stop words at sentence level for Hindi language
"""
Stop Word Identification hindi"""
txt = "राष्ट्र भाषा हिंदी हमारे राष्ट्र की शान है। भारत की समानता की धुरी है। भारत की संस्कृ ति और सभ्यता की मूल चेतना को शुद्धता से अभिव्यक्त करने का माध्यम है । राष्ट्रीय विचारों
#print(stopwords.words('english'))
#stop_words = set(stopwords.words('english'))
#print(stop_words)
word_tokens = word_tokenize(txt)
for w in word_tokens:
if w not in stopwords:
filtered_sentence.append(w)
https://colab.research.google.com/drive/1394QLnepmlBMs8qOki1apP0rA0gbQzp7?authuser=1#scrollTo=dvL1rB6J0s21&printMode=true 4/5
3/22/23, 9:11 PM 9070_Exp.2.ipynb - Colaboratory
print("After removing the recognized stopwords, the Tokens of sentence is:", filtered_sentence)
Sentence is: राष्ट्र भाषा हिंदी हमारे राष्ट्र की शान है। भारत की समानता की धुरी है। भारत की संस्कृ ति और सभ्यता की मूल चेतना को शुद्धता से अभिव्यक्त करने का माध्यम है
Tokens in the above sentence: ['राष्ट्र भाषा', 'हिंदी', 'हमारे ', 'राष्ट्र ', 'की', 'शान', 'है।', 'भारत', 'की', 'समानता', 'की', 'धुरी', 'है।', 'भारत', 'की',
StopWords recognized in the given sentence: ['है', 'होने', 'की', 'को', 'से', 'और', 'करने']
After removing the recognized stopwords, the Tokens of sentence is: ['राष्ट्र भाषा', 'हिंदी', 'हमारे ', 'राष्ट्र ', 'शान', 'है।', 'भारत', 'समानता', 'धुरी',
Practical No. 03
Write a program to identify Stem and Lemma of English and Hindi Text
Stemming
import nltk
nltk.download('all')
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["program", "programs", "programmer", "programming", "programmers"]
for w in words:
    print(w, " : ", ps.stem(w))
program : program
programs : program
programmer : programm
programming : program
programmers : programm
# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "We wake up early in the morning, and do some good work." "Programmers program with programming languages." "People comes to consultants office to consult the consultant."
words = word_tokenize(sentence)
for w in words:
    print(w, " : ", ps.stem(w))
We : we
wake : wake
up : up
early : earli
in : in
the : the
morning : morn
, : ,
and : and
do : do
some : some
good : good
work.Programmers : work.programm
program : program
with : with
programming : program
languages.People : languages.peopl
comes : come
to : to
consultants : consult
office : offic
to : to
consult : consult
the : the
consultant : consult
. : .
Lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
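The print statements that produced the word-level outputs below (rocks, corpora, better, are) are not preserved; a plausible reconstruction, assuming the standard lemmatizer examples, is:

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("better :", lemmatizer.lemmatize("better", pos="a"))   # 'a' = adjective
print("are :", lemmatizer.lemmatize("are"))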
print("\n\n")
sentence = "We like dancing. The dance was really great." "We wake up early in the morning, and do some good work." "Programmers program with
words = word_tokenize(sentence)
for w in words:
print(w, " : ", lemmatizer.lemmatize(w, pos='v'))
rocks : rock
corpora : corpus
better : good
are : are
We : We
like : like
dancing : dance
. : .
The : The
dance : dance
was : be
really : really
great.We : great.We
wake : wake
up : up
early : early
in : in
the : the
morning : morning
, : ,
and : and
do : do
some : some
good : good
work.Programmers : work.Programmers
program : program
with : with
programming : program
languages : languages
. : .
#############################STUDENT TASK############################
#Perform Sentence level Stemming and Lematization for Hindi and English language.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
porter = PorterStemmer()
lemmatizer=WordNetLemmatizer()
sentence="We like singing a song. The song was beautifully written." #word_list = ["we", "like", "singing", "a", "song"]
words=word_tokenize(sentence)
print("{0:20}{1:20}{2:20}".format("ORIGINAL","STEMMING","LEMMATIZATION"))
for word in words:
    print("{0:20}{1:20}{2:20}".format(word, porter.stem(word), lemmatizer.lemmatize(word, pos="v")))
Practical No. 04
Imagine listening to someone as they speak and trying to guess the next word that they are
going to say. For example, what word is likely to follow these sentence fragments? “I’d like to
make a . . .” / “Please hand over your . . .”
Guessing the next word (or word prediction) is an essential subtask of speech recognition,
hand-writing recognition, augmentative communication for the disabled, and spelling error
detection.
In such tasks, word-identification is difficult because the input is very noisy and ambiguous.
Thus looking at previous words can give us an important cue about what the next ones are
going to be.
This practical introduces N-gram models, which predict the next word from the previous N − 1 words.
Such statistical models of word sequences are also called language models or LMs.
Computing the probability of the next word will turn out to be closely related to computing the
probability of a sequence of words.
The following sequence, for example, has a non-zero probability of appearing in a text: “. . . all
of a sudden I notice three guys standing on the sidewalk . . .”,
while this same set of words in a different order has a much, much lower probability: “on guys
all I of notice sidewalk three a sudden standing the”.
It can also help to make spelling error corrections.
For instance, the sentence “drink cofee” could be corrected to “drink coffee” if you knew that
the word “coffee” had a high probability of occurrence after the word “drink” and also the
overlap of letters between “cofee” and “coffee” is high
Let’s start with equation P(w|h), the probability of word w, given some history, h. For example,
P(The|its water is so transparent that) Here,
w = The
h = its water is so transparent that
And, one way to estimate the above probability function is through the relative frequency
count approach, where you would take a substantially large corpus, count the number of times
you see its water is so transparent that, and then count the number of times it is followed by
the.
In other words, you are answering the question: out of the times you saw the history h, how
many times was it followed by the word w?
P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
You can imagine, it is not feasible to perform this over an entire corpus; especially if it is of a
significant size.
This shortcoming and ways to decompose the probability function using the chain rule serves
as the base intuition of the N-gram model. Here, you, instead of computing probability using
the entire corpus, would approximate it by just a few historical words
import nltk
nltk.download('all')
import nltk
from nltk.util import ngrams
data = 'An n-gram is a contiguous sequence of n items from a given sample of text or speech.'
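The loop that generated the n-grams below is not preserved in this export; a minimal sketch that produces output in the same format (and which would then be repeated on a Hindi sentence for the second block of output) is:

tokens = data.split()
for n in range(1, 5):
    grams = [' '.join(g) for g in ngrams(tokens, n)]
    print(f"{n}-gram: {grams}")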
1-gram: ['An', 'n-gram', 'is', 'a', 'contiguous', 'sequence', 'of', 'n', 'items', 'from
2-gram: ['An n-gram', 'n-gram is', 'is a', 'a contiguous', 'contiguous sequence', 'sequ
3-gram: ['An n-gram is', 'n-gram is a', 'is a contiguous', 'a contiguous sequence', 'co
4-gram: ['An n-gram is a', 'n-gram is a contiguous', 'is a contiguous sequence', 'a con
1-gram: ['शब्दों', 'या', 'वर्णों', 'का', 'एक', 'अनुक्रमिक', 'अवयव', 'हो', 'सकता', 'है', '|']
2-gram: ['शब्दों या', 'या वर्णों', 'वर्णों का', 'का एक', 'एक अनुक्रमिक', 'अनुक्रमिक अवयव', 'अवयव
3-gram: ['शब्दों या वर्णों', 'या वर्णों का', 'वर्णों का एक', 'का एक अनुक्रमिक', 'एक अनुक्रमिक अवयव'
4-gram: ['शब्दों या वर्णों का', 'या वर्णों का एक', 'वर्णों का एक अनुक्रमिक', 'का एक अनुक्रमिक अवयव'
A language model learns to predict the probability of a sequence of words. But why do we need to
learn the probability of words? Let’s understand that with an example.
One of the uses of a language model is in machine translation: you take in a bunch of words from one
language and convert these words into another language. There can be many potential
translations that a system might give you, and you will want to compute the probability of each of
these translations to understand which one is the most accurate.
In the above example, we know that the probability of the first sentence will be more than the
second, right? That’s how we arrive at the right translation.
This ability to model the rules of a language as a probability gives great power for NLP related
tasks. Language models are used in speech recognition, machine translation, part-of-speech
tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval, and
many other daily tasks.
Statistical Language Models: These models use traditional statistical techniques like N-
grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability
distribution of words
Neural Language Models: These are new players in the NLP town and have surpassed the
statistical language models in their effectiveness. They use different kinds of Neural
Networks to model language
An N-gram language model predicts the probability of a given N-gram within any sequence of words
in the language. If we have a good N-gram model, we can predict p(w | h) – what is the probability
of seeing the word w given a history of previous words h – where the history contains n-1 words.
P(w1 ... wn) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · P(w4 | w1 w2 w3) · ... · P(wn | w1 ... wn−1)
So what is the chain rule? It tells us how to compute the joint probability of a sequence by using the
conditional probability of a word given previous words.
But we do not have access to these conditional probabilities with complex conditions of up to n-1
words. So how do we proceed?
This is where we introduce a simplifying assumption. We can assume, for all conditions, that:
P(wk | w1 ... wk−1) ≈ P(wk | wk−1)
Here, we approximate the history (the context) of the word wk by looking only at the last word of the
context. This assumption is called the Markov assumption. (We used it here with a simplified
context of length 1 – which corresponds to a bigram model – we could use larger fixed-sized
histories in general).
Building a Basic Language Model Now that we understand what an N-gram is, let’s build a basic
language model using trigrams of the Reuters corpus. Reuters corpus is a collection of 10,788
news documents totaling 1.3 million words. We can build a language model in a few lines of code
using the NLTK package:
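The code cell itself is not preserved in this export; following the Analytics Vidhya guide cited in the references at the end of this practical, a sketch of the trigram model would look like:

from nltk.corpus import reuters
from nltk import trigrams
from collections import defaultdict

# count how often each word follows each pair of words
model = defaultdict(lambda: defaultdict(lambda: 0))
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# turn the counts into conditional probabilities P(w3 | w1, w2)
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count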
In the above, we first split our text into trigrams with the help of NLTK and then calculate the
frequency with which each combination of trigrams occurs in the dataset.
We then use it to calculate the probability of a word given the previous two words. That’s essentially
what gives us our language model!
Let’s make simple predictions with this language model. We will start with two simple words –
“today the”. We want our model to tell us what will be the next word:
model["today","the"]
defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'public': 0.05555555555555555,
'European': 0.05555555555555555,
'Bank': 0.05555555555555555,
'price': 0.1111111111111111,
'emirate': 0.05555555555555555,
'overseas': 0.05555555555555555,
'newspaper': 0.05555555555555555,
'company': 0.16666666666666666,
'Turkish': 0.05555555555555555,
'increase': 0.05555555555555555,
'options': 0.05555555555555555,
'Higher': 0.05555555555555555,
'pound': 0.05555555555555555,
'Italian': 0.05555555555555555,
'time': 0.05555555555555555})
So we get predictions of all the possible words that can come next with their respective
probabilities. Now, if we pick up the word “price” and again make a prediction for the words “the”
and “price”:
dict(model["the","price"])
{'yesterday': 0.004651162790697674,
'of': 0.3209302325581395,
'it': 0.05581395348837209,
'effect': 0.004651162790697674,
'cut': 0.009302325581395349,
'for': 0.05116279069767442,
'paid': 0.013953488372093023,
'to': 0.05581395348837209,
'increases': 0.013953488372093023,
'used': 0.004651162790697674,
'climate': 0.004651162790697674,
'.': 0.023255813953488372,
'cuts': 0.009302325581395349,
'reductions': 0.004651162790697674,
'limit': 0.004651162790697674,
'now': 0.004651162790697674,
'moved': 0.004651162790697674,
'per': 0.013953488372093023,
'adjustments': 0.004651162790697674,
'(': 0.009302325581395349,
'slumped': 0.004651162790697674,
'is': 0.018604651162790697,
'move': 0.004651162790697674,
'evolution': 0.004651162790697674,
'differentials': 0.009302325581395349,
'went': 0.004651162790697674,
'the': 0.013953488372093023,
'factor': 0.004651162790697674,
'Royal': 0.004651162790697674,
',': 0.018604651162790697,
'again': 0.004651162790697674,
'changes': 0.004651162790697674,
'holds': 0.004651162790697674,
'has': 0.009302325581395349,
'fall': 0.004651162790697674,
'-': 0.004651162790697674,
'from': 0.004651162790697674,
'base': 0.004651162790697674,
'on': 0.004651162790697674,
'review': 0.004651162790697674,
'while': 0.004651162790697674,
'collapse': 0.004651162790697674,
'being': 0.004651162790697674,
'at': 0.023255813953488372,
'outlook': 0.004651162790697674,
'rises': 0.004651162790697674,
'drop': 0.004651162790697674,
'guaranteed': 0.004651162790697674,
',"': 0.004651162790697674,
'stayed': 0.009302325581395349,
'structure': 0.004651162790697674,
'and': 0.004651162790697674,
'could': 0.004651162790697674,
'related': 0.004651162790697674,
'hike': 0.004651162790697674,
'we': 0.004651162790697674,
'adjustment': 0.023255813953488372,
'policy': 0.004651162790697674,
Limitations of the N-gram approach to language modeling. N-gram based language models do have a
few drawbacks:
The higher the N, the better the model usually is, but this leads to a lot of computation overhead that
requires large computation power in terms of RAM.
N-grams are a sparse representation of language. This is because we build the model based on the
probability of words co-occurring: it will give zero probability to all the words that are not present in the training corpus.
Ref: https://www.seerinteractive.com/blog/what-are-ngrams-and-uses-case/
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/
Practical No. 05
In computational linguistics, a frequency list is a sorted list of words (word types) together with
their frequency, where frequency here usually means the number of occurrences in a given corpus,
from which the rank can be derived as the position in the list.
A frequency distribution is an overview of all distinct values in some variable and the number of
times they occur. That is, a frequency distribution tells how frequencies are distributed over values.
Frequency distributions are mostly used for summarizing categorical variables.
Frequency Distribution: values and their frequency (how often each value occurs).
Example (newspapers): the numbers of newspapers sold at a local shop over the last 10
days can be summarized as a frequency distribution over the possible daily sales counts.
In NLTK, a FreqDist represents a frequency distribution for the outcomes of an experiment. A frequency distribution records the
number of times each outcome of an experiment has occurred. For example, a frequency
distribution could be used to record the frequency of each word type in a document. Formally, a
frequency distribution can be defined as a function mapping from each sample to the number of
times that sample occurred as an outcome.
The following code will produce a frequency distribution that encodes how often each word occurs in a
text:
#http://www.nltk.org/api/nltk.html?highlight=freqdist
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

sent = 'This is an example Sentence of another a SENTENCE which is a Sentence The sentence'
fdist = FreqDist()
for word in word_tokenize(sent):
    fdist[word.lower()] += 1
fdist.pprint()
fdist.N()
#Return the total number of sample outcomes that have been recorded by this FreqDist.
19
fdist.plot()
#Plot samples from the frequency distribution displaying the most frequent sample first. If a
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
fdist.tabulate()
#Tabulate the given samples from the frequency distribution (cumulative), displaying the most
Word cloud is a technique for visualising frequent words in a text where the size of the words
represents their frequency.
Word Cloud is a data visualization technique used for representing text data in which the size of
each word indicates its frequency or importance. Significant textual data points can be highlighted
using a word cloud. Word clouds are widely used for analyzing data from social network websites.
Word Clouds are a popular way of displaying how important words are in a collection of texts.
Basically, the more frequent the word is, the greater space it occupies in the image. One of the uses
of Word Clouds is to help us get an intuition about what the collection of texts is about. Here are
some classic examples of when Word Clouds can be useful:
Let’s suppose you want to build a text classification system. If you want to see what the
frequent words in the different categories are, you can build a word cloud for each category and
see which words are the most popular inside each category.
# Import packages
import wikipedia
result= wikipedia.page("natural")
final_result = result.content
print(final_result)
Nature, in the broadest sense, is the physical world or universe. "Nature" can refer t
The concept of nature as a whole, the physical universe, is one of several expansions
During the advent of modern scientific method in the last several centuries, nature b
== Earth ==
Earth is the only planet known to support life, and its natural features are the subj
Earth has evolved through geological and biological processes that have left traces of
The atmospheric conditions have been significantly altered from the original conditio
Geology is the science and study of the solid and liquid matter that constitutes the
The geology of an area evolves through time as rock units are deposited and inserted
Rock units are first emplaced either by deposition onto the surface or intrude into t
After the initial sequence of rocks has been deposited, the rock units can be deforme
Earth is estimated to have formed 4.54 billion years ago from the solar nebula, along
Continents formed, then broke up and reformed as the surface of Earth reshaped over h
The present era is classified as part of a mass extinction event, the Holocene extinct
The Earth's atmosphere is a key factor in sustaining the ecosystem. The thin layer of
Terrestrial weather occurs almost exclusively in the lower part of the atmosphere, an
Weather can have both beneficial and harmful effects. Extremes in weather, such as to
Climate is a measure of the long-term trends in the weather. Various factors are know
Water is a chemical substance that is composed of hydrogen and oxygen (H2O) and is vit
An ocean is a major body of saline water, and a principal component of the hydrospher
Nature, in the broadest sense, is the physical world or universe. "Nature" can refer to
The concept of nature as a whole, the physical universe, is one of several expansions of
During the advent of modern scientific method in the last several centuries, nature beca
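The cell that actually drew the word cloud from this page is not preserved; a minimal sketch using the wordcloud package (an assumption; install it with pip install wordcloud if needed) is:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# build a word cloud from the fetched Wikipedia text
cloud = WordCloud(width=800, height=400, background_color='white').generate(final_result)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()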
Student Tasks:
Task 1: Generate a word cloud from the Wikipedia page of Mumbai in Marathi / Hindi
OR
Task 2: Generate a word cloud from the Wikipedia page of Mumbai in English, where the word cloud will not
contain any stop words and all the words will be transformed to their base word using lemmatization.
# Import packages
import wikipedia
result= wikipedia.page("mumbai")
final_result = result.content
print(final_result)
Mumbai (English: (listen), Marathi: [ˈmumbəi]; also known as Bombay — the official n
== Etymology ==
The name Mumbai (Marathi: मुंबई, Gujarati: મુંબઈ, Hindi: मुंबई) derived from Mumbā or Mahā
The oldest known names for the city are Kakamuchee and Galajunkja; these are sometime
== History ==
Mumbai is built on what was once an archipelago of seven islands: Isle of Bombay, Par
King Bhimdev founded his kingdom in the region in the late 13th century and establish
The Mughal Empire, founded in 1526, was the dominant power in the Indian subcontinent
The Portuguese were actively involved in the foundation and growth of their Roman Cat
In accordance with the Royal Charter of 27 March 1668, England leased these islands to
By the middle of the 18th century, Bombay began to grow into a major trading town, an
From 1782 onwards, the city was reshaped with large-scale civil engineering projects
After India's independence in 1947, the territory of the Bombay Presidency retained by
Following protests during the movement in which 105 people died in clashes with the po
== Geography ==
Mumbai is on a narrow peninsula on the southwest of Salsette Island, which lies betwe
Mumbai lies at the mouth of the Ulhas River on the western coast of India, in the coa
eastern to Madh Marve on the western front. The eastern coast of Salsette Island is co
Mumbai has a tropical climate, specifically a tropical wet and dry climate (Aw) under
Mumbai is prone to monsoon floods, caused due to climate change that is affected by h
Mumbai (English: (listen), Marathi: [ˈmumbəi]; also known as Bombay — the official name
#Lemmatization
from nltk import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
word_list = word_tokenize(final_result)   # tokenize the page text, not the WikipediaPage object
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print("After the lemmatization:", lemmatized_output)
Practical No. 06
Roll No - 9070
WordNet is a large semantic network that interlinks words or groups of words by means of lexical or
conceptual relationships represented by labelled arcs. Wordnets are lexical structures composed of synsets and
semantic relations. In wordnet creation, the focus shifts from words to concepts. Each member of a
synset represents the same concept, though not all synset members are interchangeable in context.
Synsets contain definitions and example sentences that describe how their synonyms are used. The membership of
words in multiple synsets or concepts reflects polysemy, i.e. multiplicity of meaning.
There are three principles the synset construction process must adhere to:
Minimality: This principle insists on capturing the minimal set of words in the synset which
uniquely identifies the concept. For example, {family, house} uniquely identifies a concept (e.g. “he
is from the house of the King of Jaipur”).
Coverage: This principle stresses the completeness of the synset, i.e., capturing all the words
that stand for the concept expressed by the synset.
Replaceability: Within the synset, the words should be ordered according to their frequency in the
corpus. Replaceability demands that the most common words in the synset, i.e., words towards the
beginning of the synset should be able to replace one another in the example sentence associated
with the synset.
WordNet contains synsets with words coming from the critical, open class, syntactic categories
like:
a) Nouns
b) Adjectives
c) Verbs
d) Adverbs
Lexical Relationships:
Antonymy: It is a lexical relation indicating ‘opposites’. It mainly originates from descriptive
adjectives. Further, each member of a direct antonym pair is associated with some semantically
similar adjectives, e.g. the opposite of fat is thin, and the antonym of obese is also thin, since obese and fat belong to
the same synset.
Gradation: It is a lexical relation that represents possible intermediate states between two
antonyms. Eg. Morning, noon, evening.
Hypernymy and Hyponymy encode lexical relations between a more general term and specific
instances of it. They build a hierarchical tree with increasingly concrete/ particular concepts
growing out from the abstract root.
Meronymy expresses the part-of relationship. Synsets denoting parts, components or members to
synsets indicating the whole are called meronyms.
Entailment: It is a semantic relationship between two verbs. A verb C entails a verb B, if the
meaning of B follows logically and is strictly included in the meaning of C. This relation is
unidirectional. For instance, snoring entails sleeping, but sleeping does not entail snoring.
Troponymy: It is a semantic relation between two verbs when one is a specific ‘manner’ elaboration
of another.
WordNet is a tool for solving Word Sense Disambiguation. It can also be used to find abstract or
concrete concepts. Its unique semantic network helps us find word relations, synonyms, grammars,
etc. This helps support NLP tasks such as sentiment analysis, automatic language translation, text
similarity, and more.
In WordNet terminology, each group of synonyms is a synset, and a synonym that forms part of a
synset is a lexical variant of the same concept. For example, in the network above, Word of God,
Word, Scripture, Holy Writ, Holy Scripture, Good Book, Christian Bible and Bible make up the synset
that corresponds to the concept Bible, and each of these forms is a lexical variant.
The NLTK module includes the English WordNet with 155,287 words and 117,659 synonym sets.
Words
We can look up words which are part of the WordNet lexicon using the synset() and
synsets() functions.
synset = WORD.POS.NN
where:
Part of Speech (POS) — a particular part of speech (noun, verb, adjective, adverb, pronoun,
preposition, conjunction, interjection, numeral, article, or determiner), in which a word
corresponds to based on both its definition and its context.
NN — a sense key. A word can have multiple meanings or definitions. Therefore, “cake.n.03” is
the third noun sense of the word “cake”.
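The import cell is not visible in this export; the cells below assume WordNet has been imported along these lines:

import nltk
from nltk.corpus import wordnet as wn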
print(wn.synsets('dog'))
print("\n")
print(wn.synsets('run'))
print("\n")
print(wn.synset('dog.n.01'))
print("\n")
print(wn.synset('run.v.01'))
Synset('dog.n.01')
Synset('run.v.01')
Definitions and Examples
From the previous example, we can see that dog and run have several
possible contexts. To help understand the meaning of each one, we can view their definitions using
the definition() function.
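The cell itself is not preserved; a sketch consistent with the printed definitions below is:

print("dog.n.01 -- ", wn.synset('dog.n.01').definition())
print("run.n.01 -- ", wn.synset('run.n.01').definition())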
dog.n.01 -- a member of the genus Canis (probably descended from the common wolf) that
run.n.01 -- a score in baseball made by a runner touching all four bases safely
Likewise, if we needed to clarify some examples of the noun dog and verb run in context, we can
use the examples() function.
print("dog.n.01 -- ",wn.synset('dog.n.01').examples())
print("\n")
print("run.v.01 --",wn.synset('run.v.01').examples())
run.v.01 -- ["Don't run--you'll be out of breath", 'The children ran to the store']
Hypernymy and Hyponymy encode lexical relations between a more general term and specific
instances of it.
A hypernym is described a being a word that is more general than a given word. That is, it is its
superordinate term: if X is a hypernym of Y, then all Y are X. For example, animal is a hypernym of
dog.
Whereas hyponymy is the relation between two concepts, where concept B is a type of concept A.
For example, beef is a hyponym of meat.
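For instance (a small illustrative sketch, not from the original cell), the hypernyms of the first noun sense of dog can be listed directly:

dog = wn.synset('dog.n.01')
print("hypernyms of dog:", dog.hypernyms())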
print("\n")
print("\n")
#Let us determine the hyponyms of the term "cat", and store that into a variable `types_of_cats`
cat = wn.synset('cat.n.01')
types_of_cats = cat.hyponyms()
print("hyponyms of cat")
# Now, let us loop through the hyponyms and see the lemmas for each synset:
for synset in types_of_cats:
    for lemma in synset.lemmas():
        print(lemma.name())
# Example:
# Cat <- hypernym
# house_cat <- hyponym
print("house cat hypernym", wn.synset('house_cat.n.01').hypernyms())
hyponyms of cat
domestic_cat
house_cat
Felis_domesticus
Felis_catus
wildcat
Knowledge of the hypernymy and hyponymy relations is useful for tasks such as question
answering, where a model may be built to understand very general concepts, but is asked specific
questions.
Programmatically identifying accurate synonyms and antonyms is more difficult than it should be.
However, WordNet covers this quite well.
Synonyms are words or expressions of the same language which have the same or a very similar
meaning in some, or all, senses. For example, the synonyms in the WordNet network which
surround the word car are automobile, machine, motorcar, etc.
Antonymy can be defined as the lexical relation which indicates ‘opposites’. Further, each member
of a direct antonym pair is associated with some semantically similar adjectives. e.g. fat is the
opposite of thin; obese’s antonym is also thin as obese and fat belong to the same synset. Naturally,
some words do not have antonyms and other words like recommend just don’t have enough
information in WordNet.
synonyms = []
antonyms = []
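The loop that fills these two lists is not preserved; the usual pattern, assuming the word “happy” (which matches the antonym printed below), is:

for syn in wn.synsets("happy"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())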
print("synonyms",set(synonyms))
print("\n")
print("antonyms",set(antonyms))
antonyms {'unhappy'}
Meronymy expresses the ‘components-of’ relationship. That is, a relation between two concepts,
where concept A makes up a part of concept B. For meronyms, we can take advantage of two NLTK
functions:
tree = wn.synset('tree.n.01')
print(tree.part_meronyms())
print('\n')
print(tree.substance_meronyms())
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
Student task
print(wn.synsets('tiger'))
print("\n")
print(wn.synsets('hunt'))
print("\n")
print(wn.synset('tiger.n.01'))
print("\n")
print(wn.synset('hunt.v.01'))
[Synset('tiger.n.01'), Synset('tiger.n.02')]
Synset('tiger.n.01')
Synset('hunt.v.01')
print("tiger.n.01 -- ",wn.synset('tiger.n.01').examples())
print("\n")
print("hunt.v.01 --",wn.synset('hunt.v.01').examples())
tiger.n.01 -- ["he's a tiger on the tennis court", 'it aroused the tiger in me']
hunt.v.01 -- ['Goering often hunted wild boars in Poland', 'The dogs are running deer',
print("\n")
print("\n")
#Let us determine the hyponyms of the term "snake", and store that into a variable `types_of_snake`
snake = wn.synset('snake.n.01')
types_of_snake = snake.hyponyms()
print("hyponyms of snake")
# Now, let us loop through the hyponyms and see the lemmas for each synset:
for synset in types_of_snake:
    for lemma in synset.lemmas():
        print(lemma.name())
print("\n")
hyponyms of snake
blind_snake
worm_snake
colubrid_snake
colubrid
constrictor
elapid
elapid_snake
sea_snake
viper
synonyms = []
antonyms = []
print("synonyms",set(synonyms))
print("\n")
print("antonyms",set(antonyms))
tree = wn.synset('car.n.01')
print(tree.part_meronyms())
print('\n')
print(tree.substance_meronyms())
[]
Practical No. 07
The process of classifying words into their parts of speech and labeling them accordingly is known
as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word
classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
import nltk
from nltk.tokenize import word_tokenize

text = word_tokenize("Many Competitive Exams in India like UPSC Civil Services Exams, Bank Exams")
print(nltk.pos_tag(text))
[('Many', 'JJ'), ('Competitive', 'NNP'), ('Exams', 'NNP'), ('in', 'IN'), ('India', 'NNP
Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list item marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun plural (desks)
NNP proper noun, singular (sarah)
NNPS proper noun, plural (indians or americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him,himself)
PRP$ possessive pronoun (her, his, mine, my, our )
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb (ask)
VBG verb gerund (judging)
VBD verb past tense (pleaded)
VBN verb past participle (reunified)
VBP verb, present tense not 3rd person singular(wrap)
VBZ verb, present tense with 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh- pronoun (who)
WRB wh- adverb (how)
NLTK provides documentation for each tag, which can be queried using the tag, e.g.
nltk.help.upenn_tagset('RB')
nltk.help.upenn_tagset('NNP')
nltk.help.upenn_tagset('JJ')
nltk.help.upenn_tagset('NNS')
Why POS Tagging is Useful? POS tagging can be really useful, particularly if you have words or
tokens that can have multiple POS tags. For instance, the word "google" can be used as both a noun
and verb, depending upon the context. While processing natural language, it is important to identify
this difference.
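The two cells that produced the tagged output below are not preserved; a sketch consistent with that output is:

print(nltk.pos_tag(word_tokenize("Can you google it?.")))
print(nltk.pos_tag(word_tokenize("Google is a tech leader.")))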
[('Can', 'MD'), ('you', 'PRP'), ('google', 'VB'), ('it', 'PRP'), ('?', '.'), ('.', '.')]
[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'JJ'), ('leader', 'NN'), ('.',
Student Task
POS tagger for Hindi Text
import nltk
from nltk.corpus import indian
from nltk.tag import tnt

train_data = indian.tagged_sents('hindi.pos')
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)  # Training the TnT part-of-speech tagger with Hindi data
text = "भारत में कई प्रतियोगी परीक्षाओं जैसे यूपीएससी सिविल सेवा परीक्षा, बैंक परीक्षा आदि में भी निबंध लेखन पर एक"
tagged_words = tnt_pos_tagger.tag(nltk.word_tokenize(text))
print(tagged_words)
[('भारत', 'NNP'), ('में', 'PREP'), ('कई', 'QF'), ('प्रतियोगी', 'Unk'), ('परीक्षाओं', 'Unk'), ('
Practical No. 08
WordNet is a large semantic network that interlinks words or groups of words by means of lexical or
conceptual relationships represented by labelled arcs. Wordnets are lexical structures composed of synsets and
semantic relations. In wordnet creation, the focus shifts from words to concepts. Each member of a
synset represents the same concept, though not all synset members are interchangeable in context.
Synsets contain definitions and example sentences that describe how their synonyms are used. The membership of
words in multiple synsets or concepts reflects polysemy, i.e. multiplicity of meaning.
WordNet contains synsets with words coming from the critical, open class, syntactic categories
like:
a) Noun
b) Adjective
c) Verbs
d) Adverbs
WordNet is a tool for solving Word Sense Disambiguation. It can also be used to find abstract or
concrete concepts. Its unique semantic network helps us find word relations, synonyms, grammars,
etc. This helps support NLP tasks such as sentiment analysis, automatic language translation, text
similarity, and more.
In WordNet terminology, each group of synonyms is a synset, and a synonym that forms part of a
synset is a lexical variant of the same concept. For example, in the network above, Word of God,
Word, Scripture, Holy Writ, Holy Scripture, Good Book, Christian Bible and Bible make up the synset
that corresponds to the concept Bible, and each of these forms is a lexical variant.
The NLTK module includes the English WordNet with 155,287 words and 117,659 synonym sets.
Word Similarity
How Related are Two Words? Let us take the terms we have learned thus far along with what
WordNet provides to us to define some metric as to how two words are related to one another.
There are a few ways in which to calculate the similarities between words.
The path_similarity() function returns a score which denotes how similar two words are, computed by
traversing the paths that connect them in the WordNet hypernym/hyponym network: the shorter the
path between the two synsets, the higher the score.
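The code cell is not preserved in this export; a sketch consistent with the comments below (assuming the synsets 'book.n.01', 'textbook.n.01' and 'dog.n.01') is:

book = wn.synset('book.n.01')
textbook = wn.synset('textbook.n.01')
dog = wn.synset('dog.n.01')

print(book.path_similarity(textbook))   # a short path, so a high score
print(book.path_similarity(dog))        # a longer traversal, so a lower score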
# We see that "textbook" and "book" have the highest similarity possible,
# with a score of 0.5.
# We see a lower number here. This again makes sense, since the traversal
There are actually many more ways in which to define distances between words.
1. Wu-Palmer Similarity
2. Resnik Similarity
3. Jiang-Conrath Similarity
4. Lin Similarity
Wu & Palmer’s similarity calculates similarity by considering the depths of the two synsets in the
network, as well as where their most specific ancestor node (Least Common Subsumer (LCS)).
The similarity score is measured between 0 < score and ≤ 1, where 1 indicates that the words are
the same. The score can never be 0 because the depth of the LCS is never 0 (the depth of the root
of taxonomy is 1).
black = wn.synset('black.n.01')
white = wn.synset('white.n.01')
rainbow = wn.synset('rainbow.n.01')
colours = wn.synset('color.n.01')  # WordNet stores the lemma under the American spelling 'color'
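The cell that compared these synsets is not preserved; a sketch of how Wu-Palmer similarity would be queried for them is:

print(black.wup_similarity(white))
print(black.wup_similarity(rainbow))
print(black.wup_similarity(colours))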
Language is flexible, and people will use a variety of different words to describe the same thing. So,
if you had a large dataset of customer reviews and you wanted to extract those which discuss the
same aspects of the product, finding which are similar will help narrow that search.
Student Task
Perform similarity checking at Sentence level for English language.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The Moon is a barren, rocky world without air and water.",
    "It has dark lava plain on its surface. The Moon is filled with craters.",
    "It has no light of its own.",
    "It gets its light from the Sun. "
]
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(sentences)
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)