NLP Practicals All

INDEX

Sr. No.  Title                                                                                      Date  Sign
1.       Write a program to perform Tokenization and Filtration of English and Hindi Text
2.       Write a program to perform Script Validation and identify Stop Words of English and Hindi Text
3.       Write a program to perform Stemming and Lemmatization
4.       Write a program to generate n-grams (bigram, trigram, etc.) of English and Hindi Text
5.       Write a program to identify word frequency and generate word cloud of English and Hindi Text
6.       Write a program to get word definition, examples, synonyms, antonyms using English WordNet
7.       Write a program to identify Part of Speech of English Text
8.       Write a program to check Word Similarity using English WordNet

Practical No. 01

Write a program to perform Tokenization and Filtration of English and Hindi Text

Name : Ritesh Santosh Nimbalkar

Roll No : 9070

Tokenization

Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into
smaller units, such as individual words or terms. Each of these smaller units is called a token.
For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’.

Tokenization can be done to separate either words or sentences. If the text is split into words using
some separation technique it is called word tokenization, and the same separation done for sentences
is called sentence tokenization.

In the process of tokenization, some characters like punctuation marks may be discarded.

Before processing a natural language, we need to identify the words that constitute a string of
characters. That is why tokenization is the most basic step in processing NLP (text) data. This is
important because the meaning of the text can then easily be interpreted by analyzing the words
present in it.

Let’s take an example. Consider the below string:

“This is a cat.”

What do you think will happen after we perform tokenization on this string? We get [‘This’, ‘is’, ‘a’,
‘cat’].

There are numerous uses of doing this. We can use this tokenized form to:

Count the number of words in the text


Count the frequency of the word, that is, the number of times a particular word is present


White Space Tokenization is the simplest tokenization technique. Given a sentence or paragraph, it
tokenizes it into words by splitting the input whenever white space is encountered. This is the
fastest tokenization technique, but it only works for languages in which white space breaks the
sentence apart into meaningful words. Example: English, Hindi.
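
As a quick illustration, the same idea can be sketched with Python's built-in str.split(); the sample
sentences below are assumed for demonstration only:

# Minimal whitespace tokenization sketch (illustrative sample text)
text_en = "It is raining heavily today"
text_hi = "आज बहुत बारिश हो रही है"

# str.split() breaks a string on runs of whitespace
print(text_en.split())   # ['It', 'is', 'raining', 'heavily', 'today']
print(text_hi.split())   # ['आज', 'बहुत', 'बारिश', 'हो', 'रही', 'है']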

Tokenisation with NLTK

NLTK is a standard Python library with prebuilt functions and utilities for ease of use and
implementation. It is one of the most used libraries for natural language processing and
computational linguistics.

The Natural Language Toolkit has a very important module, tokenize, which further comprises the
sub-modules:

1. word tokenize
2. sentence tokenize

!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ


Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packages
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from nltk
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nl
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nlt

import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'


[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.

[nltk_data] | Downloading package bllip_wsj_no_aux to


[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.
[nltk_data] | Downloading package cmudict to /root/nltk_data...
[nltk_data] | Unzipping corpora/cmudict.zip.
[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk_data] | Downloading package conll2007 to /root/nltk_data...
[nltk_data] | Downloading package crubadan to /root/nltk_data...
[nltk_data] | Unzipping corpora/crubadan.zip.
[nltk_data] | Downloading package dependency_treebank to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/dependency_treebank.zip.
[nltk_data] | Downloading package dolch to /root/nltk_data...
[nltk_data] | Unzipping corpora/dolch.zip.
[nltk_data] | Downloading package europarl_raw to

'''
Tokenization of words
We use the method word_tokenize() to split a sentence into words. '''

from nltk.tokenize import word_tokenize


text = "Hello All, This is first practical session in NLP for MSC IT Part 2. MES New Panvel. A
print("word_tokenize",word_tokenize(text))

from nltk.tokenize import TreebankWordTokenizer


#tokenizers work by separating the words using punctuation and spaces.
tokenizer = TreebankWordTokenizer()
print("TreebankWordTokenizer",tokenizer.tokenize(text))

from nltk.tokenize import WordPunctTokenizer


#It separates the punctuation from the words.


tokenizer = WordPunctTokenizer()
print("WordPunctTokenizer",tokenizer.tokenize(text) )

#Multi-Word Expression Tokenizer (MWETokenizer): an MWETokenizer merges multi-word expressions (given as a list) into single tokens.


from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer([('New', 'Panvel')],separator=' ')

print("MWETokenizer",tokenizer.tokenize(text.split()))

word_tokenize ['Hello', 'All', ',', 'This', 'is', 'first', 'practical', 'session', 'in',
TreebankWordTokenizer ['Hello', 'All', ',', 'This', 'is', 'first', 'practical', 'session
WordPunctTokenizer ['Hello', 'All', ',', 'This', 'is', 'first', 'practical', 'session',
MWETokenizer ['Hello', 'All,', 'This', 'is', 'first', 'practical', 'session', 'in', 'NLP

'''
Tokenization of sentences
We use the method sent_tokenize() to split a paragraph into sentences. '''

from nltk.tokenize import sent_tokenize


textHindi = "एनएलपी बढ़िया है! मैंने एक मुक्त कोर्टेरा कपोन जीता ! चलो एनएलपी का अध्ययन शुरू करते हैं!"
text = "NLP is Great. I won a free Coursera cupon. Lets start studying NLP."
print(sent_tokenize(textHindi))
print(sent_tokenize(text))
'''
The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module,
which has already been trained and thus knows very well how to mark the end and beginning of sentences using
characters and punctuation.
'''

['एनएलपी बढ़िया है!', 'मैंने एक मुक्त कोर्टेरा कपोन जीता !', 'चलो एनएलपी का अध्ययन शुरू करते हैं
['NLP is Great.', 'I won a free Coursera cupon.', 'Lets start studying NLP.']
'\nThe sent_tokenize function uses an instance of PunktSentenceTokenizer from the
nltk.tokenize.punkt module, \nwhich is already been trained and thus very well kn

Student Task:
Write a code to demonstrate Tokenization at word and sentence
level in Hindi Language

from nltk.tokenize import word_tokenize


name = "मेरा नाम रितेश है, मैं पिल्लई कॉलेज में पढ़ता हूँ"
word_tokenize(name)


['मेरा',
'नाम',
'रितेश',
'है',
',',
'मैं',
'पिल्लई',
'कॉलेज',
'में',
'पढ़ता',
'हूँ']

Filtration

One of the key steps in processing language data is to remove noise so that the machine can more
easily detect the patterns in the data. Text data contains a lot of noise, which takes the form of
special characters such as hashtags, punctuation and numbers, all of which are difficult for
computers to understand if they are present in the data. We therefore need to process the data to
remove these elements. Additionally, it is important to pay some attention to the casing of
words: if we include both upper-case and lower-case versions of the same word, the computer
will see these as different entities, even though they are the same.

def filter_text(inText,lowerFlag=False,upperFlag=False,numberFlag=False,htmlFlag=False,urlFlag=False,punctFlag=False,spaceFlag=False,hashtagFlag=False,emojiFlag=False):
if lowerFlag:
inText = inText.lower()

if upperFlag:
inText = inText.upper()

if numberFlag:
import re
inText = re.sub(r"\d+", '', inText)

if htmlFlag:
import re
inText = re.sub(r'<[^>]*>', '', inText)

if urlFlag:
import re
inText = re.sub(r'(https?|ftp|www)\S+', '', inText)

if punctFlag:
import re
import string
exclist = string.punctuation #removes [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]
# remove punctuations and digits from oldtext

table_ = inText.maketrans('', '', exclist)


inText = inText.translate(table_)

if spaceFlag:
import re
inText = re.sub(' +'," ",inText).strip()

if hashtagFlag:
pass
# Students Task
if emojiFlag:
pass
# Students Task
return inText

usrText = input()

Hello Everyone !!! I am a #developer. "I work as a Python Developer".

filter_text(usrText, urlFlag=True)

'Hello Everyone !!! I am a #developer. "I work as a Python Developer".'

filter_text(usrText, punctFlag=True,spaceFlag=True)

'Hello Everyone I am a developer I work as a Python Developer'

filter_text(usrText, lowerFlag= True)

'hello everyone !!! i am a #developer. "i work as a python developer".'

Student Task:
Modify the above code to demonstrate filtration of hashtag words and certain emojis, and keep
certain punctuation if it joins two words

def filter_text(inText,lowerFlag=False,upperFlag=False,numberFlag=False,htmlFlag=False,urlFlag=False,hashtagFlag=False,emojiFlag=False):
  import re
  if hashtagFlag:
    # remove whole hashtag words such as #developer
    inText = re.sub(r"#\w+", '', inText)
  if emojiFlag:
    # remove common emoji code-point ranges (symbols, pictographs, emoticons)
    inText = re.sub(r'[\U0001F300-\U0001FAFF\u2600-\u27BF]', '', inText)

  return inText


usrText = input()

Hello Everyone !!! I am a #developer. "I work as a Python Developer".


Practical No. 02

Write a program to perform Script Validation and identify Stop Words of English and
Hindi Text

Name :Ritesh Santosh Nimbalkar

Roll No: 9070

Script Validation

In script validation, foreign words (words which don't belong to the required input language) are detected and removed. In the sentence
“विदेशी को हटाना hoga आज” the word “hoga” is a Hindi word written using English (Latin) characters. During script validation, depending on the NLP
application requirement, the word “hoga” will either be removed or transliterated into the Devanagari script as “होगा”.

!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nltk) (1.2.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packages (from nltk) (2022.6.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from nltk) (4.65.0)
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nltk) (8.1.3)

import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'


[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.

[nltk_data] | Downloading package cmudict to /root/nltk_data...
[nltk_data] | Unzipping corpora/cmudict.zip.
[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk_data] | Downloading package conll2007 to /root/nltk_data...
[nltk_data] | Downloading package crubadan to /root/nltk_data...
[nltk_data] | Unzipping corpora/crubadan.zip.
[nltk_data] | Downloading package dependency_treebank to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/dependency_treebank.zip.
[nltk_data] | Downloading package dolch to /root/nltk_data...
[nltk_data] | Unzipping corpora/dolch.zip.
"""
For Script validation, we use Unicodes of the characters """

def detectLang(inText,charFlag=False,wordFlag=False,sentenceFlag=False,lang="EN"):
if charFlag:
if len(inText)==1 and lang == "EN":
if ord(inText) in list(range(65,123)):
return "EN"
if len(inText)==1 and lang == "HI":
if ord(inText) in list(range(2304,2432)):
return "HI"

if wordFlag:
if len(inText)>1 and lang == "EN":
for x in inText:
if ord(x) not in list(range(65,123)):
return "Not Found"
return "EN"
if len(inText)>1 and lang == "HI":
for x in inText:
if ord(x) not in list(range(2304,2432)):
return "Not Found"
return "HI"

if sentenceFlag:
pass

return "Not Found"

#https://en.wikipedia.org/wiki/List_of_Unicode_characters
#https://jrgraphix.net/r/Unicode/0020-007F

detectLang("प्राकृ तिक प्रतिस्पर्धा",wordFlag=True,lang="HI")

'Not Found'

detectLang("हेलो",wordFlag=True,lang="EN")

'Not Found'

detectLang("ह",charFlag=True,lang="HI")

'HI'

detectLang("H",charFlag=True,lang="EN")

'EN'

Student Task:
Modify the above code to demonstrate Sentence level script validation for Hindi and English language
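
One possible sketch for this task (an assumption, not the reference solution): treat a sentence as
belonging to a language only if every word in it passes the word-level Unicode-range check
implemented in detectLang above.

# Illustrative sentence-level script validation built on top of detectLang
def detect_sentence_lang(sentence, lang="EN"):
    for word in sentence.split():
        word = word.strip('.,!?|।')                 # ignore common punctuation
        if not word:
            continue
        if detectLang(word, wordFlag=True, lang=lang) != lang:
            return "Not Found"
    return lang

print(detect_sentence_lang("This is one valid English sentence", lang="EN"))   # EN
print(detect_sentence_lang("विदेशी को हटाना hoga आज", lang="HI"))                # Not Found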



Stopwords

The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-
processing is to filter out useless data. In natural language processing, useless words (data) are referred to as stop words.

What are Stop words?

Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both
when indexing entries for searching and when retrieving them as the result of a search query.

We would not want these words to take up space in our database or take up valuable processing time. For this, we can remove them easily
by storing a list of words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has stopword lists stored for 16
different languages. You can find them in the nltk_data directory; home/pratima/nltk_data/corpora/stopwords is the directory address (do not
forget to change the home directory name to your own).
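
As a quick check, the available stopword languages and a few English entries can be listed like this (a small illustrative snippet):

from nltk.corpus import stopwords

# languages for which NLTK ships a stopword list
print(stopwords.fileids())
# first few English stop words
print(stopwords.words('english')[:10])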

The following program tokenizes the sentence, identifies and removes stop words from a piece of text.


import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'


[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Package abc is already up-to-date!
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Package alpino is already up-to-date!
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package averaged_perceptron_tagger is already up-
[nltk_data] | to-date!
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package averaged_perceptron_tagger_ru is already
[nltk_data] | up-to-date!
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package basque_grammars is already up-to-date!
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Package bcp47 is already up-to-date!
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package biocreative_ppi is already up-to-date!
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package bllip_wsj_no_aux is already up-to-date!
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package book_grammars is already up-to-date!
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Package brown is already up-to-date!
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Package brown_tei is already up-to-date!
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Package cess_cat is already up-to-date!
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Package cess_esp is already up-to-date!
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Package chat80 is already up-to-date!
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package city_database is already up-to-date!
[nltk_data] | Downloading package cmudict to /root/nltk_data...
[nltk_data] | Package cmudict is already up-to-date!
[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Package comparative_sentences is already up-to-
[nltk_data] | date!
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Package comtrans is already up-to-date!
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Package conll2000 is already up-to-date!
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Package conll2002 is already up-to-date!
[nltk_data] | Downloading package conll2007 to /root/nltk_data...
[nltk_data] | Package conll2007 is already up-to-date!

[nltk_data] | Downloading package crubadan to /root/nltk_data...
[nltk_data] | Package crubadan is already up-to-date!
"""
Stop Word Identification """

from nltk.corpus import stopwords


from nltk.tokenize import word_tokenize, sent_tokenize

txt = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the inter

#print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))
#print(stop_words)
word_tokens = word_tokenize(txt)

print("Sentence is:", txt,"\n")


print("Tokens in the above sentence:", word_tokens,"\n")
stop = [w for w in stop_words if w in word_tokens]
print("StopWords recognized in the given sentence:", stop,"\n")

filtered_sentence = [w for w in word_tokens if not w in stop_words]


filtered_sentence = []

for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)

print("After removing the recognized stopwords, the Tokens of sentence is:", filtered_sentence)

Sentence is: Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned wit

Tokens in the above sentence: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'c

StopWords recognized in the given sentence: ['and', 'then', 'in', 'a', 'of', 'them', 'with', 'how', 'is', 'between', 'can', 'the', 'to',

After removing the recognized stopwords, the Tokens of sentence is: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield',


Student Task:
Identify stop words at sentence level for the Hindi language

"""
Stop Word Identification hindi"""

from nltk.tokenize import word_tokenize, sent_tokenize


stopwords = ["रहे","थी","थे","होना","गया","है","पडा","होने","करना","किया","रही","लेकिन","जाता","अगर","या","क्यूंकि","की","पर","साथ","किया","ऊपर","नीचे","

txt = "राष्ट्र भाषा हिंदी हमारे राष्ट्र की शान है। भारत की समानता की धुरी है। भारत की संस्कृ ति और सभ्यता की मूल चेतना को शुद्धता से अभिव्यक्त करने का माध्यम है । राष्ट्रीय विचारों

#print(stopwords.words('english'))
#stop_words = set(stopwords.words('english'))
#print(stop_words)
word_tokens = word_tokenize(txt)

print("Sentence is:", txt,"\n")


print("Tokens in the above sentence:", word_tokens,"\n")
stop = [w for w in stopwords if w in word_tokens]
print("StopWords recognized in the given sentence:", stop,"\n")

filtered_sentence = [w for w in word_tokens if not w in stopwords]


filtered_sentence = []

for w in word_tokens:
if w not in stopwords:
filtered_sentence.append(w)


print("After removing the recognized stopwords, the Tokens of sentence is:", filtered_sentence)

Sentence is: राष्ट्र भाषा हिंदी हमारे राष्ट्र की शान है। भारत की समानता की धुरी है। भारत की संस्कृ ति और सभ्यता की मूल चेतना को शुद्धता से अभिव्यक्त करने का माध्यम है

Tokens in the above sentence: ['राष्ट्र भाषा', 'हिंदी', 'हमारे ', 'राष्ट्र ', 'की', 'शान', 'है।', 'भारत', 'की', 'समानता', 'की', 'धुरी', 'है।', 'भारत', 'की',

StopWords recognized in the given sentence: ['है', 'होने', 'की', 'को', 'से', 'और', 'करने']

After removing the recognized stopwords, the Tokens of sentence is: ['राष्ट्र भाषा', 'हिंदी', 'हमारे ', 'राष्ट्र ', 'शान', 'है।', 'भारत', 'समानता', 'धुरी',


Practical No. 03

Write a program to identify Stem and Lemma of English and Hindi Text

Name: Ritesh Santosh Nimbalkar

Roll no: 9070

Stemming

import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'


[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.
[nltk_data] | Downloading package cmudict to /root/nltk_data...
[nltk_data] | Unzipping corpora/cmudict.zip.
[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk_data] | Downloading package conll2007 to /root/nltk_data...
[nltk_data] | Downloading package crubadan to /root/nltk_data...
[nltk_data] | Unzipping corpora/crubadan.zip.
[nltk_data] | Downloading package dependency_treebank to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/dependency_treebank.zip.
[nltk_data] | Downloading package dolch to /root/nltk_data...
[nltk_data] | Unzipping corpora/dolch.zip.
[nltk_data] | Downloading package europarl_raw to
|


# import these modules


from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

# choose some words to be stemmed


words = ["program", "programs", "programmer", "programming", "programmers"]

for w in words:
print(w, " : ", ps.stem(w))

program : program
programs : program
programmer : programm
programming : program
programmers : programm

# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence = "We wake up early in the morning, and do some good work." "Programmers program with programming languages." "People comes to consu
words = word_tokenize(sentence)

for w in words:
print(w, " : ", ps.stem(w))

We : we
wake : wake
up : up
early : earli
in : in
the : the
morning : morn
, : ,
and : and
do : do
some : some
good : good
work.Programmers : work.programm
program : program
with : with
programming : program
languages.People : languages.peopl
comes : come
to : to
consultants : consult
office : offic
to : to
consult : consult
the : the
consultant : consult
. : .

Lemmatization

from nltk import wordnet


from nltk.stem.wordnet import WordNetLemmatizer

# import these modules

#from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))


print("corpora :", lemmatizer.lemmatize("corpora"))

# a denotes adjective in "pos"


print("better :", lemmatizer.lemmatize("better", pos ="a"))
print("are :", lemmatizer.lemmatize("are", pos ="r"))


print("\n\n")
sentence = "We like dancing. The dance was really great." "We wake up early in the morning, and do some good work." "Programmers program with
words = word_tokenize(sentence)

for w in words:
print(w, " : ", lemmatizer.lemmatize(w, pos='v'))

rocks : rock
corpora : corpus
better : good
are : are

We : We
like : like
dancing : dance
. : .
The : The
dance : dance
was : be
really : really
great.We : great.We
wake : wake
up : up
early : early
in : in
the : the
morning : morning
, : ,
and : and
do : do
some : some
good : good
work.Programmers : work.Programmers
program : program
with : with
programming : program
languages : languages
. : .

#############################STUDENT TASK############################
#Perform sentence-level Stemming and Lemmatization for the Hindi and English languages.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
porter = PorterStemmer()
lemmatizer=WordNetLemmatizer()
sentence="We like singing a song. The song was beautifully written." #word_list = ["we", "like", "singing", "a", "song"]
words=word_tokenize(sentence)
print("{0:20}{1:20}{2:20}".format("ORIGINAL","STEMMING","LEMMATIZATION"))
for word in words:
print("{0:20}{1:20}{2:20}".format(word,porter.stem(word),lemmatizer.lemmatize(word,pos="v")))

ORIGINAL STEMMING LEMMATIZATION


We we We
like like like
singing sing sing
a a a
song song song
. . .
The the The
song song song
was wa be
beautifully beauti beautifully
written written write
. . .


Practical No. 04

Write a program to generate n-grams (bigram, trigram, etc.) of English and Hindi Text

Name: Ritesh Santosh Nimbalkar

Roll no: 9070

What Are N-Grams?


N-Grams are words, or combinations of words, broken out by the number of words in that
combination. As an outline:

Unigrams: one word


Bigrams: two words
Trigrams: three words
And so forth

To further explore n-grams, we can break down the sentence below:


Hi there everyone, we’re exploring n-grams today.

Unigram: hi | there | everyone, etc…


Bigram: hi there | exploring n-grams | etc…
Trigram: hi there everyone | exploring n-grams today | etc…
Note that the words must follow sequentially to be an n-gram.


Imagine listening to someone as they speak and trying to guess the next word that they are
going to say. For example, what word is likely to follow these sentence fragments: "I'd like to
make a . . ." or "Please hand over your . . ."?

Guessing the next word (or word prediction) is an essential subtask of speech recognition,
hand-writing recognition, augmentative communication for the disabled, and spelling error
detection. In such tasks, word identification is difficult because the input is very noisy and
ambiguous. Looking at previous words can therefore give us an important cue about what the
next ones are going to be.

N-gram models predict the next word from the previous N − 1 words. Such statistical models of
word sequences are also called language models or LMs. Computing the probability of the next
word turns out to be closely related to computing the probability of a sequence of words. The
following sequence, for example, has a non-zero probability of appearing in a text: ". . . all
of a sudden I notice three guys standing on the sidewalk . . .", while this same set of words
in a different order has a much, much lower probability: "on guys all I of notice sidewalk
three a sudden standing the".

N-gram models can also help with spelling error correction. For instance, the sentence "drink
cofee" could be corrected to "drink coffee" if you knew that the word "coffee" has a high
probability of occurrence after the word "drink", and the overlap of letters between "cofee"
and "coffee" is high.

Let's start with the probability P(w|h), the probability of word w given some history h. For
example, P(the | its water is so transparent that). Here,
w = the
h = its water is so transparent that

One way to estimate this probability is the relative frequency count approach: take a
substantially large corpus, count the number of times you see "its water is so transparent
that", and count the number of times it is followed by "the". In other words, you are answering
the question: out of the times you saw the history h, how many times did the word w follow it?

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

As you can imagine, it is not feasible to perform this over an entire corpus, especially if it
is of a significant size. This shortcoming, and ways to decompose the probability function
using the chain rule, serve as the base intuition of the N-gram model. Here, instead of
computing the probability using the entire corpus, you approximate it using just a few
historical words.


!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ


Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nlt
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packages
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from nltk
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nl

import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'


[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.
[nltk_data] | Downloading package cmudict to /root/nltk_data...


[nltk_data] | Unzipping corpora/cmudict.zip.


[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk_data] | Downloading package conll2007 to /root/nltk_data...
[nltk_data] | Downloading package crubadan to /root/nltk_data...
[nltk_data] | Unzipping corpora/crubadan.zip.
[nltk_data] | Downloading package dependency_treebank to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/dependency_treebank.zip.
[nltk_data] | Downloading package dolch to /root/nltk_data...
[nltk_data] | Unzipping corpora/dolch.zip.
[nltk_data] | Downloading package europarl_raw to

import nltk
from nltk.util import ngrams

# Function to generate n-grams from sentences.


def extract_ngrams(data, num):
n_grams = ngrams(nltk.word_tokenize(data), num)
return [ ' '.join(grams) for grams in n_grams]

data = 'An n-gram is a contiguous sequence of n items from a given sample of text or speech.'

print("1-gram: ", extract_ngrams(data, 1))


print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

1-gram: ['An', 'n-gram', 'is', 'a', 'contiguous', 'sequence', 'of', 'n', 'items', 'from
2-gram: ['An n-gram', 'n-gram is', 'is a', 'a contiguous', 'contiguous sequence', 'sequ
3-gram: ['An n-gram is', 'n-gram is a', 'is a contiguous', 'a contiguous sequence', 'co
4-gram: ['An n-gram is a', 'n-gram is a contiguous', 'is a contiguous sequence', 'a con

data = 'शब्दों या वर्णों का एक अनुक्रमिक अवयव हो सकता है | '

print("1-gram: ", extract_ngrams(data, 1))


print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

1-gram: ['शब्दों', 'या', 'वर्णों', 'का', 'एक', 'अनुक्रमिक', 'अवयव', 'हो', 'सकता', 'है', '|']
2-gram: ['शब्दों या', 'या वर्णों', 'वर्णों का', 'का एक', 'एक अनुक्रमिक', 'अनुक्रमिक अवयव', 'अवयव
3-gram: ['शब्दों या वर्णों', 'या वर्णों का', 'वर्णों का एक', 'का एक अनुक्रमिक', 'एक अनुक्रमिक अवयव'
4-gram: ['शब्दों या वर्णों का', 'या वर्णों का एक', 'वर्णों का एक अनुक्रमिक', 'का एक अनुक्रमिक अवयव'


What is a Language Model in NLP?

A language model learns to predict the probability of a sequence of words. But why do we need to
learn the probability of words? Let’s understand that with an example.

One of the uses of a language model is in Machine Translation: you take in a bunch of words from one
language and convert these words into another language. Now, there can be many potential
translations that a system might give you, and you will want to compute the probability of each of
these translations to understand which one is the most accurate.

Intuitively, we know that the probability of a fluent translation will be higher than that of a
garbled one, and that is how we arrive at the right translation.

This ability to model the rules of a language as a probability gives great power for NLP related
tasks. Language models are used in speech recognition, machine translation, part-of-speech
tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval, and
many other daily tasks.

There are primarily two types of Language Models:

Statistical Language Models: These models use traditional statistical techniques like N-
grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability
distribution of words
Neural Language Models: These are new players in the NLP town and have surpassed the
statistical language models in their effectiveness. They use different kinds of Neural
Networks to model language

How do N-gram Language Models work?

An N-gram language model predicts the probability of a given N-gram within any sequence of words
in the language. If we have a good N-gram model, we can predict p(w | h) – what is the probability
of seeing the word w given a history of previous words h – where the history contains n-1 words.

We must estimate this probability to construct an N-gram model.

We compute this probability in two steps:

1. Apply the chain rule of probability


2. We then apply a very strong simplification assumption to allow us to compute p(w1...wn) in an
easy manner


The chain rule of probability is:

p(w1...wn) = p(w1) . p(w2 | w1) . p(w3 | w1 w2) . p(w4 | w1 w2 w3) ..... p(wn | w1...wn-1)

So what is the chain rule? It tells us how to compute the joint probability of a sequence by using the
conditional probability of a word given previous words.

But we do not have access to these conditional probabilities with complex conditions of up to n-1
words. So how do we proceed?

This is where we introduce a simplification assumption. We can assume for all conditions, that:

p(wk | w1...wk-1) = p(wk | wk-1)

Here, we approximate the history (the context) of the word wk by looking only at the last word of the
context. This assumption is called the Markov assumption. (We used it here with a simplified
context of length 1 – which corresponds to a bigram model – we could use larger fixed-sized
histories in general).
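
With the Markov assumption in place, each conditional probability can then be estimated from corpus
counts using the maximum likelihood estimate; for a bigram, for example:

p(wk | wk-1) = C(wk-1 wk) / C(wk-1)

that is, the number of times the bigram occurs divided by the number of times its first word occurs.
The trigram model built below does exactly this, using a history of two words instead of one.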

Building a Basic Language Model

Now that we understand what an N-gram is, let's build a basic language model using trigrams of the
Reuters corpus. The Reuters corpus is a collection of 10,788 news documents totaling 1.3 million
words. We can build a language model in a few lines of code using the NLTK package:

from nltk.corpus import reuters


from nltk import bigrams, trigrams
from collections import Counter, defaultdict

# Create a placeholder for model


model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence


for sentence in reuters.sents():
for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
model[(w1, w2)][w3] += 1

# Let's transform the counts to probabilities


for w1_w2 in model:
total_count = float(sum(model[w1_w2].values()))
for w3 in model[w1_w2]:
model[w1_w2][w3] /= total_count

In the above code, we first split our text into trigrams with the help of NLTK and then calculate the
frequency with which each combination of trigrams occurs in the dataset.

We then use these counts to calculate the probability of a word given the previous two words. That's essentially
what gives us our Language Model!


Let’s make simple predictions with this language model. We will start with two simple words –
“today the”. We want our model to tell us what will be the next word:

model["today","the"]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
{'public': 0.05555555555555555,
'European': 0.05555555555555555,
'Bank': 0.05555555555555555,
'price': 0.1111111111111111,
'emirate': 0.05555555555555555,
'overseas': 0.05555555555555555,
'newspaper': 0.05555555555555555,
'company': 0.16666666666666666,
'Turkish': 0.05555555555555555,
'increase': 0.05555555555555555,
'options': 0.05555555555555555,
'Higher': 0.05555555555555555,
'pound': 0.05555555555555555,
'Italian': 0.05555555555555555,
'time': 0.05555555555555555})

So we get predictions of all the possible words that can come next with their respective
probabilities. Now, if we pick up the word “price” and again make a prediction for the words “the”
and “price”:

dict(model["the","price"])

{'yesterday': 0.004651162790697674,
'of': 0.3209302325581395,
'it': 0.05581395348837209,
'effect': 0.004651162790697674,
'cut': 0.009302325581395349,
'for': 0.05116279069767442,
'paid': 0.013953488372093023,
'to': 0.05581395348837209,
'increases': 0.013953488372093023,
'used': 0.004651162790697674,
'climate': 0.004651162790697674,
'.': 0.023255813953488372,
'cuts': 0.009302325581395349,
'reductions': 0.004651162790697674,
'limit': 0.004651162790697674,
'now': 0.004651162790697674,
'moved': 0.004651162790697674,
'per': 0.013953488372093023,
'adjustments': 0.004651162790697674,
'(': 0.009302325581395349,
'slumped': 0.004651162790697674,
'is': 0.018604651162790697,
'move': 0.004651162790697674,


'evolution': 0.004651162790697674,
'differentials': 0.009302325581395349,
'went': 0.004651162790697674,
'the': 0.013953488372093023,
'factor': 0.004651162790697674,
'Royal': 0.004651162790697674,
',': 0.018604651162790697,
'again': 0.004651162790697674,
'changes': 0.004651162790697674,
'holds': 0.004651162790697674,
'has': 0.009302325581395349,
'fall': 0.004651162790697674,
'-': 0.004651162790697674,
'from': 0.004651162790697674,
'base': 0.004651162790697674,
'on': 0.004651162790697674,
'review': 0.004651162790697674,
'while': 0.004651162790697674,
'collapse': 0.004651162790697674,
'being': 0.004651162790697674,
'at': 0.023255813953488372,
'outlook': 0.004651162790697674,
'rises': 0.004651162790697674,
'drop': 0.004651162790697674,
'guaranteed': 0.004651162790697674,
',"': 0.004651162790697674,
'stayed': 0.009302325581395349,
'structure': 0.004651162790697674,
'and': 0.004651162790697674,
'could': 0.004651162790697674,
'related': 0.004651162790697674,
'hike': 0.004651162790697674,
'we': 0.004651162790697674,
'adjustment': 0.023255813953488372,
'policy': 0.004651162790697674,
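
As a small follow-up sketch (an illustration, not part of the original notebook), the same model
dictionary can also be used to generate text by repeatedly sampling the next word from these
conditional probabilities:

import random

# Illustrative: sample a continuation from the trigram model built above
def generate_text(model, seed=("today", "the"), max_words=15):
    text = list(seed)
    for _ in range(max_words):
        candidates = model.get(tuple(text[-2:]), {})
        if not candidates:
            break
        words, probs = zip(*candidates.items())
        next_word = random.choices(words, weights=probs)[0]
        if next_word is None:          # None is the sentence-padding marker
            break
        text.append(next_word)
    return " ".join(text)

print(generate_text(model))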

Limitations of the N-gram approach to Language Modeling

N-gram based language models do have a few drawbacks:

The higher the N, the better the model usually is. But this leads to a lot of computation overhead that
requires large computation power in terms of RAM.

N-grams are a sparse representation of language. This is because we build the model based on the
probability of words co-occurring, so it will give zero probability to all the words that are not
present in the training corpus.

Ref: https://www.seerinteractive.com/blog/what-are-ngrams-and-uses-case/

https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/


Practical No. 05

Write a program to identify word frequency and generate word cloud of English and Hindi Text

Name : Ritesh Santosh Nimbalkar

Roll No: 9070

In computational linguistics, a frequency list is a sorted list of words (word types) together with
their frequency, where frequency here usually means the number of occurrences in a given corpus,
from which the rank can be derived as the position in the list.

A frequency distribution is an overview of all distinct values in some variable and the number of
times they occur. That is, a frequency distribution tells how frequencies are distributed over values.
Frequency distributions are mostly used for summarizing categorical variables.

Frequency Distribution: values and their frequency (how often each value occurs).

Example: Newspapers. These are the numbers of newspapers sold at a local shop over the last 10
days:

22, 20, 18, 23, 20, 25, 22, 20, 18, 20

Let us count how many of each number there is:

Papers Sold   Frequency
18            2
19            0
20            4
21            0
22            2
23            1
24            0
25            1

It is also possible to group the values. Here they are grouped in 5s:

Papers Sold   Frequency
15-19         2
20-24         7
25-29         1

A frequency distribution for the outcomes of an experiment. A frequency distribution records the
number of times each outcome of an experiment has occurred. For example, a frequency
distribution could be used to record the frequency of each word type in a document. Formally, a
frequency distribution can be defined as a function mapping from each sample to the number of
times that sample occurred as an outcome.

Frequency distributions are generally constructed by running a number of experiments, and
incrementing the count for a sample every time it is an outcome of an experiment. For example, the


following code will produce a frequency distribution that encodes how often each word occurs in a
text:

!pip install nltk


import nltk
nltk.download('all')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/p


Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from n
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packag
[nltk_data] Downloading collection 'all'
[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.
[nltk_data] | Downloading package cmudict to /root/nltk_data...
[nltk_data] | Unzipping corpora/cmudict.zip.


[nltk_data] | Downloading package comparative_sentences to


[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk_data] | Downloading package conll2007 to /root/nltk_data...

#http://www.nltk.org/api/nltk.html?highlight=freqdist
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
sent = 'This is an example Sentence of another a SENTENCE which is a Sentence The sentence wh
fdist = FreqDist()
for word in word_tokenize(sent):
fdist[word.lower()] += 1

fdist.pprint()

FreqDist({'sentence': 4, 'is': 2, 'a': 2, 'which': 2, 'this': 1, 'an': 1, 'example': 1,

fdist.N()
#Return the total number of sample outcomes that have been recorded by this FreqDist.

19

fdist.plot()
#Plot samples from the frequency distribution displaying the most frequent sample first. If a

<AxesSubplot:xlabel='Samples', ylabel='Counts'>


fdist.tabulate()
#Tabulate the given samples from the frequency distribution (cumulative), displaying the most

sentence is a which this an example of another th


4 2 2 2 1 1 1 1 1

# To find the 4 most common words and their frequencies


fdist.most_common(4)

[('sentence', 4), ('is', 2), ('a', 2), ('which', 2)]

Word cloud is a technique for visualising frequent words in a text where the size of the words
represents their frequency.

Word Cloud is a data visualization technique used for representing text data in which the size of
each word indicates its frequency or importance. Significant textual data points can be highlighted
using a word cloud. Word clouds are widely used for analyzing data from social network websites.

Word Clouds are a popular way of displaying how important words are in a collection of texts.
Basically, the more frequent the word is, the greater space it occupies in the image. One of the uses
of Word Clouds is to help us get an intuition about what the collection of texts is about. Here are
some classic examples of when Word Clouds can be useful:

Take a quick peek at the word distribution of a collection of texts


When cleaning the texts, see which frequent stopwords you may want to filter out
See the differences in frequent words between two or more collections of texts

Let's suppose you want to build a text classification system. If you want to see which words are frequent in each of the categories, you would build a Word Cloud for each category and inspect the most popular words inside it, as in the small sketch below.

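As a small illustration of that idea, the sketch below builds one word cloud per category. The two category texts are made up purely for demonstration, and the wordcloud and matplotlib packages installed in the next cell are assumed to be available.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Hypothetical labelled texts -- in practice these would come from your dataset
categories = {
    "sports": "match goal team score win league player match goal team",
    "finance": "market stock price bank profit share market price stock",
}

fig, axes = plt.subplots(1, len(categories), figsize=(10, 5))
for ax, (name, text) in zip(axes, categories.items()):
    wc = WordCloud(width=400, height=400, background_color='white').generate(text)
    ax.imshow(wc)          # draw the cloud for this category
    ax.set_title(name)
    ax.axis("off")
plt.show()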
!pip install matplotlib


!pip install pandas
!pip install wordcloud
!pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ


Requirement already satisfied: matplotlib in /usr/local/lib/python3.9/dist-packages (3.5
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/dist-pac
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.9/dist-packages (
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/dist-packages (f
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.9/dist-package
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.9/dist-packages (fr
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ


Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.3.5)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.9/dist-packages (f
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.9/dist-p
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.9/dist-packages (
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ
Requirement already satisfied: wordcloud in /usr/local/lib/python3.9/dist-packages (1.8
Requirement already satisfied: matplotlib in /usr/local/lib/python3.9/dist-packages (fro
Requirement already satisfied: pillow in /usr/local/lib/python3.9/dist-packages (from wo
Requirement already satisfied: numpy>=1.6.1 in /usr/local/lib/python3.9/dist-packages (f
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.9/dist-package
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/dist-pac
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/dist-packages (f
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ
Collecting wikipedia
Downloading wikipedia-1.4.0.tar.gz (27 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.9/dist-packages
Requirement already satisfied: requests<3.0.0,>=2.0.0 in /usr/local/lib/python3.9/dist-p
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-pa
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (f
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packa
Building wheels for collected packages: wikipedia
Building wheel for wikipedia (setup.py) ... done
Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha2
Stored in directory: /root/.cache/pip/wheels/c2/46/f4/caa1bee71096d7b0cdca2f2a2af45cac
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0

# Import packages
import wikipedia
result= wikipedia.page("natural")
final_result = result.content
print(final_result)

Nature, in the broadest sense, is the physical world or universe. "Nature" can refer t
The concept of nature as a whole, the physical universe, is one of several expansions
During the advent of modern scientific method in the last several centuries, nature b

== Earth ==

Earth is the only planet known to support life, and its natural features are the subj
Earth has evolved through geological and biological processes that have left traces of
The atmospheric conditions have been significantly altered from the original conditio


=== Geology ===

Geology is the science and study of the solid and liquid matter that constitutes the

==== Geological evolution ====

The geology of an area evolves through time as rock units are deposited and inserted
Rock units are first emplaced either by deposition onto the surface or intrude into t
After the initial sequence of rocks has been deposited, the rock units can be deforme

=== Historical perspective ===

Earth is estimated to have formed 4.54 billion years ago from the solar nebula, along

Continents formed, then broke up and reformed as the surface of Earth reshaped over h
The present era is classified as part of a mass extinction event, the Holocene extinct

== Atmosphere, climate, and weather ==

The Earth's atmosphere is a key factor in sustaining the ecosystem. The thin layer of
Terrestrial weather occurs almost exclusively in the lower part of the atmosphere, an

Weather can have both beneficial and harmful effects. Extremes in weather, such as to
Climate is a measure of the long-term trends in the weather. Various factors are know

The climate of a region depends on a number of factors, especially latitude. A latitu


Weather is a chaotic system that is readily modified by small changes to the environm

== Water on the Earth ==

Water is a chemical substance that is composed of hydrogen and oxygen (H2O) and is vit

=== Oceans ===

An ocean is a major body of saline water, and a principal component of the hydrospher

=== Lakes ===

A lake (from Latin word lacus) is a terrain feature (or physical feature), a body of

result= wikipedia.summary("natural", sentences=50)


print(result)

Nature, in the broadest sense, is the physical world or universe. "Nature" can refer to
The concept of nature as a whole, the physical universe, is one of several expansions of
During the advent of modern scientific method in the last several centuries, nature beca


from wordcloud import WordCloud


import matplotlib.pyplot as plt
def plot_cloud(wordcloud):
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud)
plt.axis("off");
wordcloud = WordCloud(width = 500, height = 500, background_color='white', colormap='Set2', random_state=10).generate(final_result) # completed past the page cut; the random_state value is assumed
plot_cloud(wordcloud)

Student Tasks:
Task 1: Generate a word cloud from the Wikipedia page of Mumbai in Marathi / Hindi

OR

Task 2: Generate a word cloud from the Wikipedia page of Mumbai in English where the word cloud will not
have any stop words and all the words will be reduced to their base form using lemmatization.



Task 2: Generate a word cloud from the Wikipedia page of Mumbai in English where the word cloud will not have any stop words and all the words will be reduced to their base form using lemmatization.

# Import packages
import wikipedia
result= wikipedia.page("mumbai")
final_result = result.content
print(final_result)

Mumbai (English: (listen), Marathi: [ˈmumbəi]; also known as Bombay — the official n

== Etymology ==
The name Mumbai (Marathi: मुंबई, Gujarati: મુંબઈ, Hindi: मुंबई) derived from Mumbā or Mahā

The oldest known names for the city are Kakamuchee and Galajunkja; these are sometime

=== People from Mumbai ===


A resident of Mumbai is called Mumbaikar ( pronounced [mumbəikəɾ] ) in Marathi, in wh

== History ==

=== Early history ===

Mumbai is built on what was once an archipelago of seven islands: Isle of Bombay, Par

King Bhimdev founded his kingdom in the region in the late 13th century and establish

=== Portuguese and British rule ===

The Mughal Empire, founded in 1526, was the dominant power in the Indian subcontinent

The Portuguese were actively involved in the foundation and growth of their Roman Cat

In accordance with the Royal Charter of 27 March 1668, England leased these islands to
By the middle of the 18th century, Bombay began to grow into a major trading town, an

From 1782 onwards, the city was reshaped with large-scale civil engineering projects


=== Independent India ===

After India's independence in 1947, the territory of the Bombay Presidency retained by
Following protests during the movement in which 105 people died in clashes with the po

== Geography ==

Mumbai is on a narrow peninsula on the southwest of Salsette Island, which lies betwe
Mumbai lies at the mouth of the Ulhas River on the western coast of India, in the coa
eastern to Madh Marve on the western front. The eastern coast of Salsette Island is co

=== Climate ===

Mumbai has a tropical climate, specifically a tropical wet and dry climate (Aw) under
Mumbai is prone to monsoon floods, caused due to climate change that is affected by h

=== Air pollution ===


Air pollution is a major issue in Mumbai. According to the 2016 World Health Organizat

result= wikipedia.summary("mumbai", sentences=50)


print(result)

Mumbai (English: (listen), Marathi: [ˈmumbəi]; also known as Bombay — the official name

#Lemmatization
from nltk import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word_list = word_tokenize(result)
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print("After the lemmatization:",lemmatized_output)

After the lemmatization: Mumbai ( English : ( listen ) , Marathi : [ ˈmumbəi ] ; also kn


#word cloud after removing stopwords.


from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from wordcloud import WordCloud
import matplotlib.pyplot as plt
def plot_cloud(wordcloud):
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud)
plt.axis("off");
wordcloud = WordCloud(stopwords = stop_words, width = 500, height = 500, background_color='white', colormap='Set2', random_state=10).generate(lemmatized_output) # completed past the page cut; the trailing arguments and the .generate() input are assumed
plot_cloud(wordcloud)


Practical No. 06

Write a program to get word definitions, examples, synonyms and antonyms using English WordNet

Name: Ritesh Santosh Nimbalkar

Roll No - 9070

WordNet is a significant semantic network interlinking word or group of words employing lexical or
conceptual relationships by labelled arcs. Wordnets are lexical structures composed of synsets and
semantic relations. In wordnet creation, the focus shifts from words to concepts. Each member of a
synset represents the same concept though not all synset members are interchangeable in context.
Synsets contain definition including sentences to describe synonym usage. The membership of
words in multiple synsets or concepts reflects polysemy or multiplicity of meaning.

There are three principles the synset construction process must adhere to:

Minimality: This principle insists on capturing the minimal set of words in the synset which uniquely identifies the concept. For example, {family, house} uniquely identifies a concept (e.g. "he is from the house of the King of Jaipur").

Coverage: This principle then stresses on the completion of the synset, i.e., capturing all the words
that stand for the concept expressed by the synset. Within the synset, the words should be ordered
according to their frequency in the corpus.

Replaceability: This principle demands that the most common words in the synset, i.e., words towards the beginning of the synset, should be able to replace one another in the example sentence associated with the synset.

WordNet contains synsets with words coming from the critical, open class, syntactic categories
like:
a) Noun
b) Adjective
c) Verbs
d) Adverbs

Lexical Relationships:
Antonymy: It is a lexical relation indicating 'opposites'. It mainly originates from descriptive adjectives. Further, each member of a direct antonym pair is associated with some semantically similar adjectives, e.g. the opposite of fat is thin; the antonym of obese is also thin, as obese and fat belong to the same synset.

Gradation: It is a lexical relation that represents possible intermediate states between two
antonyms. Eg. Morning, noon, evening.

Hypernymy and Hyponymy encode lexical relations between a more general term and specific
instances of it. They build a hierarchical tree with increasingly concrete/ particular concepts
growing out from the abstract root.

Meronymy expresses the part-of relationship. Synsets denoting parts, components or members to
synsets indicating the whole are called meronyms.

Holonymy is the inverse of meronymy.

Entailment: It is a semantic relationship between two verbs. A verb C entails a verb B, if the
meaning of B follows logically and is strictly included in the meaning of C. This relation is
unidirectional. For instance, snoring entails sleeping, but sleeping does not entail snoring.

Troponymy: It is a semantic relation between two verbs when one is a specific ‘manner’ elaboration
of another.

WordNet is a tool for solving Word Sense Disambiguation. It can also be used to find abstract or
concrete concepts. Its unique semantic network helps us find word relations, synonyms, grammars,
etc. This helps support NLP tasks such as sentiment analysis, automatic language translation, text
similarity, and more.


In WordNet terminology, each group of synonyms is a synset, and a synonym that forms part of a
synset is a lexical variant of the same concept. For example, in the WordNet network, Word of God,
Word, Scripture, Holy Writ, Holy Scripture, Good Book, Christian Bible and Bible make up the synset
that corresponds to the concept Bible, and each of these forms is a lexical variant.

The NLTK module includes the English WordNet with 155,287 words and 117,659 synonym sets.
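For example, the lexical variants of the Bible synset can be listed directly once the WordNet data is available; a minimal sketch (the install and download cells follow below):

from nltk.corpus import wordnet as wn

# All lemma names (lexical variants) of the first noun sense of "bible"
print(wn.synset('bible.n.01').lemma_names())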

!pip install nltk


import nltk
nltk.download('all')

from nltk.corpus import wordnet

# Or, for more compact code:

from nltk.corpus import wordnet as wn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/p


Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from n


[nltk_data] Downloading collection 'all'


[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.
[nltk_data] | Downloading package cmudict to /root/nltk_data...
[nltk_data] | Unzipping corpora/cmudict.zip.
[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk data] | Downloading package conll2007 to /root/nltk data...

Words


We can look up words which are part of the WordNet lexicon using the synset() and synsets() functions.

A synset may be defined with a 3-part name of the following form:

synset = WORD.POS.NN

where:

Word — the word you are searching for.

Part of Speech (POS) — a particular part of speech (noun, verb, adjective, adverb, pronoun,
preposition, conjunction, interjection, numeral, article, or determiner), in which a word
corresponds to based on both its definition and its context.

NN — a sense key. A word can have multiple meanings or definitions. Therefore, “cake.n.03” is
the third noun sense of the word “cake”.

The POS and NN parameters are optional.

print(wn.synsets('dog'))
print("\n")
print(wn.synsets('run'))
print("\n")
print(wn.synset('dog.n.01'))
print("\n")
print(wn.synset('run.v.01'))

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synse

[Synset('run.n.01'), Synset('test.n.05'), Synset('footrace.n.01'), Synset('streak.n.01')

Synset('dog.n.01')

Synset('run.v.01')

Definitions and Examples

From the previous example, we can see that dog and run have several possible contexts. To help understand the meaning of each one, we can view their definitions using the definition() function.

print("dog.n.01 -- ", wn.synset('dog.n.01').definition())


print("\n")
print("run.n.01 --", wn.synset('run.n.01').definition())

dog.n.01 -- a member of the genus Canis (probably descended from the common wolf) that


run.n.01 -- a score in baseball made by a runner touching all four bases safely

Likewise, if we needed to clarify some examples of the noun dog and verb run in context, we can
use the examples() function.

print("dog.n.01 -- ",wn.synset('dog.n.01').examples())
print("\n")
print("run.v.01 --",wn.synset('run.v.01').examples())

dog.n.01 -- ['the dog barked all night']

run.v.01 -- ["Don't run--you'll be out of breath", 'The children ran to the store']

# One may obtain the lemmas for a given synset as follows:


print(wn.synset('internet.n.01').lemmas())

# For a given lemma, we can also get the synsets corresponding


# to that lemma.
print(wn.lemma('internet.n.01.net').synset())

[Lemma('internet.n.01.internet'), Lemma('internet.n.01.net'), Lemma('internet.n.01.cyber


Synset('internet.n.01')

Hypernyms and Hyponyms

Hypernymy and Hyponymy encode lexical relations between a more general term and specific
instances of it.

A hypernym is described as being a word that is more general than a given word. That is, it is its superordinate term: if X is a hypernym of Y, then all Y are X. For example, animal is a hypernym of dog.

Whereas hyponymy is the relation between two concepts, where concept B is a type of concept A.
For example, beef is a hyponym of meat.

print ("\nSynset abstract term for 'dog': ", wn.synset('dog.n.01').hypernyms())

print ("\nSynset specific term for 'dog': ", wn.synset('dog.n.01').hypernyms()[0].hyponyms())

print ("\nSynset root hypernym for 'dog': ", wn.synset('dog.n.01').root_hypernyms())

print("\n")


print ("\nSynset abstract term for 'run': ", wn.synset('run.v.01').hypernyms())

print ("\nSynset specific term for 'run': ", wn.synset('run.v.01').hypernyms()[0].hyponyms())

print ("\nSynset root hypernym term for 'run': ", wn.synset('run.v.01').root_hypernyms())

print("\n")

#Let us determine the hyponyms of the term "cat", and store them in a variable `types_of_cats`:
cat = wn.synset('cat.n.01')
types_of_cats = cat.hyponyms()

print("hyponyms of cat")
# Now, let us loop through the hyponyms and see the lemmas for each synset:
for synset in types_of_cats:
for lemma in synset.lemmas():
print(lemma.name())

# Note that terms like "domestic_cat" and "house_cat" are


# more specific terms with respect to the term "cat", that is,
# they are hyponyms of the word "cat".
print("\n")

# Example:
# Cat <- hypernym
# house_cat <- hyponym
print("house cat hypernym", wn.synset('house_cat.n.01').hypernyms())

Synset abstract term for 'dog': [Synset('canine.n.02'), Synset('domestic_animal.n.01')]

Synset specific term for 'dog': [Synset('bitch.n.04'), Synset('dog.n.01'), Synset('fox

Synset root hypernym for 'dog': [Synset('entity.n.01')]

Synset abstract term for 'run': [Synset('travel_rapidly.v.01')]

Synset specific term for 'run': [Synset('flit.v.01'), Synset('run.v.01'), Synset('zoom

Synset root hypernym term for 'run': [Synset('travel.v.01')]

hyponyms of cat
domestic_cat
house_cat
Felis_domesticus
Felis_catus
wildcat


house cat hypernym [Synset('cat.n.01'), Synset('domestic_animal.n.01')]

Why is this useful in NLP?

Knowledge of the hypernymy and hyponymy relations is useful for tasks such as question
answering, where a model may be built to understand very general concepts, but is asked specific
questions.

Synonyms and Antonyms

Programmatically identifying accurate synonyms and antonyms is more difficult than it should be.
However, WordNet covers this quite well.

Synonyms are words or expressions of the same language which have the same or a very similar
meaning in some, or all, senses. For example, the synonyms in the WordNet network which
surround the word car are automobile, machine, motorcar, etc.

Antonymy can be defined as the lexical relation which indicates ‘opposites’. Further, each member
of a direct antonym pair is associated with some semantically similar adjectives. e.g. fat is the
opposite of thin; obese’s antonym is also thin as obese and fat belong to the same synset. Naturally,
some words do not have antonyms and other words like recommend just don’t have enough
information in WordNet.

synonyms = []
antonyms = []

for syn in wn.synsets("happy"):


for l in syn.lemmas():
synonyms.append(l.name())
if l.antonyms():
antonyms.append(l.antonyms()[0].name())

print("synonyms",set(synonyms))
print("\n")
print("antonyms",set(antonyms))

synonyms {'felicitous', 'well-chosen', 'glad', 'happy'}

antonyms {'unhappy'}

Meronyms and Holonyms


Meronymy expresses the ‘components-of’ relationship. That is, a relation between two concepts,
where concept A makes up a part of concept B. For meronyms, we can take advantage of two NLTK
functions:

part_meronyms() and substance_meronyms(). For example, part meronyms of the noun hat include crown, and substance meronyms of the noun water include hydrogen.

tree = wn.synset('tree.n.01')

print(tree.part_meronyms())
print('\n')
print(tree.substance_meronyms())

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), S

[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
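Holonymy, the inverse 'whole-of' relation, and verb entailment can be queried in the same way; a minimal sketch using synsets that already appear above:

from nltk.corpus import wordnet as wn

# Holonyms: the crown (of a tree) is part of a tree
print(wn.synset('crown.n.07').part_holonyms())

# Entailment: snoring entails sleeping
print(wn.synset('snore.v.01').entailments())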

Student task

print(wn.synsets('tiger'))
print("\n")
print(wn.synsets('hunt'))
print("\n")
print(wn.synset('tiger.n.01'))
print("\n")
print(wn.synset('hunt.v.01'))

[Synset('tiger.n.01'), Synset('tiger.n.02')]

[Synset('hunt.n.01'), Synset('hunt.n.02'), Synset('hunt.n.03'), Synset('hunt.n.04'), Syn

Synset('tiger.n.01')

Synset('hunt.v.01')

print("tiger.n.01 -- ", wn.synset('tiger.n.01').definition())


print("\n")
print("hunt.n.01 --", wn.synset('hunt.n.01').definition())

tiger.n.01 -- a fierce or audacious person


hunt.n.01 -- Englishman and Pre-Raphaelite painter (1827-1910)

print("tiger.n.01 -- ",wn.synset('tiger.n.01').examples())
print("\n")
print("hunt.v.01 --",wn.synset('hunt.v.01').examples())

tiger.n.01 -- ["he's a tiger on the tennis court", 'it aroused the tiger in me']

hunt.v.01 -- ['Goering often hunted wild boars in Poland', 'The dogs are running deer',

# One may obtain the lemmas for a given synset as follows:


print(wn.synset('internet.n.01').lemmas())

# For a given lemma, we can also get the synsets corresponding


# to that lemma.
print(wn.lemma('internet.n.01.net').synset())

[Lemma('internet.n.01.internet'), Lemma('internet.n.01.net'), Lemma('internet.n.01.cyber


Synset('internet.n.01')

print ("\nSynset abstract term for 'tiger': ", wn.synset('tiger.n.01').hypernyms())

print ("\nSynset specific term for 'tiger': ", wn.synset('tiger.n.01').hypernyms()[0].hyponym

print ("\nSynset root hypernym for 'tiger': ", wn.synset('tiger.n.01').root_hypernyms())

print("\n")

print ("\nSynset abstract term for 'hunt': ", wn.synset('hunt.v.01').hypernyms())

print ("\nSynset specific term for 'hunt': ", wn.synset('hunt.v.01').hypernyms()[0].hyponyms(

print ("\nSynset root hypernym term for 'hunt': ", wn.synset('hunt.v.01').root_hypernyms())

print("\n")

#Let us determine the hyponyms of the term "snake", and store them in a variable `types_of_snake`:
snake = wn.synset('snake.n.01')
types_of_snake = snake.hyponyms()

print("hyponyms of snake")
# Now, let us loop through the hyponyms and see the lemmas for each synset:
for synset in types_of_snake:
for lemma in synset.lemmas():
print(lemma.name())


print("\n")

print("house snake hypernym", wn.synset('house_snake.n.01').hypernyms())

Synset abstract term for 'tiger': [Synset('person.n.01')]

Synset specific term for 'tiger': [Synset('abator.n.01'), Synset('abjurer.n.01'), Synse

Synset root hypernym for 'tiger': [Synset('entity.n.01')]

Synset abstract term for 'hunt': [Synset('capture.v.06')]

Synset specific term for 'hunt': [Synset('bag.v.01'), Synset('batfowl.v.01'), Synset('f

Synset root hypernym term for 'hunt': [Synset('get.v.01')]

hyponyms of snake
blind_snake
worm_snake
colubrid_snake
colubrid
constrictor
elapid
elapid_snake
sea_snake
viper

house snake hypernym [Synset('king_snake.n.01')]

synonyms = []
antonyms = []

for syn in wn.synsets("brave"):


for l in syn.lemmas():
synonyms.append(l.name())
if l.antonyms():
antonyms.append(l.antonyms()[0].name())

print("synonyms",set(synonyms))
print("\n")
print("antonyms",set(antonyms))

synonyms {'endure', 'dauntless', 'courageous', 'weather', 'unfearing', 'brave', 'hardy',

antonyms {'timid', 'cowardly'}


car = wn.synset('car.n.01')

print(car.part_meronyms())
print('\n')
print(car.substance_meronyms()) # empty list: WordNet records no substance meronyms for car

[Synset('accelerator.n.01'), Synset('air_bag.n.01'), Synset('auto_accessory.n.01'), Syns

[]


Practical No. 07

Write a program to identify Part of Speech of English Text

Name: Ritesh Santosh Nimbalkar

Roll No: 9070

The process of classifying words into their parts of speech and labeling them accordingly is known
as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word
classes or lexical categories. The collection of tags used for a particular task is known as a tagset.

Part of Speech example:

Input: Everything to permit us.

Output: [('Everything', NN),('to', TO), ('permit', VB), ('us', PRP)]

A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of


speech tag to each word

!pip install nltk


import nltk
nltk.download('all')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/p


Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from n
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from
[nltk_data] Downloading collection 'all'
[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.

[nltk_data] | Downloading package basque_grammars to


[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.
[nltk_data] | Downloading package cmudict to /root/nltk_data...
[nltk_data] | Unzipping corpora/cmudict.zip.
[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk_data] | Downloading package conll2007 to /root/nltk_data...

from nltk.tokenize import word_tokenize


from nltk.tokenize import sent_tokenize

text = word_tokenize("Many Competitive Exams in India like UPSC Civil Services Exams, Bank Ex
print(nltk.pos_tag(text))

[('Many', 'JJ'), ('Competitive', 'NNP'), ('Exams', 'NNP'), ('in', 'IN'), ('India', 'NNP

Steps involved in the POS tagging example:

1. Tokenize the text (word_tokenize).
2. Apply pos_tag to the tokens from the above step, i.e., nltk.pos_tag(tokenized_text).

The NLTK POS tags are listed below:


Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ This NLTK POS Tag is an adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun plural (desks)
NNP proper noun, singular (sarah)
NNPS proper noun, plural (indians or americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him,himself)
PRP$ possessive pronoun (her, his, mine, my, our )
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinite marker (to)
UH interjection (goodbye)
VB verb (ask)
VBG verb gerund (judging)
VBD verb past tense (pleaded)
VBN verb past participle (reunified)
VBP verb, present tense not 3rd person singular(wrap)
VBZ verb, present tense with 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh- pronoun (who)
WRB wh- adverb (how)


NLTK provides documentation for each tag, which can be queried using the tag, e.g.
nltk.help.upenn_tagset('RB')

nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular


Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool ...

nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal


third ill-mannered pre-war regrettable oiled calamitous first separable
ectoplasmic battery-powered participatory fourth still-to-be-named
multilingual multi-disciplinary ...

nltk.help.upenn_tagset('NNS')

NNS: noun, common, plural


undergraduates scotches bric-a-brac products bodyguards facets coasts
divestitures storehouses designs clubs fragrances averages
subjectivists apprehensions muses factory-jobs ...

Why POS Tagging is Useful? POS tagging can be really useful, particularly if you have words or
tokens that can have multiple POS tags. For instance, the word "google" can be used as both a noun
and verb, depending upon the context. While processing natural language, it is important to identify
this difference.

text = word_tokenize("Can you google it?.")


print(nltk.pos_tag(text))

[('Can', 'MD'), ('you', 'PRP'), ('google', 'VB'), ('it', 'PRP'), ('?', '.'), ('.', '.')]

text = word_tokenize("Google is a tech leader.")


print(nltk.pos_tag(text))

[('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('tech', 'JJ'), ('leader', 'NN'), ('.',

Student Task
POS tagger for Hindi Text


from nltk.corpus import indian


from nltk.tag import tnt

train_data = indian.tagged_sents('hindi.pos')
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data) #Training the tnt Part of speech tagger with hindi data
text = "भारत में कई प्रतियोगी परीक्षाओं जैसे यूपीएससी सिविल सेवा परीक्षा, बैंक परीक्षा आदि में भी निबंध लेखन पर एक
tagged_words = (tnt_pos_tagger.tag(nltk.word_tokenize(text)))
print(tagged_words)

[('भारत', 'NNP'), ('में', 'PREP'), ('कई', 'QF'), ('प्रतियोगी', 'Unk'), ('परीक्षाओं', 'Unk'), ('
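A rough way to judge the quality of this tagger is to hold out part of the tagged corpus for testing; a minimal sketch (the 90/10 split is an arbitrary choice):

from nltk.corpus import indian
from nltk.tag import tnt

tagged = indian.tagged_sents('hindi.pos')
split = int(len(tagged) * 0.9)            # 90% for training, 10% held out for testing
train_sents, test_sents = tagged[:split], tagged[split:]

eval_tagger = tnt.TnT()
eval_tagger.train(train_sents)
print(eval_tagger.evaluate(test_sents))   # fraction of held-out tokens tagged correctly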


Practical No. 08

Write a program to check Word Similarity using English


WordNet

Name: Ritesh Santosh Nimbalkar

Roll No: 9070

WordNet is a significant semantic network interlinking word or group of words employing lexical or
conceptual relationships by labelled arcs. Wordnets are lexical structures composed of synsets and
semantic relations. In wordnet creation, the focus shifts from words to concepts. Each member of a
synset represents the same concept though not all synset members are interchangeable in context.
Synsets contain definition including sentences to describe synonym usage. The membership of
words in multiple synsets or concepts reflects polysemy or multiplicity of meaning.

WordNet contains synsets with words coming from the critical, open class, syntactic categories
like:
a) Noun
b) Adjective
c) Verbs
d) Adverbs

WordNet is a tool for solving Word Sense Disambiguation. It can also be used to find abstract or
concrete concepts. Its unique semantic network helps us find word relations, synonyms, grammars,
etc. This helps support NLP tasks such as sentiment analysis, automatic language translation, text
similarity, and more.

In WordNet terminology, each group of synonyms is a synset, and a synonym that forms part of a
synset is a lexical variant of the same concept. For example, in the WordNet network, Word of God,
Word, Scripture, Holy Writ, Holy Scripture, Good Book, Christian Bible and Bible make up the synset
that corresponds to the concept Bible, and each of these forms is a lexical variant.

The NLTK module includes the English WordNet with 155,287 words and 117,659 synonym sets.


!pip install nltk


import nltk
nltk.download('all')

from nltk.corpus import wordnet

# Or, for more compact code:

from nltk.corpus import wordnet as wn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/p


Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from n
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packag
[nltk_data] Downloading collection 'all'
[nltk_data] |
[nltk_data] | Downloading package abc to /root/nltk_data...
[nltk_data] | Unzipping corpora/abc.zip.
[nltk_data] | Downloading package alpino to /root/nltk_data...
[nltk_data] | Unzipping corpora/alpino.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] | Downloading package averaged_perceptron_tagger_ru to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping
[nltk_data] | taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data] | Downloading package basque_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/basque_grammars.zip.
[nltk_data] | Downloading package bcp47 to /root/nltk_data...
[nltk_data] | Downloading package biocreative_ppi to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/biocreative_ppi.zip.
[nltk_data] | Downloading package bllip_wsj_no_aux to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data] | Downloading package book_grammars to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping grammars/book_grammars.zip.
[nltk_data] | Downloading package brown to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown.zip.
[nltk_data] | Downloading package brown_tei to /root/nltk_data...
[nltk_data] | Unzipping corpora/brown_tei.zip.
[nltk_data] | Downloading package cess_cat to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_cat.zip.
[nltk_data] | Downloading package cess_esp to /root/nltk_data...
[nltk_data] | Unzipping corpora/cess_esp.zip.
[nltk_data] | Downloading package chat80 to /root/nltk_data...
[nltk_data] | Unzipping corpora/chat80.zip.
[nltk_data] | Downloading package city_database to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/city_database.zip.


[nltk_data] | Downloading package cmudict to /root/nltk_data...


[nltk_data] | Unzipping corpora/cmudict.zip.
[nltk_data] | Downloading package comparative_sentences to
[nltk_data] | /root/nltk_data...
[nltk_data] | Unzipping corpora/comparative_sentences.zip.
[nltk_data] | Downloading package comtrans to /root/nltk_data...
[nltk_data] | Downloading package conll2000 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2000.zip.
[nltk_data] | Downloading package conll2002 to /root/nltk_data...
[nltk_data] | Unzipping corpora/conll2002.zip.
[nltk data] | Downloading package conll2007 to /root/nltk data...

Word Similarity

We can also determine the similarity between words.

How related are two words? Let us take the terms we have learned thus far, along with what WordNet provides, to define a metric of how two words are related to one another.

There are a few ways in which to calculate the similarities between words.

The path_similarity() function returns a score denoting how similar two words are, based on the shortest path that connects their senses in the hypernym/hyponym hierarchy of the WordNet network.

# Let us calculate this metric of similarity between the words
# "textbook" and "book".

# First, define the synsets for these terms:


textbook = wn.synset('textbook.n.01')
book = wn.synset('book.n.01')

# Now, call the `path_similarity` function. This function returns a score


# between 0 and 1, where 0 is no similarity between the hypernym/hyponym
# tree and a distance of 1 is the node which houses both of the words
# in terms of hypernyms/hyponyms is identical.
print("Similarity between textbook and book", textbook.path_similarity(book))

# We see that "textbook" and "book" have the highest similarity possible,
# with a score of 0.5.

# Let us now take a look at the term "magazine" and "book":


magazine = wn.synset('magazine.n.01')
print("Similarity between magazine and book",magazine.path_similarity(book))

# We see a lower number here. This again makes sense, since the traversal


# with respect to hypernyms/hyponyms from magazine to book is certainly longer,
# so the score falls below 0.5.

Similarity between textbook and book 0.5


Similarity between magazine and book 0.3333333333333333

There are actually many more ways in which to define distances between words.

1. Wu-Palmer Similarity
2. Resnik Similarity
3. Jiang-Conrath Similarity
4. Lin Similarity

Wu & Palmer's similarity is calculated by considering the depths of the two synsets in the network, as well as the depth of their most specific common ancestor node (the Least Common Subsumer, LCS).

The similarity score lies in the range 0 < score ≤ 1, where 1 indicates that the words are the same. The score can never be 0 because the depth of the LCS is never 0 (the depth of the root of the taxonomy is 1).

cat = wn.synset('cat.v.01') # first verb sense of 'cat'
dog = wn.synset('dog.v.01') # first verb sense of 'dog'

print("Similarity between cat and dog using WUP",cat.wup_similarity(dog))

black = wn.synset('black.n.01')
white = wn.synset('white.n.01')
rainbow = wn.synset('rainbow.n.01')
colours = wn.synset('colours.n.01')

print("Similarity between black and white using WUP",black.wup_similarity(white))


print("Similarity between black and white using PathSim",black.path_similarity(white))

print("Similarity between black and dog using WUP",black.wup_similarity(dog))

print("Similarity between rainbow and colour using WUP",rainbow.wup_similarity(colours))



print("Similarity between rainbow and colour using PathSim",rainbow.path_similarity(colours))

print("Similarity between rainbow and black using WUP",rainbow.wup_similarity(black))

print("Similarity between rainbow and black using PathSim",rainbow.path_similarity(black))

Similarity between cat and dog using WUP 0.25


Similarity between black and white using WUP 0.15384615384615385
Similarity between black and white using PathSim 0.08333333333333333
Similarity between black and dog using WUP 0.15384615384615385
Similarity between rainbow and colour using WUP 0.11764705882352941
Similarity between rainbow and colour using PathSim 0.0625
Similarity between rainbow and black using WUP 0.375
Similarity between rainbow and black using PathSim 0.09090909090909091
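The remaining measures listed above (Resnik, Jiang-Conrath and Lin) are corpus-based and need an information-content file in addition to the two synsets; a minimal sketch, assuming the wordnet_ic data has been downloaded (it is included in nltk.download('all')):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content estimated from the Brown corpus

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

print("Resnik similarity:", dog.res_similarity(cat, brown_ic))
print("Jiang-Conrath similarity:", dog.jcn_similarity(cat, brown_ic))
print("Lin similarity:", dog.lin_similarity(cat, brown_ic))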

Why is this useful in NLP?

Language is flexible, and people will use a variety of different words to describe the same thing. So,
if you had a large dataset of customer reviews and you wanted to extract those which discuss the
same aspects of the product, finding which are similar will help narrow that search.

Student Task
Perform similarity checking at Sentence level for English language.

!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/publ


Collecting sentence-transformers
Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.0/86.0 KB 6.9 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 81.2 MB/s eta 0:00:00
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from sent
Requirement already satisfied: torch>=1.6.0 in /usr/local/lib/python3.9/dist-packages (f
Requirement already satisfied: torchvision in /usr/local/lib/python3.9/dist-packages (fr
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (from sen
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.9/dist-packages (f
Requirement already satisfied: scipy in /usr/local/lib/python3.9/dist-packages (from sen
Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (from sent
Collecting sentencepiece
Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 70.3 MB/s eta 0:00:00
Collecting huggingface-hub>=0.4.0
Downloading huggingface_hub-0.13.1-py3-none-any.whl (199 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.2/199.2 KB 13.4 MB/s eta 0:00:00
Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.9/dist-packages

Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.9/di


Requirement already satisfied: filelock in /usr/local/lib/python3.9/dist-packages (from
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.9/dist-packages (fr
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.9/dist-packag
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 87.7 MB/s eta 0:00:00
Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nlt
Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nl
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-pac
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.9/dist-pa
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packag
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-pa
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packa
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (f
Building wheels for collected packages: sentence-transformers
Building wheel for sentence-transformers (setup.py) ... done
Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none
Stored in directory: /root/.cache/pip/wheels/71/67/06/162a3760c40d74dd40bc855d527008d2
Successfully built sentence-transformers
Installing collected packages: tokenizers, sentencepiece, huggingface-hub, transformers,
Successfully installed huggingface-hub-0.13.1 sentence-transformers-2.2.2 sentencepiece-

sentences = [
"The Moon is a barren, rocky world without air and water.",
"It has dark lava plain on its surface. The Moon is filled wit craters.",
"It has no light of its own.",
"It gets its light from the Sun. "
]

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

sentence_embeddings = model.encode(sentences)

from sklearn.metrics.pairwise import cosine_similarity


#Let's calculate cosine similarity for sentence 0:

cosine_similarity(
[sentence_embeddings[0]],
sentence_embeddings[1:]
)

array([[0.63135767, 0.6837643 , 0.09040618]], dtype=float32)

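To compare every sentence with every other one at once, the full pairwise matrix can be computed in a single call on the embeddings produced above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(sentence_embeddings)   # shape (4, 4); the diagonal is 1.0
print(np.round(sim_matrix, 3))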