
SHUBHAM JADE

MSc IT
31031420010
NLP Practical Journal
Index

No.  Title                                                   Teacher's Sign

1    Generating Root Words
2    Sentence and Word Tokenization
3    Part of Speech Tagging
4    Generating Parse Tree using Chunk Parser
5    Finding Term Frequency and Inverse Document Frequency
6    Removing Stop Words
7    Using probabilistic model to predict the next word
8    Word Similarity
9    Named Entity Recognition
10   Using Synset and Wordnet database


Practical 1
Implement Python code that generates the root words in the given sentences.

Stemming Example 1
from nltk.stem import PorterStemmer
ps = PorterStemmer()
text4 = "I am a Student of Somaiya University.".split()
print(text4)
for w in text4:
    rootWord = ps.stem(w)
    print(rootWord)

Stemming Example 2
words=["Unexpected", "disagreement", "disagree", "agreement",
"quirkiness", "historical", "canonical", "happiness", "unkind",
"dogs", "expected"]
for w in words:
stemPrint=ps.stem(w)
print(w,” -Stem- ”,stemPrint)
Lemmatization Example 1
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text5 = "I am Studying in Part 2."
tokenization = nltk.word_tokenize(text5)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Lemmatization Example 2
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
words2 = ["Unexpected", "disagreement", "disagree", "agreement",
          "quirkiness", "historical", "canonical", "happiness", "unkind",
          "dogs", "expected", "studies", "cries", "applies"]
for w in words2:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
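By default, WordNetLemmatizer looks every word up as a noun, which is why verb forms such as "Studying" come back unchanged. A minimal sketch (assuming the wordnet corpus has already been downloaded) that passes an explicit part-of-speech tag:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# pos='v' makes the lemmatizer treat the word as a verb
print(lemmatizer.lemmatize("studying", pos='v'))   # study
print(lemmatizer.lemmatize("cries", pos='v'))      # cry
print(lemmatizer.lemmatize("studying"))            # noun lookup: unchanged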
Practical 2
Implement a Python program that splits a sentence into words and displays both the split words and the word count, using the tokenizer function.

#Word Tokenization using split() function


ExText="I am a Student of Somaiya University."
SplitText=ExText.split()
print(SplitText)
print("The number of words in given sentence are " + str(len(SplitText)))
#Sentence Tokenization using split() function
ExText="I am a Student. My College is Somaiya University."
SplitText=ExText.split('.')
print(SplitText)
print("The number of sentences in given text are " + str(len(SplitText)))
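Note that splitting on '.' leaves an empty string after the final full stop, so the count above is one more than the number of real sentences. A small sketch that drops the empty pieces before counting:

ExText = "I am a Student. My College is Somaiya University."
sentences = [s.strip() for s in ExText.split('.') if s.strip()]
print(sentences)
print("The number of sentences in given text are " + str(len(sentences)))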
#Using Sent Tokenizer and word Tokenizer Modules
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag_sents
from nltk.tokenize import word_tokenize, sent_tokenize

#Assign Example Text


ExText = ('Natural language processing (NLP) refers to the branch of computer '
          'science—and more specifically. The branch of artificial intelligence or '
          'AI—concerned with giving computers the ability to understand text and '
          'spoken words in much the same way human beings can.')
#Sentence Tokenization
text_sentence_tokens = sent_tokenize(ExText)
print(text_sentence_tokens)
#Word Tokenization
text_word_tokens = []
for sentence_token in text_sentence_tokens:
    text_word_tokens.append(word_tokenize(sentence_token))
print(text_word_tokens)
#POS Tag Word Tokens
text_tagged = pos_tag_sents(text_word_tokens)
print (text_tagged)
#Tokenizing a Contraction
word_tokenize("can't")
# Output: ['ca', "n't"]
#TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('Hello World.')
#WordPunctTokenizer
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Can't is a contraction.")
#RegexpTokenizer
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("[\w']+")
tokenizer.tokenize("Can't is a contraction.")
#RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokenizer.tokenize("Can't is a contraction.")
#Tokenizing webtext corpus Text - overheard.txt
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('overheard.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents1 = sent_tokenizer.tokenize(text)
#Display only first two sentences of sent1
sents1[0:2]

#sent at index 500


sents1[500]
#Tokenize Encoded Text
with open('/root/nltk_data/corpora/webtext/overheard.txt',
          encoding='ISO-8859-2') as f:
    text = f.read()
sent_tokenizer = PunktSentenceTokenizer(text)
sents = sent_tokenizer.tokenize(text)
sents[0]
Practical 3
Write a Python program to read a paragraph and generate tokens from it using the sentence tokenizer. Also find the part of speech for each word in the generated tokens.

#Assign Example Text


ExText = ('Natural language processing (NLP) refers to the branch of computer '
          'science—and more specifically. The branch of artificial intelligence or '
          'AI—concerned with giving computers the ability to understand text and '
          'spoken words in much the same way human beings can.')
#Sentence Tokenization
text_sentence_tokens = sent_tokenize(ExText)
print(text_sentence_tokens)
#POS Tag the Sentences
SentTokens=nltk.sent_tokenize(ExText)
print(nltk.pos_tag(SentTokens))
#Word Tokenization

text_word_tokens = []
for sentence_token in text_sentence_tokens:
    text_word_tokens.append(word_tokenize(sentence_token))
print(text_word_tokens)
#POS Tag Word Tokens
text_tagged = pos_tag_sents(text_word_tokens)
print (text_tagged)
#Default tagging
from nltk.tag import DefaultTagger
tagger = DefaultTagger('NN')
tagger.tag(['Hello', 'World'])
#Evaluating Accuracy
import nltk
nltk.download('treebank')
from nltk.corpus import treebank
test_sents = treebank.tagged_sents()[3000:]
tagger.evaluate(test_sents)
#Tagging Sentence
tagger.tag_sents([['Hello', 'world', '.'], ['How', 'are', 'you','?']])
#Untagging a tagged sentence
from nltk.tag import untag

untag([('Hello', 'NN'), ('World', 'NN')])


untag([('Hello', 'DD'), ('World', 'DD')])
untag([('Hello', 'NN'), ('World', 'JJ')])
#Regular Expression Tagger
import nltk
nltk.download('brown')
from nltk.corpus import brown
from nltk.tag import RegexpTagger
test_sent = brown.sents(categories='news')[0]
regexp_tagger = RegexpTagger([
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'(The|the|A|a|An|an)$', 'AT'),  # articles
    (r'.*able$', 'JJ'),               # adjectives
    (r'.*ness$', 'NN'),               # nouns formed from adjectives
    (r'.*ly$', 'RB'),                 # adverbs
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # past tense verbs
    (r'.*', 'NN')                     # nouns (default)
])
print(regexp_tagger)
print(regexp_tagger.tag(test_sent))
#Tagging a plain string (split into tokens first) and a token list
sent_str = "asd40 500 running ended"
str1 = ['asd40', '500', 'running', 'ended']
print(regexp_tagger.tag(sent_str.split()))
print(regexp_tagger.tag(str1))
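The DefaultTagger used earlier also works as a backoff for a trained tagger. A minimal sketch (reusing the treebank corpus downloaded above) that trains a UnigramTagger with an 'NN' backoff and checks its accuracy:

from nltk.tag import UnigramTagger, DefaultTagger
from nltk.corpus import treebank

train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:]
uni_tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(uni_tagger.tag(['Hello', 'world', '.']))
print(uni_tagger.evaluate(test_sents))  # accuracy on held-out sentences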
Practical 4
Draw a parse tree in Python for a given sentence under the required grammar rules, using chunk parsing.

grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)

Ex 2
import nltk
from nltk.parse import RecursiveDescentParser
Prod_rule=nltk.CFG.fromstring("""
S -> NP VP
NP -> N
NP -> Det N
VP -> V NP
VP -> V
N -> 'Person Name' | 'He' | 'She' | 'Boy' | 'Girl' | 'It' | 'cricket' | 'song' | 'book'
V -> 'likes' | 'reads' | 'sings'
""")
sent='He likes cricket'
sent1=sent.split()
sent1
parser = nltk.RecursiveDescentParser(Prod_rule)
parser
for t in parser.parse(sent1):
    print(t)

Ex 3
Simple_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
sent
parser = nltk.ChartParser(Simple_grammar)
for tree in parser.parse(sent):
    print(tree)
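The three examples above use full CFG parsers; since the practical title names chunk parsing, here is a minimal sketch (sentence and chunk rule invented for illustration) that builds a shallow parse tree with nltk.RegexpParser over POS-tagged tokens:

import nltk

sentence = "The little dog saw a cat in the park"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# NP chunk = optional determiner, any number of adjectives, then a noun
chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
chunk_tree = chunk_parser.parse(tagged)
print(chunk_tree)
# chunk_tree.draw()  # opens the parse tree in a window (needs a GUI)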
Practical 5
Write a python code to find the term frequency and inverse document frequency for three
documents. (Consider 3 documents as 3 paragraphs)

Ex 1
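Note: Ex 1 reuses the imports (CountVectorizer, TfidfVectorizer, pandas) and the preprocessed txt1 list that are set up in Ex 2 below.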
# Getting bigrams
vectorizer = CountVectorizer(ngram_range =(2, 2))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
print("\n\nX1 : \n", X1.toarray())
# Applying TFIDF
# You can still get n-grams here

vectorizer = TfidfVectorizer(ngram_range = (2, 2))


X2 = vectorizer.fit_transform(txt1)
scores = (X2.toarray())
print("\n\nScores : \n", scores)
# Getting top ranking features
sums = X2.sum(axis = 0)
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns = ['term', 'rank'])
words = (ranking.sort_values('rank', ascending = False))
print ("\n\nWords : \n", words.head(7))

Ex 2
# Importing libraries
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd

# Input the file


txt1 = []
with open('C:\\Users\\DELL\\Desktop\\MachineLearning1.txt') as file:
    txt1 = file.readlines()
# Preprocessing
def remove_string_special_characters(s):
    # Remove special characters
    stripped = re.sub(r'[^a-zA-Z\s]', '', s)
    stripped = re.sub('_', '', stripped)
    # Change any white space to one space
    stripped = re.sub(r'\s+', ' ', stripped)
    # Remove start and end white spaces
    stripped = stripped.strip()
    if stripped != '':
        return stripped.lower()
# Stopword removal
stop_words = set(stopwords.words('english'))
your_list = ['skills', 'ability', 'job', 'description']
for i, line in enumerate(txt1):
    txt1[i] = ' '.join([x for x in nltk.word_tokenize(line)
                        if (x not in stop_words) and (x not in your_list)])
# Getting trigrams
vectorizer = CountVectorizer(ngram_range = (3,3))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
print("\n\nFeatures : \n", features)
print("\n\nX1 : \n", X1.toarray())
# Applying TFIDF
vectorizer = TfidfVectorizer(ngram_range = (3,3))
X2 = vectorizer.fit_transform(txt1)
scores = (X2.toarray())
print("\n\nScores : \n", scores)
# Getting top ranking features
sums = X2.sum(axis = 0)
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns = ['term','rank'])
words = (ranking.sort_values('rank', ascending = False))
print ("\n\nWords head : \n", words.head(7))
Practical 6
Implement Python code to remove stop words and identify the part of speech for a given paragraph.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords.fileids()
stopwords.words('english')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = """This is a sample sentence,
showing off the stop words filtration."""
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
# Option 1: list comprehension (case-insensitive check)
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
# Option 2: explicit loop (case-sensitive check); this overwrites the list built above
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
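The practical also asks for the part of speech of the remaining words. A short sketch (assuming the averaged_perceptron_tagger resource is available) that tags the filtered tokens:

import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

# POS-tag the tokens that survived stop-word removal
print(pos_tag(filtered_sentence))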
Practical 7
Find the probability of a given sentence, where all the words in the sentence appear in toy_pcfg1 or toy_pcfg2, using Viterbi PCFG parsing.

#Sentence Generation
import itertools
from nltk.grammar import CFG
from nltk.parse import generate
demo_grammar = """
S -> NP VP
NP -> Det N
PP -> P NP
VP -> 'slept' | 'saw' NP | 'walked' PP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog'
P -> 'in' | 'with'
"""
grammar = CFG.fromstring(demo_grammar)
for n, sent in enumerate(generate.generate(grammar, n=10), 1):
    print('%3d. %s' % (n, ' '.join(sent)))

Ex 2
from nltk.grammar import Nonterminal
from nltk.grammar import toy_pcfg2
from nltk.probability import DictionaryProbDist
productions = toy_pcfg2.productions()
# Get all productions with LHS=NP
np_productions = toy_pcfg2.productions(Nonterminal('NP'))
prob_dict = {}
for pr in np_productions:
    prob_dict[pr.rhs()] = pr.prob()
np_probDist = DictionaryProbDist(prob_dict)
# Each call to generate() draws a random sample, e.g.:
print(np_probDist.generate())  # (Det, N)
print(np_probDist.generate())  # (Name,)
print(np_probDist.generate())  # (Name,)
# Exercise: pcfg_generate(grammar) -- return a tree sampled from the language
# described by the PCFG grammar.
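For the parsing part of the practical, a minimal sketch of Viterbi PCFG parsing (the sentence is chosen so that, as in recent NLTK versions, every word appears in toy_pcfg2's lexicon):

import nltk
from nltk.grammar import toy_pcfg2
from nltk.parse import ViterbiParser

sent = "Jack saw the telescope".split()
parser = ViterbiParser(toy_pcfg2)
for tree in parser.parse(sent):
    # Each result is a ProbabilisticTree; prob() gives the parse probability
    print(tree)
    print("Probability:", tree.prob())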
Practical 8
Given two words, calculate the similarity between the words
a. By using path similarity.
b. By using Wu-Palmer Similarity.

#Synsets
from nltk.corpus import wordnet
syn1 = wordnet.synsets('hello')[0]
syn2 = wordnet.synsets('selling')[0]
print ("hello name : ", syn1.name())
print ("selling name : ", syn2.name())
a. By using path similarity.
ref = syn1.hypernyms()[0]
print ("Self comparison : ",
syn1.shortest_path_distance(ref))
print ("Distance of hello from greeting : ",
syn1.shortest_path_distance(syn2))
print ("Distance of greeting from hello : ",
syn2.shortest_path_distance(syn1))
b. By using Wu-Palmer Similarity.
syn1.wup_similarity(syn2)
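shortest_path_distance counts edges in the WordNet hierarchy; the path similarity named in part (a) is the related score in (0, 1] that path_similarity returns. A short sketch on the same two synsets:

# Path similarity = 1 / (shortest_path_distance + 1)
print("Path similarity of hello and selling : ", syn1.path_similarity(syn2))
# Wu-Palmer similarity for comparison
print("Wu-Palmer similarity of hello and selling : ", syn1.wup_similarity(syn2))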
Practical 9
Consider a sentence and do the following.
a. Import the libraries.
b. Then apply word tokenization and Part-Of-Speech tagging to the sentence.
c. Create a chunk parser and test it on the sentence.
d. Identify nationalities or religions or political groups, organization, date and money in the
given sentence.
(Select sentence appropriately)

● Import the libraries.


import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
● Then apply word tokenization and Part-Of-Speech tagging to the sentence.
sent= '''Prime Minister Jacinda Ardern has claimed that New Zealand had
won a big battle over the spread of coronavirus. Her words came as the
country begins to exit from its lockdown.'''
words= word_tokenize(sent)
postags=pos_tag(words)
postags
● Create a chunk parser and test it on the sentence.
nltk.download('maxent_ne_chunker')
nltk.download('words')
ne_tree = nltk.ne_chunk(postags,binary=False)
print(ne_tree)
● Identify nationalities or religions or political groups, organization, date and money in the given sentence. (Select sentence appropriately)
locs = [('Omnicom', 'IN', 'New York'),
        ('DDB Needham', 'IN', 'New York'),
        ('Kaplan Thaler Group', 'IN', 'New York'),
        ('BBDO South', 'IN', 'Atlanta'),
        ('Georgia-Pacific', 'IN', 'Atlanta')]
query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']
print(query)
from nltk.chunk import tree2conlltags
#from pprint import pprint
iob_tagged = tree2conlltags(ne_tree)
print(iob_tagged)
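The locs list above is a hard-coded example of relation querying. To pull the entity types out of the sentence itself, a small sketch that walks the chunk tree from ne_chunk and groups the words under each named-entity label (labels such as PERSON, GPE and ORGANIZATION):

# Collect (label, entity text) pairs from the named-entity chunk tree
entities = []
for subtree in ne_tree:
    if hasattr(subtree, 'label'):
        entity = " ".join(word for word, tag in subtree.leaves())
        entities.append((subtree.label(), entity))
print(entities)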
Practical 10
Write down the syntax for the following:
a. Import wordnet, use the term “hello” to find synsets.
b. Using Synset, find the element in the 0th index, just the word (using lemmas).
c. Name, Definition of that first (0th index) Synset and examples of the word.
d. Discern synonyms and antonyms in synset.
e. Discern Hypernyms and Hyponyms in Synset.

#Working with wordnet and synset


import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
syn = wordnet.synsets('hello')[0]
syn.name()
syn.definition()
2. Using Synset, find the element in the 0th index, just the word (using lemmas).
# Just the word (syn is already the 0th synset, so index its lemmas directly):
print(syn.lemmas()[0].name())
3. Name, Definition of that first (0th index) Synset and examples of the word.
# Examples of the word in use in sentences:
print(syn.examples())
4. Discern synonyms and antonyms in synset.

import nltk
from nltk.corpus import wordnet
synonyms = []
antonyms = []
for syn in wordnet.synsets("good"):
for l in syn.lemmas():
synonyms.append(l.name())
if l.antonyms():
antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
5. Discern Hypernyms and Hyponyms in Synset.
# Re-select the first 'hello' synset (the loop above reassigned syn)
syn = wordnet.synsets('hello')[0]
#hypernym of synset
syn.hypernyms()
#Similar synsets
syn.hypernyms()[0].hyponyms()
#Tree path of synset
syn.hypernym_paths()

#POS of synset
syn.pos()
len(wordnet.synsets('great'))
len(wordnet.synsets('great', pos='n'))
len(wordnet.synsets('great', pos='a'))
f. Compare the similarity index of any two words
import nltk
from nltk.corpus import wordnet
# Compare the verbs "run" and "sprint":
w1 = wordnet.synset('run.v.01') # v here denotes the tag verb
w2 = wordnet.synset('sprint.v.01')
print(w1.wup_similarity(w2))
