
NATURAL LANGUAGE PROCESSING JOURNAL
NAME: SOUNDARYA    ROLL NO: 246

PRACTICAL NO: 1
AIM:
Write a program to implement sentence segmentation and word tokenization.

THEORY:

CODES:
import nltk
from nltk.tokenize import word_tokenize

with open('soundarya.txt') as f:
    lines = f.readlines()

for content in lines:
    # segment each line into sentences, then tokenize each sentence into words
    for sentence in nltk.sent_tokenize(content):
        print("Sentence is:", sentence)
        print("Tokens are:", word_tokenize(sentence))
        print()
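
Note: sent_tokenize and word_tokenize rely on NLTK's "punkt" tokenizer models. A minimal setup sketch, assuming the models have not been installed yet, is a one-time download before running the program:

import nltk
nltk.download('punkt')   # one-time download of the Punkt sentence tokenizer models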

TEXT DOCUMENT:

OUTPUT:


PRACTICAL NO: 2
AIM:
Write a program to implement stemming and lemmatization.

THEORY:


A] Stemming:

CODES:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import LancasterStemmer
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

def portstemming(words):
    ps = PorterStemmer()
    print("Porter Stemmer")
    for word in words:
        print(word, "--->", ps.stem(word))

def snowballstemming(words):
    snowball = SnowballStemmer(language='english')
    print("Snowball Stemmer")
    for word in words:
        print(word, "--->", snowball.stem(word))

def lancasterstemming(words):
    lancaster = LancasterStemmer()
    print("Lancaster Stemmer")
    for word in words:
        print(word, "--->", lancaster.stem(word))

print("Select operation.")
print("1.Porter Stemmer")
print("2.Snowball Stemmer")
print("3.Lancaster Stemmer")

while True:
    choice = input("Enter choice(1/2/3): ")
    if choice in ('1', '2', '3'):
        if choice == '1':
            portstemming(words)
        elif choice == '2':
            snowballstemming(words)
        elif choice == '3':
            lancasterstemming(words)

        next_calculation = input("Do you want to do stemming again? (yes/no): ")
        if next_calculation == "no":
            break
    else:
        print("Invalid Input")

OUTPUT:


B] Lemmatization:

CODES:

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

OUTPUT:


PRACTICAL NO: 3
AIM:
Write a program to implement a trigram model.

THEORY:

CODES:
import nltk
from nltk.corpus import inaugural
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
with open("corpus.txt", "r") as f:
    corpus = f.read()


sents = nltk.sent_tokenize(corpus)
print("The number of sentences is", len(sents))
words = nltk.word_tokenize(corpus)
print("The number of tokens is", len(words))

average_tokens = round(len(words)/len(sents))
print("The average number of tokens per sentence is",average_tokens)
unique_tokens = set(words)
print("The number of unique tokens are", len(unique_tokens))
stop_words = set(stopwords.words('english'))
final_tokens = []
for each in words:
    # keep only the tokens that are not stopwords
    if each not in stop_words:
        final_tokens.append(each)
print("The number of total tokens after removing stopwords is", len(final_tokens))

unigram = []
bigram = []
trigram = []
fourgram = []
tokenized_text = []

for sentence in sents:
    sentence = sentence.lower()
    sequence = word_tokenize(sentence)
    # drop full stops; removing items while iterating over the list would skip tokens
    sequence = [word for word in sequence if word != "."]
    unigram.extend(sequence)
    tokenized_text.append(sequence)
    bigram.extend(list(ngrams(sequence, 2)))
    trigram.extend(list(ngrams(sequence, 3)))
    fourgram.extend(list(ngrams(sequence, 4)))
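
For reference, ngrams() from nltk.util simply slides a fixed-size window over the token list; a tiny standalone sketch:

from nltk.util import ngrams
# a window of size 3 over four tokens yields two trigrams
print(list(ngrams(['the', 'quick', 'brown', 'fox'], 3)))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]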

def removal(x):
    # keep only the n-grams that contain at least one non-stopword
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in stop_words:
                count = count or 0
            else:
                count = count or 1
        if count == 1:
            y.append(pair)
    return y

bigram = removal(bigram)
trigram = removal(trigram)
fourgram = removal(fourgram)

freq_bi = nltk.FreqDist(bigram)
freq_tri = nltk.FreqDist(trigram)
freq_four = nltk.FreqDist(fourgram)

print("Most common n-grams after filtering out all-stopword n-grams, without add-1 smoothing:\n")
print("Most common bigrams:", freq_bi.most_common(5))
print("\nMost common trigrams:", freq_tri.most_common(5))
print("\nMost common fourgrams:", freq_four.most_common(5))

str1 = 'after that alice said the' # use your sentence


str2 = 'alice felt so desperate that she was' # use your sentence

token_1 = word_tokenize(str1)
token_2 = word_tokenize(str2)
ngram_1 = {1:[], 2:[], 3:[]} #to store the n-grams formed
ngram_2 = {1:[], 2:[], 3:[]}
for i in range(3):
    # keep the last unigram, bigram and trigram of each test string as lookup keys
    ngram_1[i+1] = list(ngrams(token_1, i+1))[-1]
    ngram_2[i+1] = list(ngrams(token_2, i+1))[-1]
print("String 1:", ngram_1, "\nString 2:", ngram_2)

#Add-1 smoothing is performed here.

ngrams_all = {1:[], 2:[], 3:[], 4:[]}
for i in range(4):
    for each in tokenized_text:
        for j in ngrams(each, i+1):
            ngrams_all[i+1].append(j)

ngrams_voc = {1:set([]), 2:set([]), 3:set([]), 4:set([])}
for i in range(4):
    for gram in ngrams_all[i+1]:
        if gram not in ngrams_voc[i+1]:
            ngrams_voc[i+1].add(gram)
total_ngrams = {1:-1, 2:-1, 3:-1, 4:-1}
total_voc = {1:-1, 2:-1, 3:-1, 4:-1}
for i in range(4):

    total_ngrams[i+1] = len(ngrams_all[i+1])
    total_voc[i+1] = len(ngrams_voc[i+1])

ngrams_prob = {1:[], 2:[], 3:[], 4:[]}
for i in range(4):
    for ngram in ngrams_voc[i+1]:
        tlist = [ngram]
        tlist.append(ngrams_all[i+1].count(ngram))
        ngrams_prob[i+1].append(tlist)

for i in range(4):
    for ngram in ngrams_prob[i+1]:
        ngram[-1] = (ngram[-1] + 1) / (total_ngrams[i+1] + total_voc[i+1])  # add-1 smoothing

for i in range(4):
    ngrams_prob[i+1] = sorted(ngrams_prob[i+1], key=lambda x: x[1], reverse=True)
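
The add-1 (Laplace) estimate computed above is P(g) = (count(g) + 1) / (N + V), where N is the total number of n-grams of that order in the corpus and V is the number of distinct ones, so even an unseen n-gram receives the small non-zero probability 1 / (N + V). A tiny worked sketch with made-up counts (illustrative numbers only):

count, N, V = 3, 1000, 800        # hypothetical count, total n-grams, distinct n-grams
print((count + 1) / (N + V))      # smoothed probability of a seen n-gram, 4/1800
print((0 + 1) / (N + V))          # an unseen n-gram still gets 1/1800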

pred_1 = {1:[], 2:[], 3:[]}
for i in range(3):
    count = 0
    for each in ngrams_prob[i+2]:
        # to find predictions based on highest probability of n-grams:
        # an (i+2)-gram is a candidate when its prefix equals the last i+1 tokens of str1
        if each[0][:-1] == ngram_1[i+1]:
            count += 1
            pred_1[i+1].append(each[0][-1])
            if count == 5:
                break
    if count < 5:
        while count != 5:
            # if no word prediction is found, pad with NOT FOUND
            pred_1[i+1].append("NOT FOUND")
            count += 1

for i in range(4):
    ngrams_prob[i+1] = sorted(ngrams_prob[i+1], key=lambda x: x[1], reverse=True)

pred_2 = {1:[], 2:[], 3:[]}
for i in range(3):
    count = 0
    for each in ngrams_prob[i+2]:
        if each[0][:-1] == ngram_2[i+1]:
            count += 1
            pred_2[i+1].append(each[0][-1])
            if count == 5:
                break
    if count < 5:
        while count != 5:
            pred_2[i+1].append("\0")
            count += 1
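
The matching rule in both prediction loops is purely positional: an (i+2)-gram from the sorted probability list is offered as a prediction when everything except its last word equals the last i+1 tokens of the input string. A hypothetical illustration (the candidate trigram is made up, not taken from the corpus):

candidate = ('said', 'the', 'queen')          # made-up trigram, for illustration only
print(candidate[:-1] == ('said', 'the'))      # True, so 'queen' would be appended to pred_1[2]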

print("Next word predictions for the strings using the probability models of bigrams, trigrams,
and fourgrams\n")
print(str1)
print("Bigram model predictions: {}\nTrigram model predictions: {}\nFourgram model
predictions: {}\n" .format(pred_1[1], pred_1[2], pred_1[3]))
print(str2)
print("Bigram model predictions: {}\nTrigram model predictions: {}\nFourgram model
predictions: {}" .format(pred_2[1], pred_2[2], pred_2[3]))


OUTPUT:
