
NATURAL LANGUAGE PROCESSING JOURNAL
NAME: SOUNDARYA    ROLL NO: 246

PRACTICAL NO: 1
AIM:
Write a program to implement sentence segmentation and word tokenization.

THEORY:

CODES:
import nltk
from nltk.tokenize import word_tokenize

with open('soundarya.txt') as f:
    lines = f.readlines()

for content in lines:
    # segment each line into sentences, then tokenize each sentence into words
    for sentence in nltk.sent_tokenize(content):
        print("Sentence is:", sentence)
        print("Tokens are:", word_tokenize(sentence))
        print()
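
Note: sent_tokenize and word_tokenize rely on NLTK's "punkt" tokenizer models. A minimal setup sketch, assuming the models have not been installed yet, is a one-time download before running the program:

import nltk
nltk.download('punkt')   # one-time download of the Punkt sentence tokenizer models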

TEXT DOCUMENT:

OUTPUT:


PRACTICAL NO: 2
AIM:
Write a program to implement stemming and lemmatization.

THEORY:


A] Stemming:

CODES:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import LancasterStemmer
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

def portstemming(words):
    ps = PorterStemmer()
    print("Porter Stemmer")
    for word in words:
        print(word, "--->", ps.stem(word))

def snowballstemming(words):
    snowball = SnowballStemmer(language='english')
    print("Snowball Stemmer")
    for word in words:
        print(word, "--->", snowball.stem(word))

def lancasterstemming(words):
    lancaster = LancasterStemmer()
    print("Lancaster Stemmer")
    for word in words:
        print(word, "--->", lancaster.stem(word))

print("Select operation.")
print("1.Porter Stemmer")
print("2.Snowball Stemmer")
print("3.Lancaster Stemmer")

while True:
    choice = input("Enter choice(1/2/3): ")
    if choice in ('1', '2', '3'):
        if choice == '1':
            portstemming(words)
        elif choice == '2':
            snowballstemming(words)
        elif choice == '3':
            lancasterstemming(words)

        next_calculation = input("Do you want to do stemming again? (yes/no): ")
        if next_calculation == "no":
            break
    else:
        print("Invalid Input")

OUTPUT:


B] Lemmatization:

CODES:

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

OUTPUT:


PRACTICAL NO: 3
AIM:
Write a program to implement a trigram model.

THEORY:

CODES:
import nltk
from nltk.corpus import inaugural
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
with open("corpus.txt", "r") as f:
    corpus = f.read()


sents = nltk.sent_tokenize(corpus)
print("The number of sentences is", len(sents))
words = nltk.word_tokenize(corpus)
print("The number of tokens is", len(words))

average_tokens = round(len(words)/len(sents))
print("The average number of tokens per sentence is",average_tokens)
unique_tokens = set(words)
print("The number of unique tokens are", len(unique_tokens))
stop_words = set(stopwords.words('english'))
final_tokens = []
for each in words:
    # keep only the tokens that are not stopwords
    if each not in stop_words:
        final_tokens.append(each)
print("The number of total tokens after removing stopwords is", len(final_tokens))

unigram = []
bigram = []
trigram = []
fourgram = []
tokenized_text = []

for sentence in sents:
    sentence = sentence.lower()
    sequence = word_tokenize(sentence)
    # drop full stops; removing items while iterating over the list would skip tokens
    sequence = [word for word in sequence if word != "."]
    unigram.extend(sequence)
    tokenized_text.append(sequence)
    bigram.extend(list(ngrams(sequence, 2)))
    trigram.extend(list(ngrams(sequence, 3)))
    fourgram.extend(list(ngrams(sequence, 4)))
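
For reference, ngrams() from nltk.util simply slides a fixed-size window over the token list; a tiny standalone sketch:

from nltk.util import ngrams
# a window of size 3 over four tokens yields two trigrams
print(list(ngrams(['the', 'quick', 'brown', 'fox'], 3)))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]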

def removal(x):
    # keep only the n-grams that contain at least one non-stopword
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in stop_words:
                count = count or 0
            else:
                count = count or 1
        if count == 1:
            y.append(pair)
    return y

bigram = removal(bigram)
trigram = removal(trigram)
fourgram = removal(fourgram)

freq_bi = nltk.FreqDist(bigram)
freq_tri = nltk.FreqDist(trigram)
freq_four = nltk.FreqDist(fourgram)

print("Most common n-grams after filtering out all-stopword n-grams, without add-1 smoothing:\n")
print("Most common bigrams:", freq_bi.most_common(5))
print("\nMost common trigrams:", freq_tri.most_common(5))
print("\nMost common fourgrams:", freq_four.most_common(5))

str1 = 'after that alice said the' # use your sentence


str2 = 'alice felt so desperate that she was' # use your sentence

token_1 = word_tokenize(str1)
token_2 = word_tokenize(str2)
ngram_1 = {1:[], 2:[], 3:[]} #to store the n-grams formed
ngram_2 = {1:[], 2:[], 3:[]}
for i in range(3):
    # keep the last unigram, bigram and trigram of each test string as lookup keys
    ngram_1[i+1] = list(ngrams(token_1, i+1))[-1]
    ngram_2[i+1] = list(ngrams(token_2, i+1))[-1]
print("String 1:", ngram_1, "\nString 2:", ngram_2)

#Add-1 smoothing is performed here.

ngrams_all = {1:[], 2:[], 3:[], 4:[]}
for i in range(4):
    for each in tokenized_text:
        for j in ngrams(each, i+1):
            ngrams_all[i+1].append(j)

ngrams_voc = {1:set([]), 2:set([]), 3:set([]), 4:set([])}
for i in range(4):
    for gram in ngrams_all[i+1]:
        if gram not in ngrams_voc[i+1]:
            ngrams_voc[i+1].add(gram)
total_ngrams = {1:-1, 2:-1, 3:-1, 4:-1}
total_voc = {1:-1, 2:-1, 3:-1, 4:-1}
for i in range(4):

    total_ngrams[i+1] = len(ngrams_all[i+1])
    total_voc[i+1] = len(ngrams_voc[i+1])

ngrams_prob = {1:[], 2:[], 3:[], 4:[]}
for i in range(4):
    for ngram in ngrams_voc[i+1]:
        tlist = [ngram]
        tlist.append(ngrams_all[i+1].count(ngram))
        ngrams_prob[i+1].append(tlist)

for i in range(4):
    for ngram in ngrams_prob[i+1]:
        ngram[-1] = (ngram[-1] + 1) / (total_ngrams[i+1] + total_voc[i+1])  # add-1 smoothing

for i in range(4):
    ngrams_prob[i+1] = sorted(ngrams_prob[i+1], key=lambda x: x[1], reverse=True)
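
The add-1 (Laplace) estimate computed above is P(g) = (count(g) + 1) / (N + V), where N is the total number of n-grams of that order in the corpus and V is the number of distinct ones, so even an unseen n-gram receives the small non-zero probability 1 / (N + V). A tiny worked sketch with made-up counts (illustrative numbers only):

count, N, V = 3, 1000, 800        # hypothetical count, total n-grams, distinct n-grams
print((count + 1) / (N + V))      # smoothed probability of a seen n-gram, 4/1800
print((0 + 1) / (N + V))          # an unseen n-gram still gets 1/1800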

pred_1 = {1:[], 2:[], 3:[]}
for i in range(3):
    count = 0
    for each in ngrams_prob[i+2]:
        # to find predictions based on highest probability of n-grams:
        # an (i+2)-gram is a candidate when its prefix equals the last i+1 tokens of str1
        if each[0][:-1] == ngram_1[i+1]:
            count += 1
            pred_1[i+1].append(each[0][-1])
            if count == 5:
                break
    if count < 5:
        while count != 5:
            # if no word prediction is found, pad with NOT FOUND
            pred_1[i+1].append("NOT FOUND")
            count += 1

for i in range(4):
    ngrams_prob[i+1] = sorted(ngrams_prob[i+1], key=lambda x: x[1], reverse=True)

pred_2 = {1:[], 2:[], 3:[]}
for i in range(3):
    count = 0
    for each in ngrams_prob[i+2]:
        if each[0][:-1] == ngram_2[i+1]:
            count += 1
            pred_2[i+1].append(each[0][-1])
            if count == 5:
                break
    if count < 5:
        while count != 5:
            pred_2[i+1].append("\0")
            count += 1
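
The matching rule in both prediction loops is purely positional: an (i+2)-gram from the sorted probability list is offered as a prediction when everything except its last word equals the last i+1 tokens of the input string. A hypothetical illustration (the candidate trigram is made up, not taken from the corpus):

candidate = ('said', 'the', 'queen')          # made-up trigram, for illustration only
print(candidate[:-1] == ('said', 'the'))      # True, so 'queen' would be appended to pred_1[2]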

print("Next word predictions for the strings using the probability models of bigrams, trigrams,
and fourgrams\n")
print(str1)
print("Bigram model predictions: {}\nTrigram model predictions: {}\nFourgram model
predictions: {}\n" .format(pred_1[1], pred_1[2], pred_1[3]))
print(str2)
print("Bigram model predictions: {}\nTrigram model predictions: {}\nFourgram model
predictions: {}" .format(pred_2[1], pred_2[2], pred_2[3]))


OUTPUT:
