A7 NLP Exp2
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment 02
A.1 Aim: Implement a bi-gram model for 3 sentences using Python or NLTK.
A.2 Prerequisite:
Python programming
A.3 Outcome:
After successful completion of this experiment students will be able to build a bi-gram model for 3 sentences using Python.
A.4 Theory:
In Natural Language Processing, an N-gram most commonly refers to a contiguous sequence of words,
where N stands for the number of words in the sequence.
● Unigram - an N-gram consisting of a single string (for example, a single word such as YouTube or TikTok
taken from the sentence "YouTube is launching a new short-form video format that seems an awful lot like
TikTok").
● 2-gram or Bigram - a combination of two consecutive strings or words that appear in a document: short-form
video or video format would likely show up among the bigrams of a certain corpus of texts (but not format
video or video short-form, since the word order is preserved).
● 3-gram or Trigram - an N-gram containing three elements that are processed together (e.g. short-form video
format or new short-form video).
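As a quick illustration of these definitions, the short sketch below (not part of the lab code) extracts unigrams, bigrams and trigrams from the example sentence above using NLTK's ngrams utility; it assumes the NLTK tokenizer data ('punkt') has already been downloaded.

from nltk import word_tokenize
from nltk.util import ngrams

sentence = ("YouTube is launching a new short-form video format "
            "that seems an awful lot like TikTok")
tokens = word_tokenize(sentence.lower())

print(list(ngrams(tokens, 1)))   # unigrams, e.g. ('youtube',)
print(list(ngrams(tokens, 2)))   # bigrams,  e.g. ('short-form', 'video')
print(list(ngrams(tokens, 3)))   # trigrams, e.g. ('short-form', 'video', 'format')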
N-grams find their primary application in probabilistic language models, where they estimate the probability
of the next item in a word sequence.
This approach to language modeling assumes a tight relationship between the positions of the elements in a string,
calculating the occurrence of the next word with respect to the previous ones. In particular, an N-gram model
conditions the probability of a word on the preceding N-1 words.
For instance, a trigram model (with N = 3) predicts the next word in a string based on the preceding two words,
since N-1 = 2.
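The toy sketch below illustrates this conditioning: it counts trigrams in a tiny made-up text with nltk.ConditionalFreqDist and turns them into relative frequencies for the next word given the previous two words (the text and the chosen context are invented purely for illustration).

from nltk import ConditionalFreqDist, word_tokenize
from nltk.util import ngrams

text = "he looked very tired . he looked very happy . he looked very tired ."
tokens = word_tokenize(text.lower())

# condition = previous two words, sample = the word that follows them
cfd = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in ngrams(tokens, 3))

context = ('looked', 'very')
total = cfd[context].N()
for word, count in cfd[context].items():
    print(context, '->', word, count / total)   # e.g. 'tired': 2/3, 'happy': 1/3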
Another industrial application of N-gram models is plagiarism detection, where the N-grams obtained from two
different texts are compared with each other to measure the degree of similarity of the analysed documents.
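A minimal sketch of this idea is shown below: the bigram sets of two made-up documents are compared with Jaccard similarity (real plagiarism detectors are considerably more sophisticated).

from nltk import word_tokenize
from nltk.util import ngrams

def bigram_set(text):
    return set(ngrams(word_tokenize(text.lower()), 2))

doc1 = "the cat sat on the mat"
doc2 = "the cat sat on a mat"
a, b = bigram_set(doc1), bigram_set(doc2)

# Jaccard similarity of the two bigram sets (1.0 = identical, 0.0 = no overlap)
print(len(a & b) / len(a | b))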
NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia
and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination
of Python + NLTK means that you can easily add language-aware data products to your larger analytical
workflows and applications.
We now have some statistical parameters about the data, such as the number of unique tokens, which can be used
when defining the vocabulary size of a model. Next, we create the following language models on the training
corpus:
1. Unigram
2. Bigram
3. Trigram
4. Fourgram
We list the top 5 bigrams, trigrams and four-grams without smoothing, and remove those which contain only
articles, prepositions and determiners, for example 'of the', 'in a', etc. Such words are called stopwords and can
be removed using the inbuilt list from NLTK. Stopwords are English words which do not add much meaning to a
sentence and can safely be ignored without sacrificing its meaning.
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word
sequences.
bigram: P(w_n | w_(n-1)) = Count(w_(n-1) w_n) / Count(w_(n-1))
n-gram: P(w_n | w_(n-N+1) ... w_(n-1)) = Count(w_(n-N+1) ... w_n) / Count(w_(n-N+1) ... w_(n-1))
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional words.
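The sketch below puts these two points together on a toy three-sentence corpus (the sentences are illustrative only): each sentence is padded with <s> and </s>, and bigram conditional probabilities are estimated by relative frequency as in the formula above.

from collections import Counter
from nltk import word_tokenize
from nltk.util import ngrams

sents = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
padded = [['<s>'] + word_tokenize(s.lower()) + ['</s>'] for s in sents]

unigram_counts = Counter(w for sent in padded for w in sent)
bigram_counts = Counter(bg for sent in padded for bg in ngrams(sent, 2))

def bigram_prob(prev, word):
    # relative-frequency estimate of P(word | prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob('<s>', 'i'))   # 2 of the 3 sentences start with "i"  -> 2/3
print(bigram_prob('i', 'am'))    # "i" occurs 3 times, followed by "am" twice -> 2/3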
A language model must be trained on a large corpus of text to estimate good parameter values. The model can be
evaluated based on its ability to assign a high probability to a disjoint (held-out) test corpus (testing on the
training corpus would give an optimistically biased estimate). Ideally, the training (and test) corpus should be
representative of the actual application data. It may be necessary to adapt a general model to a small amount of
new (in-domain) data by adding a highly weighted small corpus to the original training data.
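As a rough illustration of training on one part of a corpus and evaluating on a disjoint held-out part, the sketch below uses NLTK's language-model API on a toy split; Laplace (add-1) smoothing is used here instead of a pure MLE model so that unseen bigrams do not produce infinite perplexity.

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import bigrams

train_sents = [['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired'],
               ['alice', 'was', 'reading', 'a', 'book']]     # toy training corpus
test_sent = ['alice', 'was', 'very', 'tired']                # held-out sentence

n = 2
train_data, padded_vocab = padded_everygram_pipeline(n, train_sents)
lm = Laplace(n)                     # add-1 smoothed bigram model
lm.fit(train_data, padded_vocab)

test_bigrams = list(bigrams(pad_both_ends(test_sent, n=n)))
print("Held-out perplexity:", lm.perplexity(test_bigrams))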
Sample code 1:

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# unigram, bigram, trigram, and fourgram models are created
unigram = []
bigram = []
trigram = []
fourgram = []
tokenized_text = []

for sentence in sents:                # sents: list of raw sentences from the corpus
    sentence = sentence.lower()
    sequence = word_tokenize(sentence)
    sequence = [word for word in sequence if word != '.']   # drop sentence-final periods
    unigram.extend(sequence)
    tokenized_text.append(sequence)
    bigram.extend(list(ngrams(sequence, 2)))
    trigram.extend(list(ngrams(sequence, 3)))
    fourgram.extend(list(ngrams(sequence, 4)))

def removal(x):
    # removes n-grams containing only stopwords
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in stop_words:
                count = count or 0
            else:
                count = count or 1
        if count == 1:
            y.append(pair)
    return y

bigram = removal(bigram)
trigram = removal(trigram)
fourgram = removal(fourgram)

freq_bi = nltk.FreqDist(bigram)
freq_tri = nltk.FreqDist(trigram)
freq_four = nltk.FreqDist(fourgram)

print("Most common n-grams without stopword removal and without add-1 smoothing: \n")
print("Most common bigrams: ", freq_bi.most_common(5))
print("\nMost common trigrams: ", freq_tri.most_common(5))
print("\nMost common fourgrams: ", freq_four.most_common(5))
Most common n-grams without stopword removal
We can also remove stopwords entirely from our dataset and find the n-gram models. Let us find the most common
n-grams in the dataset after removing all the stopwords from the dataset.
# download the stopword list through nltk if needed: nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# prints the top 10 unigrams and bigrams after removing stopwords
print("Most common n-grams with stopword removal and without add-1 smoothing: \n")

unigram_sw_removed = [p for p in unigram if p not in stop_words]
fdist = nltk.FreqDist(unigram_sw_removed)
print("Most common unigrams: ", fdist.most_common(10))

bigram_sw_removed = []
bigram_sw_removed.extend(list(ngrams(unigram_sw_removed, 2)))
fdist = nltk.FreqDist(bigram_sw_removed)
print("\nMost common bigrams: ", fdist.most_common(10))
In order to get insights into the most common words used in a dataset, stopwords are generally removed to get
meaningful results.
We have used Maximum Likelihood Estimation (MLE) for training the parameters of the n-gram models. The
problem with MLE is that it assigns zero probability to unknown or unseen words, because it relies only on counts
from the training corpus: if a word in the test set does not appear in the training set, its count is zero, which leads
to a zero probability. To eliminate these zero probabilities, we can apply smoothing. Smoothing takes some
probability mass from the events seen in training and assigns it to unseen events. Add-1 smoothing, or Laplace
smoothing, is a simple technique that adds 1 to the count of every n-gram in the training set before normalizing
the counts into probabilities.
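As a tiny worked example of this adjustment (with made-up counts), the snippet below replaces a raw count c by (c + 1) / (N + V), the same normalization used in the code in the next section, where N is the total number of n-grams of that order and V is the number of unique n-grams.

c, N, V = 3, 1000, 250            # illustrative counts only
p_unsmoothed = c / N              # 0.003
p_smoothed = (c + 1) / (N + V)    # 0.0032
p_unseen = (0 + 1) / (N + V)      # an unseen n-gram no longer has zero probability
print(p_unsmoothed, p_smoothed, p_unseen)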
Add-1 Smoothing
We can create a dictionary where each element is a list corresponding to a particular n-gram, and store every word
and its associated probability as elements of the list. Add-1 smoothing is performed considering all the unique
words in the tokenized text of our dataset as the vocabulary.
# Add-1 smoothing is performed here.
ngrams_all = {1: [], 2: [], 3: [], 4: []}
for i in range(4):
    for each in tokenized_text:
        for j in ngrams(each, i + 1):
            ngrams_all[i + 1].append(j)

ngrams_voc = {1: set(), 2: set(), 3: set(), 4: set()}
for i in range(4):
    for gram in ngrams_all[i + 1]:
        if gram not in ngrams_voc[i + 1]:
            ngrams_voc[i + 1].add(gram)

total_ngrams = {1: -1, 2: -1, 3: -1, 4: -1}
total_voc = {1: -1, 2: -1, 3: -1, 4: -1}
for i in range(4):
    total_ngrams[i + 1] = len(ngrams_all[i + 1])
    total_voc[i + 1] = len(ngrams_voc[i + 1])

# build ngrams_prob as [ngram, count] lists; each count is then replaced
# by its add-1 smoothed probability below
ngrams_prob = {1: [], 2: [], 3: [], 4: []}
for i in range(4):
    for gram, c in nltk.FreqDist(ngrams_all[i + 1]).items():
        ngrams_prob[i + 1].append([gram, c])

for i in range(4):
    for ngram in ngrams_prob[i + 1]:
        ngram[-1] = (ngram[-1] + 1) / (total_ngrams[i + 1] + total_voc[i + 1])   # add-1 smoothing
This is a general method to perform add-1 smoothing for any n-gram (the range can be increased from 4 if
necessary). Let us see how this can be used to obtain the most common n-grams and their smoothed probabilities
for unigrams, bigrams, trigrams and fourgrams.
# Prints the top 10 unigrams, bigrams, trigrams and fourgrams after smoothing
print("Most common n-grams without stopword removal and with add-1 smoothing: \n")
for i in range(4):
    ngrams_prob[i + 1] = sorted(ngrams_prob[i + 1], key=lambda x: x[1], reverse=True)
    print("Most common {}-grams: ".format(i + 1), ngrams_prob[i + 1][:10], "\n")
We can also obtain the smoothed probabilities of each n-gram after removal of stopwords, according to the
requirement in context of the dataset.
Using the bigram, trigram, and fourgram models that we just experimented with, we can predict the next word
(top 5 most probable) given the previous n-gram for the strings below. For the two strings str1 and str2, we predict
the possible next words using the trained smoothed models. For example, for the string 'He looked very' the next
words can be -
We use the pre-trained smoothed models that were formed without removal of stopwords. We obtain n-grams of
the two strings to use as input to get the next word predictions.
# smoothed models without stopwords removed are used
str1 = 'after that alice said the'                    # strings used in the prediction examples below
str2 = 'alice felt so desperate that she was'
token_1 = word_tokenize(str1)
token_2 = word_tokenize(str2)
ngram_1 = {1: [], 2: [], 3: []}                       # to store the n-grams formed from each string
ngram_2 = {1: [], 2: [], 3: []}
for i in range(3):
    ngram_1[i + 1] = list(ngrams(token_1, i + 1))[-1]   # last i+1 tokens of the string
    ngram_2[i + 1] = list(ngrams(token_2, i + 1))[-1]
print("String 1: ", ngram_1, "\nString 2: ", ngram_2)
Now we define a method that uses the n-grams of a string to get the next word predictions based on the highest probabilities.
# ensure each model is sorted by probability (highest first)
for i in range(4):
    ngrams_prob[i + 1] = sorted(ngrams_prob[i + 1], key=lambda x: x[1], reverse=True)

# collects the top 5 predicted next words for a given context
# from the bigram, trigram and fourgram models
def predict_next(context):
    pred = {1: [], 2: [], 3: []}
    for i in range(3):
        count = 0
        for each in ngrams_prob[i + 2]:            # bigram, trigram, fourgram models
            if each[0][:-1] == context[i + 1]:     # n-gram begins with the given context
                count += 1
                pred[i + 1].append(each[0][-1])
                if count == 5:
                    break
        if count < 5:
            while count != 5:
                # if no word prediction is found, replace with NOT FOUND
                pred[i + 1].append("NOT FOUND")
                count += 1
    return pred
Now we can call the above method for each n-gram to get the next word predictions according to each model.
print("Next word predictions for the strings using the probability models of bigrams, trigrams, and
fourgrams\n")print("String 1 - after that alice said the-\n")print("Bigram model predictions: {}\nTrigram model
predictions: {}\nFourgram model predictions: {}\n" .format(pred_1[1], pred_1[2], pred_1[3]))print("String 2 -
alice felt so desperate that she was-\n")print("Bigram model predictions: {}\nTrigram model predictions:
{}\nFourgram model predictions: {}" .format(pred_2[1], pred_2[2], pred_2[3]))
Next word predictions for each n-gram model:
We see that str1 does not show any next words based on the fourgram model. This is because of the insufficient
length of the string. We also observe that the fourgram model predictions are more accurate and contextually
suitable than the trigram model predictions, which are in turn better than the bigram model predictions. This
further shows that predictions get better when more adjacent words are considered while predicting the next
word.
Sample code 2:
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
Unigram:
Bigram:
B.2 Input and Output:
Unigram:
- For the unigram model, we define the method build_unigram_model. In this method, we pass our tokenized
words and calculate the probability of each word as:
Probability of word = (count of given word) / (total no. of words)
- We return these probabilities as a dictionary.
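A minimal sketch of a build_unigram_model along these lines is given below (the exact structure is illustrative; your own implementation may differ).

from collections import Counter

def build_unigram_model(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    # probability of word = count of given word / total no. of words
    return {word: count / total for word, count in counts.items()}

print(build_unigram_model(['the', 'cat', 'sat', 'on', 'the', 'mat']))
# 'the' -> 2/6, every other word -> 1/6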
Input:
Output:
Bigram:
- For the bigram model, we define the method build_bigram_model, in which we first take the counts of the words.
- Then we use defaultdict, a subclass of the dictionary class that returns a dictionary-like object with a default
value for missing keys, to help in calculating the probabilities of the bigrams. It also uses the unigram counts.
- The bigram probability is calculated as:
Probability of pair [word n-1, word n] = count of the pair / unigram count of word n-1
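A minimal sketch of a build_bigram_model along these lines, using Counter for the unigram counts and defaultdict for the bigram table, is given below (illustrative only).

from collections import Counter, defaultdict

def build_bigram_model(tokens):
    unigram_counts = Counter(tokens)
    model = defaultdict(dict)
    for w1, w2 in zip(tokens, tokens[1:]):
        model[w1][w2] = model[w1].get(w2, 0) + 1       # count of the pair (w1, w2)
    for w1 in model:
        for w2 in model[w1]:
            model[w1][w2] /= unigram_counts[w1]        # divide by unigram count of w1
    return dict(model)

print(build_bigram_model(['the', 'cat', 'sat', 'on', 'the', 'mat']))
# e.g. P(cat | the) = 0.5 and P(mat | the) = 0.5, since "the" occurs twice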
Input:
Output:
B.3 Conclusion:
Thus we have successfully implemented an N-gram model using Python, and learnt and demonstrated basic n-gram
model building using the Counter and defaultdict classes from Python's collections module to calculate
probabilities for groups of words from the given text.