A7 NLP Exp2

Roll No.: A07    Name: Yash Mokal
Class: BE–AI&DS    Batch: A1

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment 02
A.1 Aim: Implement a bi-gram model for 3 sentences using Python or NLTK.

A.2 Prerequisite:
Python programming

A.3 Outcome:
After successful completion of this experiment, students will be able to build a bi-gram model for 3 sentences using Python.
A.4 Theory:

What are N-Grams?


N-grams represent a contiguous sequence of N elements from a given text. In broad terms, these elements are not necessarily words; they can also be phonemes, syllables or letters, depending on what you'd like to accomplish.

In Natural Language Processing, however, N-grams most commonly refer to sequences of words, where N stands for the number of words you are looking at.

The following types of N-grams are usually distinguished:

● Unigram - An N-gram with a single string inside (for example, a unique word such as YouTube or TikTok from a given sentence, e.g. "YouTube is launching a new short-form video format that seems an awful lot like TikTok").

● 2-gram or Bigram - Typically a combination of two strings or words that appear in a document: short-form video or video format will likely be a result of searching for bigrams in a certain corpus of texts (but not format video or video short-form, as the word order is preserved).

● 3-gram or Trigram - An N-gram containing exactly three elements that are processed together (e.g. short-form video format or new short-form video), etc. A short extraction sketch follows this list.
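As a brief illustration, a minimal sketch of extracting these N-grams with NLTK (the variable names and the use of the example sentence above are ours, not part of the original experiment; nltk.download('punkt') may be needed once for the tokenizer):

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

sentence = "YouTube is launching a new short-form video format that seems an awful lot like TikTok"
tokens = word_tokenize(sentence.lower())

unigrams = list(ngrams(tokens, 1))   # single words such as ('youtube',)
bigrams = list(ngrams(tokens, 2))    # word pairs such as ('short-form', 'video')
trigrams = list(ngrams(tokens, 3))   # word triples such as ('short-form', 'video', 'format')

print(bigrams[:5])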

N-grams found their primary application in the area of probabilistic language models, as they estimate the probability of the next item in a word sequence.
This approach to language modeling assumes a tight relationship between the positions of the elements in a string, calculating the occurrence of the next word with respect to the previous ones. In particular, an N-gram model determines the probability of the next word from the preceding N-1 words.

For instance, a trigram model (with N = 3) will predict the next word in a string based on the preceding two words, since N-1 = 2.
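In standard notation (a textbook formulation added here for clarity, not spelled out in the original write-up), this assumption reads

P(w_n \mid w_1, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \ldots, w_{n-1})

so a bigram model conditions on one previous word and a trigram model on two.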

Another industrial application of N-gram models is plagiarism detection, where N-grams obtained from two different texts are compared with each other to determine the degree of similarity of the analysed documents.

NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia
and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination
of Python + NLTK means that you can easily add language-aware data products to your larger analytical
workflows and applications.

Building the N-gram Models

We now have some statistical parameters about the data, such as the number of unique tokens, which can be used while defining the vocabulary size of a model. Next, we create the following language models on the training corpus:
1. Unigram
2. Bigram
3. Trigram
4. Fourgram

We list the top 5 bigrams, trigrams and four-grams without smoothing. We remove those which contain only articles, prepositions or determiners, for example 'of the', 'in a', etc. Such words are called stopwords and can be removed using an inbuilt list from NLTK. Stopwords are English words which do not add much meaning to a sentence and can safely be ignored without sacrificing the meaning of the sentence.
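For reference, the inbuilt NLTK stopword list can be inspected as follows (a minimal sketch; the download is only needed once):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                       # one-time download of the stopword corpus
stop_words = set(stopwords.words('english'))
print(len(stop_words), sorted(stop_words)[:10])  # list size and a few sample entries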

• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word
sequences.

Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\,w_n)}{C(w_{n-1})}$

N-gram: $P(w_n \mid w_{n-N+1}, \ldots, w_{n-1}) = \frac{C(w_{n-N+1} \ldots w_{n-1}\,w_n)}{C(w_{n-N+1} \ldots w_{n-1})}$

• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional words.
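A minimal sketch of such padding, done by hand here rather than with any particular NLTK helper (the example sentence is illustrative):

from nltk.tokenize import word_tokenize

def pad_sentence(tokens):
    # surround each tokenized sentence with unique start and end symbols
    return ['<s>'] + tokens + ['</s>']

print(pad_sentence(word_tokenize("alice was beginning to get very tired")))
# ['<s>', 'alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', '</s>']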

A language model must be trained on a large corpus of text to estimate good parameter values. A model can be evaluated based on its ability to assign a high probability to a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate). Ideally, the training (and test) corpus should be representative of the actual application data. It may be necessary to adapt a general model to a small amount of new (in-domain) data by adding a highly weighted small corpus to the original training data.
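As an illustration of held-out evaluation (a hypothetical sketch; the function and the bigram_prob table are our own assumptions, not part of the experiment), the average log-probability of a test corpus under a bigram model could be computed as:

import math

def avg_log_prob(test_sentences, bigram_prob, unk=1e-8):
    # test_sentences: list of token lists; bigram_prob: dict mapping (w1, w2) -> P(w2 | w1)
    total, count = 0.0, 0
    for tokens in test_sentences:
        padded = ['<s>'] + tokens + ['</s>']
        for w1, w2 in zip(padded, padded[1:]):
            total += math.log(bigram_prob.get((w1, w2), unk))  # unseen pairs get a tiny floor probability
            count += 1
    return total / count  # higher (less negative) means the model predicts the corpus better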

For more information, refer to https://rstudio-pubs-static.s3.amazonaws.com/115676_ab6bb49748c742b88127e8b5ce3e1298.html

Sample code 1:

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

unigram = []
bigram = []
trigram = []
fourgram = []
tokenized_text = []

# sents is assumed to be defined earlier as the list of input sentences
for sentence in sents:
    sentence = sentence.lower()
    # tokenize and drop full stops (filtering avoids removing items while iterating)
    sequence = [word for word in word_tokenize(sentence) if word != '.']
    unigram.extend(sequence)
    tokenized_text.append(sequence)
    # unigram, bigram, trigram, and fourgram models are created
    bigram.extend(list(ngrams(sequence, 2)))
    trigram.extend(list(ngrams(sequence, 3)))
    fourgram.extend(list(ngrams(sequence, 4)))

def removal(x):
    # removes n-grams containing only stopwords
    y = []
    for pair in x:
        if any(word not in stop_words for word in pair):
            y.append(pair)
    return y

bigram = removal(bigram)
trigram = removal(trigram)
fourgram = removal(fourgram)

freq_bi = nltk.FreqDist(bigram)
freq_tri = nltk.FreqDist(trigram)
freq_four = nltk.FreqDist(fourgram)

print("Most common n-grams without stopword removal and without add-1 smoothing: \n")
print("Most common bigrams: ", freq_bi.most_common(5))
print("\nMost common trigrams: ", freq_tri.most_common(5))
print("\nMost common fourgrams: ", freq_four.most_common(5))
Most common n-grams without stopword removal

We can also remove stopwords entirely from our dataset and then build the n-gram models. Let us find the most common n-grams after removing all the stopwords from the dataset.
# nltk.download('stopwords')  # download the stop word list through NLTK if not already available
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# prints top 10 unigrams, bigrams after removing stopwords
print("Most common n-grams with stopword removal and without add-1 smoothing: \n")
unigram_sw_removed = [p for p in unigram if p not in stop_words]
fdist = nltk.FreqDist(unigram_sw_removed)
print("Most common unigrams: ", fdist.most_common(10))

bigram_sw_removed = []
bigram_sw_removed.extend(list(ngrams(unigram_sw_removed, 2)))
fdist = nltk.FreqDist(bigram_sw_removed)
print("\nMost common bigrams: ", fdist.most_common(10))

Most common n-grams after stopword removal

In order to get insights into the most common words used in a dataset, stopwords are generally removed to get
meaningful results.

What is Add-1 Smoothing?

We have used Maximum Likelihood Estimation (MLE) for training the parameters of an n-gram model. The
problem with MLE is that it assigns zero probability to unknown or unseen words. This is because MLE uses a
training corpus. If the word in the test set is not available in the training set, then the count of that particular word
is zero and it leads to zero probability. To eliminate this zero probability, we can do smoothing. Smoothing
involves taking some probability mass from the events seen in training and assigning it to unseen events. Add-1
smoothing or Laplace smoothing is a simple smoothing technique that adds 1 to the count of all n-grams in the
training set before normalizing them into probabilities.
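In its standard conditional form, the Laplace-smoothed bigram estimate is usually written as (note that the sample code further below instead normalizes each count by the total number of n-grams plus the n-gram vocabulary size):

P_{\text{add-1}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\,w_n) + 1}{C(w_{n-1}) + V}

where V is the number of distinct words in the vocabulary.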

Add-1 Smoothing

We can create a dictionary where, for each n-gram order, we store a list of [n-gram, probability] entries, i.e. every n-gram together with its associated probability. Add-1 smoothing is performed considering all the unique n-grams of each order in the tokenized text of our dataset as the vocabulary.
# Add-1 smoothing is performed here.
ngrams_all = {1: [], 2: [], 3: [], 4: []}
for i in range(4):
    for each in tokenized_text:
        for j in ngrams(each, i + 1):
            ngrams_all[i + 1].append(j)

# build the vocabulary of unique n-grams for each order
ngrams_voc = {1: set(), 2: set(), 3: set(), 4: set()}
for i in range(4):
    for gram in ngrams_all[i + 1]:
        if gram not in ngrams_voc[i + 1]:
            ngrams_voc[i + 1].add(gram)

total_ngrams = {1: -1, 2: -1, 3: -1, 4: -1}
total_voc = {1: -1, 2: -1, 3: -1, 4: -1}
for i in range(4):
    total_ngrams[i + 1] = len(ngrams_all[i + 1])
    total_voc[i + 1] = len(ngrams_voc[i + 1])

# store each unique n-gram together with its raw count
ngrams_prob = {1: [], 2: [], 3: [], 4: []}
for i in range(4):
    for ngram in ngrams_voc[i + 1]:
        tlist = [ngram]
        tlist.append(ngrams_all[i + 1].count(ngram))
        ngrams_prob[i + 1].append(tlist)

# add-1 smoothing: add 1 to each count before normalizing
for i in range(4):
    for ngram in ngrams_prob[i + 1]:
        ngram[-1] = (ngram[-1] + 1) / (total_ngrams[i + 1] + total_voc[i + 1])

This is a general method to perform add-1 smoothing for any n-gram (the range can be increased from 4 if
necessary). Let us see how this can be used to obtain the most common n-grams and their smoothed probabilities
for unigrams, bigrams, trigrams and fourgrams.
# Prints top 10 unigrams, bigrams, trigrams, fourgrams after smoothing
print("Most common n-grams without stopword removal and with add-1 smoothing: \n")
for i in range(4):
    ngrams_prob[i+1] = sorted(ngrams_prob[i+1], key=lambda x: x[1], reverse=True)

print("Most common unigrams: ", str(ngrams_prob[1][:10]))
print("\nMost common bigrams: ", str(ngrams_prob[2][:10]))
print("\nMost common trigrams: ", str(ngrams_prob[3][:10]))
print("\nMost common fourgrams: ", str(ngrams_prob[4][:10]))
Most common n-grams after add-1 smoothing:

We can also obtain the smoothed probabilities of each n-gram after removal of stopwords, according to the
requirement in context of the dataset.

Next Word Prediction:

Using the bigram, trigram, and fourgram models that we just built, we can predict the next word (the top 5 most probable candidates) given the previous n-gram for the sentences below. For str1 and str2, we predict the possible next words using the trained smoothed models. For example, for the string 'He looked very', the next word could be:

(1) ‘He looked very’ *anxiously*

(2) ‘He looked very’ *uncomfortable*


str1 = 'after that alice said the'
str2 = 'alice felt so desperate that she was'

We use the pre-trained smoothed models that were formed without removal of stopwords. We obtain n-grams of
the two strings to use as input to get the next word predictions.
# smoothed models without stopword removal are used
token_1 = word_tokenize(str1)
token_2 = word_tokenize(str2)

ngram_1 = {1: [], 2: [], 3: []}  # to store the n-grams formed from each string
ngram_2 = {1: [], 2: [], 3: []}
for i in range(3):
    ngram_1[i+1] = list(ngrams(token_1, i+1))[-1]
    ngram_2[i+1] = list(ngrams(token_2, i+1))[-1]

print("String 1: ", ngram_1, "\nString 2: ", ngram_2)

N-grams of the 2 strings

Now we use the n-grams of the strings to get the next word predictions based on the highest probabilities.
for i in range(4):
    ngrams_prob[i+1] = sorted(ngrams_prob[i+1], key=lambda x: x[1], reverse=True)

pred_1 = {1: [], 2: [], 3: []}

for i in range(3):
    count = 0
    for each in ngrams_prob[i+2]:
        # to find predictions based on highest probability of n-grams
        if each[0][:-1] == ngram_1[i+1]:
            count += 1
            pred_1[i+1].append(each[0][-1])
            if count == 5:
                break
    if count < 5:
        # if no word prediction is found, replace with NOT FOUND
        while count != 5:
            pred_1[i+1].append("NOT FOUND")
            count += 1

for i in range(4):
    ngrams_prob[i+1] = sorted(ngrams_prob[i+1], key=lambda x: x[1], reverse=True)

pred_2 = {1: [], 2: [], 3: []}

for i in range(3):
    count = 0
    for each in ngrams_prob[i+2]:
        if each[0][:-1] == ngram_2[i+1]:
            count += 1
            pred_2[i+1].append(each[0][-1])
            if count == 5:
                break
    if count < 5:
        while count != 5:
            pred_2[i+1].append("\0")  # placeholder when no prediction is found
            count += 1

Now we can print the next word predictions obtained for each n-gram model.
print("Next word predictions for the strings using the probability models of bigrams, trigrams, and
fourgrams\n")print("String 1 - after that alice said the-\n")print("Bigram model predictions: {}\nTrigram model
predictions: {}\nFourgram model predictions: {}\n" .format(pred_1[1], pred_1[2], pred_1[3]))print("String 2 -
alice felt so desperate that she was-\n")print("Bigram model predictions: {}\nTrigram model predictions:
{}\nFourgram model predictions: {}" .format(pred_2[1], pred_2[2], pred_2[3]))
Next word predictions for each n-gram model:

We see that str1 does not show any next words based on the fourgram model. This is because of the insufficient length of the string. We also observe that the fourgram model predictions are more accurate and contextually suitable than the trigram model predictions, which are in turn better than the bigram model predictions, and so on. This further shows that predictions get better when more adjacent words are considered while predicting the next word.

Sample code 2:
PART B
(PART B: TO BE COMPLETED BY STUDENTS)

B.1 Code Screenshot:

Unigram:

Bigram:
B.2 Input and Output:
Unigram:
- For the unigram model, we define the method build_unigram_model. In this method, we pass our tokenized words and calculate the probability as:
Probability of word = (count of the given word) / (total number of words)
- We return these probabilities as a dictionary (a sketch of this method is given below).
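Since the code screenshot is not reproduced here, a minimal sketch of what build_unigram_model might look like (the function body is our reconstruction from the description above, using Counter):

from collections import Counter

def build_unigram_model(tokens):
    # probability of a word = count of the word / total number of words
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}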
Input:

Output:

Bigram:
For the bigram model, we define the method build_bigram_model, in which we first take the counts of words.
- Then we use defaultdict, a subclass of the dictionary class that returns a dictionary-like object, to help in calculating the probabilities of bigrams. It also uses the unigram counts.
- The bigram probability is calculated as (a sketch of this method is given below):
Probability of bigram [n-1][n] = count of the pair / unigram count of the (n-1)th word
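Similarly, a minimal sketch of build_bigram_model using Counter and defaultdict (our reconstruction based on the description above, not the exact code from the screenshot):

from collections import Counter, defaultdict

def build_bigram_model(tokens):
    unigram_counts = Counter(tokens)                    # denominator: counts of the first word
    bigram_counts = Counter(zip(tokens, tokens[1:]))    # counts of adjacent word pairs
    model = defaultdict(dict)
    for (w1, w2), count in bigram_counts.items():
        # P(w2 | w1) = count of the pair / unigram count of w1
        model[w1][w2] = count / unigram_counts[w1]
    return model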
Input:

Output:

B.3 Conclusion:
Thus, we have successfully implemented an N-gram model using Python, and we have learnt and demonstrated basic n-gram model building using the Counter and defaultdict classes to calculate probabilities for groups of words from the given text.
