A7 NLP Exp2
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment 02
A.1 Aim: Implement a bi-gram model for 3 sentences using Python or NLTK.
A.2 Prerequisite:
Python programming
A.3 Outcome:
After successful completion of this experiment students will be able to build a bi-gram model for 3 sentences using Python.
A.4 Theory:
In Natural Language Processing, an N-gram most commonly refers to a contiguous sequence of words,
where N stands for the number of words in the sequence.
● Unigram - an N-gram consisting of a single string (for example, a single word such as YouTube or TikTok
taken from the sentence "YouTube is launching a new short-form video format that seems an awful lot like
TikTok").
● 2-gram or Bigram - a combination of two consecutive strings or words that appear in a document: short-form
video or video format would likely show up among the bigrams of a certain corpus of texts (but not format
video or video short-form, since the word order is preserved).
● 3-gram or Trigram - an N-gram containing three elements that are processed together (e.g. short-form video
format or new short-form video).
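As a quick illustration of these definitions, the short sketch below (not part of the lab code) extracts unigrams, bigrams and trigrams from the example sentence above using NLTK's ngrams utility; it assumes the NLTK tokenizer data ('punkt') has already been downloaded.

from nltk import word_tokenize
from nltk.util import ngrams

sentence = ("YouTube is launching a new short-form video format "
            "that seems an awful lot like TikTok")
tokens = word_tokenize(sentence.lower())

print(list(ngrams(tokens, 1)))   # unigrams, e.g. ('youtube',)
print(list(ngrams(tokens, 2)))   # bigrams,  e.g. ('short-form', 'video')
print(list(ngrams(tokens, 3)))   # trigrams, e.g. ('short-form', 'video', 'format')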
N-grams find their primary application in probabilistic language models, where they estimate the probability
of the next item in a word sequence.
This approach to language modeling assumes a tight relationship between the positions of the elements in a string,
calculating the occurrence of the next word with respect to the previous ones. In particular, an N-gram model
conditions the probability of a word on the preceding N-1 words.
For instance, a trigram model (with N = 3) predicts the next word in a string based on the preceding two words,
since N-1 = 2.
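The toy sketch below illustrates this conditioning: it counts trigrams in a tiny made-up text with nltk.ConditionalFreqDist and turns them into relative frequencies for the next word given the previous two words (the text and the chosen context are invented purely for illustration).

from nltk import ConditionalFreqDist, word_tokenize
from nltk.util import ngrams

text = "he looked very tired . he looked very happy . he looked very tired ."
tokens = word_tokenize(text.lower())

# condition = previous two words, sample = the word that follows them
cfd = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in ngrams(tokens, 3))

context = ('looked', 'very')
total = cfd[context].N()
for word, count in cfd[context].items():
    print(context, '->', word, count / total)   # e.g. 'tired': 2/3, 'happy': 1/3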
Another industrial application of N-gram models is plagiarism detection, where the N-grams obtained from two
different texts are compared with each other to measure the degree of similarity of the analysed documents.
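A minimal sketch of this idea is shown below: the bigram sets of two made-up documents are compared with Jaccard similarity (real plagiarism detectors are considerably more sophisticated).

from nltk import word_tokenize
from nltk.util import ngrams

def bigram_set(text):
    return set(ngrams(word_tokenize(text.lower()), 2))

doc1 = "the cat sat on the mat"
doc2 = "the cat sat on a mat"
a, b = bigram_set(doc1), bigram_set(doc2)

# Jaccard similarity of the two bigram sets (1.0 = identical, 0.0 = no overlap)
print(len(a & b) / len(a | b))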
NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia
and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination
of Python + NLTK means that you can easily add language-aware data products to your larger analytical
workflows and applications.
We now have some statistical parameters about the data, such as the number of unique tokens, which can be used
when defining the vocabulary size of a model. Next, we create the following language models on the training
corpus:
1. Unigram
2. Bigram
3. Trigram
4. Fourgram
We list the top 5 bigrams, trigrams and four-grams without smoothing, and remove those which contain only
articles, prepositions and determiners, for example 'of the', 'in a', etc. Such words are called stopwords and can
be removed using the inbuilt list from NLTK. Stopwords are English words which do not add much meaning to a
sentence and can safely be ignored without sacrificing its meaning.
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word
sequences.
bigram: P(w_n | w_(n-1)) = Count(w_(n-1) w_n) / Count(w_(n-1))
n-gram: P(w_n | w_(n-N+1) ... w_(n-1)) = Count(w_(n-N+1) ... w_n) / Count(w_(n-N+1) ... w_(n-1))
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional words.
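The sketch below puts these two points together on a toy three-sentence corpus (the sentences are illustrative only): each sentence is padded with <s> and </s>, and bigram conditional probabilities are estimated by relative frequency as in the formula above.

from collections import Counter
from nltk import word_tokenize
from nltk.util import ngrams

sents = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
padded = [['<s>'] + word_tokenize(s.lower()) + ['</s>'] for s in sents]

unigram_counts = Counter(w for sent in padded for w in sent)
bigram_counts = Counter(bg for sent in padded for bg in ngrams(sent, 2))

def bigram_prob(prev, word):
    # relative-frequency estimate of P(word | prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob('<s>', 'i'))   # 2 of the 3 sentences start with "i"  -> 2/3
print(bigram_prob('i', 'am'))    # "i" occurs 3 times, followed by "am" twice -> 2/3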
A language model must be trained on a large corpus of text to estimate good parameter values. The model can be
evaluated based on its ability to assign a high probability to a disjoint (held-out) test corpus (testing on the
training corpus would give an optimistically biased estimate). Ideally, the training (and test) corpus should be
representative of the actual application data. It may be necessary to adapt a general model to a small amount of
new (in-domain) data by adding a highly weighted small corpus to the original training data.
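As a rough illustration of training on one part of a corpus and evaluating on a disjoint held-out part, the sketch below uses NLTK's language-model API on a toy split; Laplace (add-1) smoothing is used here instead of a pure MLE model so that unseen bigrams do not produce infinite perplexity.

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import bigrams

train_sents = [['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired'],
               ['alice', 'was', 'reading', 'a', 'book']]     # toy training corpus
test_sent = ['alice', 'was', 'very', 'tired']                # held-out sentence

n = 2
train_data, padded_vocab = padded_everygram_pipeline(n, train_sents)
lm = Laplace(n)                     # add-1 smoothed bigram model
lm.fit(train_data, padded_vocab)

test_bigrams = list(bigrams(pad_both_ends(test_sent, n=n)))
print("Held-out perplexity:", lm.perplexity(test_bigrams))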
Sample code 1:

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# unigram, bigram, trigram, and fourgram models are created
unigram = []
bigram = []
trigram = []
fourgram = []
tokenized_text = []

for sentence in sents:                # sents: list of raw sentences from the corpus
    sentence = sentence.lower()
    sequence = word_tokenize(sentence)
    sequence = [word for word in sequence if word != '.']   # drop sentence-final periods
    unigram.extend(sequence)
    tokenized_text.append(sequence)
    bigram.extend(list(ngrams(sequence, 2)))
    trigram.extend(list(ngrams(sequence, 3)))
    fourgram.extend(list(ngrams(sequence, 4)))

def removal(x):
    # removes n-grams containing only stopwords
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in stop_words:
                count = count or 0
            else:
                count = count or 1
        if count == 1:
            y.append(pair)
    return y

bigram = removal(bigram)
trigram = removal(trigram)
fourgram = removal(fourgram)

freq_bi = nltk.FreqDist(bigram)
freq_tri = nltk.FreqDist(trigram)
freq_four = nltk.FreqDist(fourgram)

print("Most common n-grams without stopword removal and without add-1 smoothing: \n")
print("Most common bigrams: ", freq_bi.most_common(5))
print("\nMost common trigrams: ", freq_tri.most_common(5))
print("\nMost common fourgrams: ", freq_four.most_common(5))
Most common n-grams without stopword removal
We can also remove stopwords entirely from our dataset and find the n-gram models. Let us find the most common
n-grams in the dataset after removing all the stopwords from the dataset.
# download the stopword list through nltk if needed: nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# prints the top 10 unigrams and bigrams after removing stopwords
print("Most common n-grams with stopword removal and without add-1 smoothing: \n")

unigram_sw_removed = [p for p in unigram if p not in stop_words]
fdist = nltk.FreqDist(unigram_sw_removed)
print("Most common unigrams: ", fdist.most_common(10))

bigram_sw_removed = []
bigram_sw_removed.extend(list(ngrams(unigram_sw_removed, 2)))
fdist = nltk.FreqDist(bigram_sw_removed)
print("\nMost common bigrams: ", fdist.most_common(10))
In order to get insights into the most common words used in a dataset, stopwords are generally removed to get
meaningful results.
We have used Maximum Likelihood Estimation (MLE) for training the parameters of the n-gram models. The
problem with MLE is that it assigns zero probability to unknown or unseen words, because it relies only on counts
from the training corpus: if a word in the test set does not appear in the training set, its count is zero, which leads
to a zero probability. To eliminate these zero probabilities, we can apply smoothing. Smoothing takes some
probability mass from the events seen in training and assigns it to unseen events. Add-1 smoothing, or Laplace
smoothing, is a simple technique that adds 1 to the count of every n-gram in the training set before normalizing
the counts into probabilities.
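As a tiny worked example of this adjustment (with made-up counts), the snippet below replaces a raw count c by (c + 1) / (N + V), the same normalization used in the code in the next section, where N is the total number of n-grams of that order and V is the number of unique n-grams.

c, N, V = 3, 1000, 250            # illustrative counts only
p_unsmoothed = c / N              # 0.003
p_smoothed = (c + 1) / (N + V)    # 0.0032
p_unseen = (0 + 1) / (N + V)      # an unseen n-gram no longer has zero probability
print(p_unsmoothed, p_smoothed, p_unseen)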
Add-1 Smoothing
We can create a dictionary where each element is a list corresponding to a particular n-gram, and store every word
and its associated probability as elements of the list. Add-1 smoothing is performed considering all the unique
words in the tokenized text of our dataset as the vocabulary.
# Add-1 smoothing is performed here.
ngrams_all = {1: [], 2: [], 3: [], 4: []}
for i in range(4):
    for each in tokenized_text:
        for j in ngrams(each, i + 1):
            ngrams_all[i + 1].append(j)

ngrams_voc = {1: set(), 2: set(), 3: set(), 4: set()}
for i in range(4):
    for gram in ngrams_all[i + 1]:
        if gram not in ngrams_voc[i + 1]:
            ngrams_voc[i + 1].add(gram)

total_ngrams = {1: -1, 2: -1, 3: -1, 4: -1}
total_voc = {1: -1, 2: -1, 3: -1, 4: -1}
for i in range(4):
    total_ngrams[i + 1] = len(ngrams_all[i + 1])
    total_voc[i + 1] = len(ngrams_voc[i + 1])

# build ngrams_prob as [ngram, count] lists; each count is then replaced
# by its add-1 smoothed probability below
ngrams_prob = {1: [], 2: [], 3: [], 4: []}
for i in range(4):
    for gram, c in nltk.FreqDist(ngrams_all[i + 1]).items():
        ngrams_prob[i + 1].append([gram, c])

for i in range(4):
    for ngram in ngrams_prob[i + 1]:
        ngram[-1] = (ngram[-1] + 1) / (total_ngrams[i + 1] + total_voc[i + 1])   # add-1 smoothing
This is a general method to perform add-1 smoothing for any n-gram (the range can be increased from 4 if
necessary). Let us see how this can be used to obtain the most common n-grams and their smoothed probabilities
for unigrams, bigrams, trigrams and fourgrams.
# Prints the top 10 unigrams, bigrams, trigrams and fourgrams after smoothing
print("Most common n-grams without stopword removal and with add-1 smoothing: \n")
for i in range(4):
    ngrams_prob[i + 1] = sorted(ngrams_prob[i + 1], key=lambda x: x[1], reverse=True)
    print("Most common {}-grams: ".format(i + 1), ngrams_prob[i + 1][:10], "\n")
We can also obtain the smoothed probabilities of each n-gram after removal of stopwords, according to the
requirement in context of the dataset.
Using the bigram, trigram, and fourgram models that we just experimented with, we can predict the next word
(top 5 most probable) given the previous n-gram for the strings below. For the two strings str1 and str2, we predict
the possible next words using the trained smoothed models. For example, for the string 'He looked very' the next
words can be -
We use the pre-trained smoothed models that were formed without removal of stopwords. We obtain n-grams of
the two strings to use as input to get the next word predictions.
# smoothed models without stopwords removed are used
str1 = 'after that alice said the'                    # strings used in the prediction examples below
str2 = 'alice felt so desperate that she was'
token_1 = word_tokenize(str1)
token_2 = word_tokenize(str2)
ngram_1 = {1: [], 2: [], 3: []}                       # to store the n-grams formed from each string
ngram_2 = {1: [], 2: [], 3: []}
for i in range(3):
    ngram_1[i + 1] = list(ngrams(token_1, i + 1))[-1]   # last i+1 tokens of the string
    ngram_2[i + 1] = list(ngrams(token_2, i + 1))[-1]
print("String 1: ", ngram_1, "\nString 2: ", ngram_2)
Now we define a method that uses the n-grams of a string to get the next word predictions based on the highest probabilities.
# ensure each model is sorted by probability (highest first)
for i in range(4):
    ngrams_prob[i + 1] = sorted(ngrams_prob[i + 1], key=lambda x: x[1], reverse=True)

# collects the top 5 predicted next words for a given context
# from the bigram, trigram and fourgram models
def predict_next(context):
    pred = {1: [], 2: [], 3: []}
    for i in range(3):
        count = 0
        for each in ngrams_prob[i + 2]:            # bigram, trigram, fourgram models
            if each[0][:-1] == context[i + 1]:     # n-gram begins with the given context
                count += 1
                pred[i + 1].append(each[0][-1])
                if count == 5:
                    break
        if count < 5:
            while count != 5:
                # if no word prediction is found, replace with NOT FOUND
                pred[i + 1].append("NOT FOUND")
                count += 1
    return pred
Now we can call the above method for each n-gram to get the next word predictions according to each model.
print("Next word predictions for the strings using the probability models of bigrams, trigrams, and
fourgrams\n")print("String 1 - after that alice said the-\n")print("Bigram model predictions: {}\nTrigram model
predictions: {}\nFourgram model predictions: {}\n" .format(pred_1[1], pred_1[2], pred_1[3]))print("String 2 -
alice felt so desperate that she was-\n")print("Bigram model predictions: {}\nTrigram model predictions:
{}\nFourgram model predictions: {}" .format(pred_2[1], pred_2[2], pred_2[3]))
Next word predictions for each n-gram model:
We see that str1 does not show any next words based on the fourgram model. This is because of the insufficient
length of the string. We also observe that the fourgram model predictions are more accurate and contextually
suitable than the trigram model predictions, which are in turn better than the bigram model predictions. This
further shows that predictions get better when more adjacent words are considered while predicting the next
word.
Sample code 2:
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
Unigram:
Bigram:
B.2 Input and Output:
Unigram:
- For the unigram model, we define the method build_unigram_model. In this method, we pass our tokenized
words and calculate the probability of each word as:
Probability of word = (count of given word) / (total no. of words)
- We return these probabilities as a dictionary.
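A minimal sketch of a build_unigram_model along these lines is given below (the exact structure is illustrative; your own implementation may differ).

from collections import Counter

def build_unigram_model(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    # probability of word = count of given word / total no. of words
    return {word: count / total for word, count in counts.items()}

print(build_unigram_model(['the', 'cat', 'sat', 'on', 'the', 'mat']))
# 'the' -> 2/6, every other word -> 1/6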
Input:
Output:
Bigram:
- For the bigram model, we define the method build_bigram_model, in which we first take the counts of the words.
- Then we use defaultdict, a subclass of the dictionary class that returns a dictionary-like object with a default
value for missing keys, to help in calculating the probabilities of the bigrams. It also uses the unigram counts.
- The bigram probability is calculated as:
Probability of pair [word n-1, word n] = count of the pair / unigram count of word n-1
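A minimal sketch of a build_bigram_model along these lines, using Counter for the unigram counts and defaultdict for the bigram table, is given below (illustrative only).

from collections import Counter, defaultdict

def build_bigram_model(tokens):
    unigram_counts = Counter(tokens)
    model = defaultdict(dict)
    for w1, w2 in zip(tokens, tokens[1:]):
        model[w1][w2] = model[w1].get(w2, 0) + 1       # count of the pair (w1, w2)
    for w1 in model:
        for w2 in model[w1]:
            model[w1][w2] /= unigram_counts[w1]        # divide by unigram count of w1
    return dict(model)

print(build_bigram_model(['the', 'cat', 'sat', 'on', 'the', 'mat']))
# e.g. P(cat | the) = 0.5 and P(mat | the) = 0.5, since "the" occurs twice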
Input:
Output:
B.3 Conclusion:
Thus we have successfully implemented an N-gram model using Python, and learnt and demonstrated basic n-gram
model building using the Counter and defaultdict classes from Python's collections module to calculate
probabilities for groups of words from the given text.