Department of Computer Engineering
Semester B.E. Semester VII – Computer Engineering
Subject NLP Lab
Subject Professor In-charge Prof. Suja Jayachandran
Assisting Teachers Prof. Suja Jayachandran
Student Name Sapana Angad Survase
Roll Number 20102B2005
Grade and Subject Teacher’s Signature
Experiment Number 3
Experiment Title To study the N-gram model: calculate bigrams from a given
corpus and calculate the probability of a sentence.
Resources / Apparatus Required Hardware: Computer System; Web IDE: Google Colab;
Programming language: Python
Description A combination of words forms a sentence. However, such a formation
is meaningful only when the words are arranged in some order.
E.g.: Sit I car in the
Such a sentence is not grammatically acceptable. However, some
perfectly grammatical sentences can be nonsensical too!
E.g.: Colorless green ideas sleep furiously
One easy way to handle such unacceptable sentences is to assign
probabilities to strings of words, i.e., to estimate how likely a sentence is.
Probability of a sentence
The probability of a sentence is the joint probability of its words
occurring in that order: P(w(1), w(2), ..., w(n-1), w(n)).
Using the chain rule:
P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1) w(2)) * ... * P(w(n) | w(1) w(2) ... w(n-1))
Bigrams
We can avoid this very long calculation by approximating that the
probability of a word depends only on the word immediately preceding it.
This assumption is called the Markov assumption, and such a model is a
Markov model; with a one-word history it is the bigram model. Bigrams can
be generalized to the n-gram, which looks at the (n-1) words in the past.
A bigram is a first-order Markov model.
Therefore, P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(2)|w(1)) * P(w(3)|w(2)) * ... * P(w(n)|w(n-1))
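The same sliding-window idea yields n-grams for any n. Below is a minimal sketch (the helper name generate_ngrams_general is ours for illustration, not part of the lab program):

# Sketch of the n-gram sliding window: zip offsets n copies of the token
# list by 0..n-1 positions, so each tuple spans n consecutive tokens.
def generate_ngrams_general(tokens, n):
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

print(generate_ngrams_general("eos you read a book eos".split(), 3))
# ['eos you read', 'you read a', 'read a book', 'a book eos']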
We use an (eos) tag to mark the beginning and end of each sentence.
A bigram table for a given corpus can be generated and used as a
lookup table for calculating the probability of sentences.
Eg: Corpus - (eos) You book a flight (eos) I read a book (eos) You read
(eos)
Bigram Table (P(column | row) = count(row column) / count(row), computed from the corpus):
          (eos)   you    book   a      flight  i      read
(eos)     0       0.5    0      0      0       0.25   0
you       0       0      0.5    0      0       0      0.5
book      0.5     0      0      0.5    0       0      0
a         0       0      0.5    0      0.5     0      0
flight    1.0     0      0      0      0       0      0
i         0       0      0      0      0       0      1.0
read      0.5     0      0      0.5    0       0      0
P((eos) you read a book (eos))
= P(you|eos) * P(read|you) * P(a|read) * P(book|a) * P(eos|book)
= 0.5 * 0.5 * 0.5 * 0.5 * 0.5
= 0.03125
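The first factor can be verified directly from raw counts; this short sketch (illustrative, separate from the lab program below) recomputes P(you|eos) the same way the program does:

tokens = "eos you book a flight eos i read a book eos you read eos".split()
bigrams = list(zip(tokens, tokens[1:]))
# count(eos you) = 2 and count(eos) = 4, so P(you|eos) = 2/4
print(bigrams.count(("eos", "you")) / tokens.count("eos"))  # 0.5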
Program
import re

# Global state: the bigram probability matrix and the vocabulary,
# filled in by generate_bigram_table() and read by get_probability().
bigramProbability = []
uniqueWords = []

def preprocess(corpus):
    # Lowercase, prepend a sentence-start marker, and turn each
    # full stop into an 'eos' (end-of-sentence) marker.
    corpus = 'eos ' + corpus.lower()
    corpus = corpus.replace('.', ' eos')
    return corpus

def generate_tokens(corpus):
    # Replace remaining punctuation with spaces, then split on
    # whitespace, dropping empty strings.
    corpus = re.sub(r'[^a-zA-Z0-9\s]', ' ', corpus)
    tokens = [token for token in corpus.split(" ") if token != ""]
    return tokens

def generate_word_counts(wordList):
    # Count unigram frequencies.
    wordCount = {}
    for word in wordList:
        if word not in wordCount:
            wordCount[word] = 1
        else:
            wordCount[word] += 1
    return wordCount

def generate_ngrams(tokens):
    # Pair each token with its successor to form bigrams.
    ngrams = zip(*[tokens[i:] for i in range(2)])
    return [" ".join(ngram) for ngram in ngrams]

def print_probability_table():
    print('\nBigram Probability Table:\n')
    for word in uniqueWords:
        print('\t', word, end=' ')
    print()
    for i in range(len(uniqueWords)):
        print(uniqueWords[i], end=' ')
        for probability in bigramProbability[i]:
            print('\t', probability, end=' ')
        print()

def generate_bigram_table(corpus):
    corpus = preprocess(corpus)
    tokens = generate_tokens(corpus)
    wordCount = generate_word_counts(tokens)
    uniqueWords.extend(list(wordCount.keys()))
    bigrams = generate_ngrams(tokens)
    print(bigrams)
    # P(second | first) = count(first second) / count(first)
    for firstWord in uniqueWords:
        probabilityList = []
        for secondWord in uniqueWords:
            bigram = firstWord + ' ' + secondWord
            probability = bigrams.count(bigram) / wordCount[firstWord]
            probabilityList.append(probability)
        bigramProbability.append(probabilityList)
    print_probability_table()

def get_probability(sentence):
    # Multiply the bigram probabilities of consecutive token pairs.
    corpus = preprocess(sentence)
    tokens = generate_tokens(corpus)
    probability = 1
    for token in range(len(tokens) - 1):
        firstWord = tokens[token]
        secondWord = tokens[token + 1]
        pairProbability = bigramProbability[uniqueWords.index(firstWord)][uniqueWords.index(secondWord)]
        print('Probability: {1} | {0} = {2}'.format(firstWord, secondWord, pairProbability))
        probability *= pairProbability
    print('Probability of Sentence:', probability)

corpus = 'You book a flight. I read a book. You read.'
example = 'You read a book.'
print('Corpus:', corpus)
generate_bigram_table(corpus)
print('\nSentence:', example)
get_probability(example)
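As an optional cross-check (a sketch assuming NLTK is installed, not part of the lab program), the same conditional estimates can be obtained from NLTK's frequency-distribution classes. Note that NLTK's MLE normalizes by the number of bigrams whose first word is the condition, so the final eos, which has no successor, is excluded and P(you|eos) comes out as 2/3 rather than the 2/4 used above:

# Cross-check with NLTK (assumes: pip install nltk).
from nltk import bigrams
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

tokens = "eos you book a flight eos i read a book eos you read eos".split()
cfd = ConditionalFreqDist(bigrams(tokens))        # counts of (first, second) pairs
cpd = ConditionalProbDist(cfd, MLEProbDist)       # maximum-likelihood estimates
print(cpd["eos"].prob("you"))  # 0.666..., normalized over the 3 bigrams starting with eos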
Output
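(Running the program above should print, in order: the bigram list for the corpus, the bigram probability table, the per-pair probabilities for the example sentence, each 0.5, and finally Probability of Sentence: 0.03125.)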
Conclusion An n-gram model is a type of probabilistic language model for
predicting the next item in a sequence, in the form of an
(n-1)-order Markov model.
N-gram models are widely used in probability,
communication theory, computational linguistics (for
instance, statistical natural language processing),
computational biology (for instance, biological sequence
analysis), and data compression.
Two benefits of n-gram models (and of algorithms that use them)
are simplicity and scalability.
With a larger n, a model can store more context, with a
well-understood space-time trade-off, enabling small
experiments to scale up efficiently.