NLP Exp03

The document describes using an n-gram model to calculate bigrams from a given corpus and determine the probability of sentences. It explains how to generate a bigram table by counting word pairs and calculating probabilities, and shows code to preprocess a corpus, generate word counts and n-grams, and calculate probabilities for a new sentence. The conclusion discusses how n-gram models are commonly used for natural language processing tasks and have benefits of simplicity and scalability.

Department of Computer Engineering

Semester B.E. Semester VII – Computer Engineering


Subject NLP Lab
Subject Professor In-charge Prof. Suja Jayachandran
Assisting Teachers Prof. Suja Jayachandran

Student Name Sapana Angad Survase


Roll Number 20102B2005
Grade and Subject Teacher’s Signature

Experiment Number 3
Experiment Title To study the N-gram model: calculate bigrams from a given corpus and calculate the probability of a sentence.
Resources / Apparatus Required Hardware: Computer System; Web IDE: Google Colab; Programming language: Python

Description A combination of words forms a sentence. However, such a formation is meaningful only when the words are arranged in some order.

Eg: Sit I car in the

Such a sentence is not grammatically acceptable. However, some perfectly grammatical sentences can be nonsensical too!

Eg: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to strings of words, i.e., how likely the sentence is.

Probability of a sentence

If we consider each word occurring in its correct location as an event, the probability of a sentence is the joint probability:

P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule:

P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1) w(2)) * ... * P(w(n) | w(1) w(2) ... w(n-1))
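Eg: P(you read a book) = P(you) * P(read | you) * P(a | you read) * P(book | you read a)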

Bigrams

We can avoid this very long calculation by approximating that the probability of a given word depends only on its previous word. This assumption is called the Markov assumption, and such a model is called a Markov model. Bigrams can be generalized to the n-gram, which looks at (n-1) words in the past. A bigram is a first-order Markov model.

Therefore, P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(2) | w(1)) * P(w(3) | w(2)) * ... * P(w(n) | w(n-1))
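Eg: P(you read a book) ≈ P(read | you) * P(a | read) * P(book | a)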

We use an (eos) tag to mark the beginning and end of each sentence.

A bigram table for a given corpus can be generated and used as a lookup table for calculating the probability of sentences.

Eg: Corpus - (eos) You book a flight (eos) I read a book (eos) You read (eos)

Bigram Table (each cell is P(column word | row word) = count(row word followed by column word) / count(row word)):

         eos    you    book   a      flight i      read
eos      0      0.5    0      0      0      0.25   0
you      0      0      0.5    0      0      0      0.5
book     0.5    0      0      0.5    0      0      0
a        0      0      0.5    0      0.5    0      0
flight   1.0    0      0      0      0      0      0
i        0      0      0      0      0      0      1.0
read     0.5    0      0      0.5    0      0      0

P((eos) you read a book (eos))
= P(you | eos) * P(read | you) * P(a | read) * P(book | a) * P(eos | book)
= 0.5 * 0.5 * 0.5 * 0.5 * 0.5
= 0.03125
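
The same lookup can be checked with a minimal sketch in Python (an illustration only, using collections.Counter; the names tokens, bigram_probability, etc. are ours, and the full program below builds the complete table):

from collections import Counter

tokens = 'eos you book a flight eos i read a book eos you read eos'.split()
unigrams = Counter(tokens)                  # single-word counts
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word-pair counts

def bigram_probability(first, second):
  # P(second | first) = count(first second) / count(first)
  return bigrams[(first, second)] / unigrams[first]

sentence = 'eos you read a book eos'.split()
probability = 1.0
for first, second in zip(sentence, sentence[1:]):
  probability *= bigram_probability(first, second)
print(probability)  # 0.03125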

Program

import re

bigramProbability = []  # row-major table of P(second word | first word)
uniqueWords = []        # vocabulary of the corpus, in first-seen order

def preprocess(corpus):
  # Lowercase the text and mark sentence boundaries with an 'eos' token.
  corpus = 'eos ' + corpus.lower()
  corpus = corpus.replace('.', ' eos')
  return corpus

def generate_tokens(corpus):
  # Replace punctuation with spaces, then split into non-empty tokens.
  corpus = re.sub(r'[^a-zA-Z0-9\s]', ' ', corpus)
  tokens = [token for token in corpus.split(" ") if token != ""]
  return tokens

def generate_word_counts(wordList):
  # Count how many times each word occurs.
  wordCount = {}
  for word in wordList:
    if word not in wordCount:
      wordCount[word] = 1
    else:
      wordCount[word] += 1
  return wordCount

def generate_ngrams(tokens):
  # Pair each token with the next one to form bigrams,
  # e.g. ['eos', 'you', 'read'] -> ['eos you', 'you read'].
  ngrams = zip(*[tokens[i:] for i in range(2)])
  return [" ".join(ngram) for ngram in ngrams]

def print_probability_table():
  # Print the unique words as column headers, then one row per word.
  print('\nBigram Probability Table:\n')
  for word in uniqueWords:
    print('\t', word, end = ' ')
  print()
  for i in range(len(uniqueWords)):
    print(uniqueWords[i], end = ' ')
    probabilities = bigramProbability[i]
    for probability in probabilities:
      print('\t', probability, end = ' ')
    print()

def generate_bigram_table(corpus):
  # Build the table of P(secondWord | firstWord) for every word pair.
  corpus = preprocess(corpus)
  tokens = generate_tokens(corpus)
  wordCount = generate_word_counts(tokens)
  uniqueWords.extend(list(wordCount.keys()))
  bigrams = generate_ngrams(tokens)
  print(bigrams)
  for firstWord in uniqueWords:
    probabilityList = []
    for secondWord in uniqueWords:
      bigram = firstWord + ' ' + secondWord
      # P(second | first) = count(first second) / count(first)
      probability = bigrams.count(bigram) / wordCount[firstWord]
      probabilityList.append(probability)
    bigramProbability.append(probabilityList)
  print_probability_table()

def get_probability(sentence):
  # Multiply the table probabilities of consecutive word pairs.
  corpus = preprocess(sentence)
  tokens = generate_tokens(corpus)
  probability = 1
  for token in range(len(tokens) - 1):
    firstWord = tokens[token]
    secondWord = tokens[token + 1]
    pairProbability = bigramProbability[uniqueWords.index(firstWord)][uniqueWords.index(secondWord)]
    print('Probability: {1} | {0} = {2}'.format(firstWord, secondWord, pairProbability))
    probability *= pairProbability
  print('Probability of Sentence:', probability)

corpus = 'You book a flight. I read a book. You read.'
example = 'You read a book.'

print('Corpus:', corpus)
generate_bigram_table(corpus)

print('\nSentence:', example)
get_probability(example)

Output
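
If run as written, the program should print the bigram list, the bigram probability table, and the probability of each word pair in the example sentence; with the corpus above, the final line should read Probability of Sentence: 0.03125.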

Conclusion  An n-gram model is a type of probabilistic language model for predicting the next item in a sequence in the form of an (n − 1)-order Markov model.
 n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.
 Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability.
 With a larger n, a model can store more context with a well-understood space-time trade-off, enabling small experiments to scale up efficiently.
