NLP Week 03

The document discusses language models in natural language processing, explaining their role in predicting upcoming words and assigning probabilities to sentences. It covers concepts such as n-grams, the chain rule of probability, and the Markov assumption, which simplifies the prediction process by considering only the most recent words. Additionally, it introduces methods for estimating probabilities, including maximum likelihood estimation and the use of logarithms to manage numerical underflow.


CSCS 366 – Intro. to NLP
Faizad Ullah

Language Models
• A language model (LM) is a machine learning model that predicts upcoming words.

• More formally, a language model assigns a probability to each possible next word, or equivalently gives a probability distribution over possible next words.

• Language models can also assign a probability to an entire sentence.
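
To make "a probability distribution over possible next words" concrete, here is a minimal illustrative sketch in Python; the context sentence and all probability values are invented for illustration, not produced by any real model:

# A toy "language model" for one fixed context, written as a plain dictionary.
# The context and the probabilities below are made up for illustration only.
context = "Islamabad is the capital of"
next_word_distribution = {
    "Pakistan": 0.90,
    "the": 0.04,
    "Punjab": 0.03,
    "a": 0.02,
    "pizza": 0.01,
}

# A proper distribution over next words sums to 1.
assert abs(sum(next_word_distribution.values()) - 1.0) < 1e-9

# "Predicting the upcoming word" means picking the most probable next word.
print(max(next_word_distribution, key=next_word_distribution.get))  # Pakistan
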
Language Models
Summary
• For independent events A and B: P(A, B) = P(A) · P(B)

• For independent events A, B, and C: P(A, B, C) = P(A) · P(B) · P(C)

• For dependent events A and B: P(A, B) = P(A) · P(B|A)

• For dependent events A, B, and C: P(A, B, C) = P(A) · P(B|A) · P(C|A, B)

Probability of the next word

• Islamabad is the capital of ___________

• The baby started ____ when she saw her mother.

• The weather today is very ____.

• ‫آج موسم بہت ____ ہے۔‬ (The weather today is very ____.)

• ‫میں صبح اٹھ کر سب سے پہلے ____ پیتا ہوں۔‬ (The first thing I drink after waking up in the morning is ____.)

• ‫بارش کے بعد ہوا بہت ____ ہو گئی۔‬ (After the rain, the air became very ____.)


Uses

• Machine Translation

• Spell Correction

• Speech Recognition

• Text Generation …
Probability of a sentence
• P(all of a sudden I notice three guys standing on the sidewalk)?

• P(on guys all I of notice sidewalk three a sudden standing the)?

• Can we estimate these probabilities simply by counting how often each full sentence occurs in a corpus?


N-Gram: Basics of Counting
• The data sources for language models are corpora (collections of text).

• We count word forms (e.g., "cats"), not lemmas (e.g., "cat").


N-Grams
• Let’s begin with the task of computing P(w|h), the probability of a word w given some history h.

• Suppose:
• h = “The water of Walden Pond is so beautifully”
• w = “blue”

• The probability that the next word is blue is:

• P(w|h) = P(blue | The water of Walden Pond is so beautifully)
N-Grams
• One way to estimate this probability is by relative frequency.
• Relative frequency: dividing the observed frequency of a particular sequence by the observed frequency of its prefix.
• This answers the question “Out of the times we saw the history h, how many times was it followed by the word w?”, as follows:

P(blue | The water of Walden Pond is so beautifully) =
    C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)
N-Grams
• If we had a large enough corpus, we could compute these two counts and estimate the probability.

• Let’s try it on the Google search engine:

• “FCCU is the best university in Pakistan”

• Even a corpus as large as the web gives zero (or very small) counts for many perfectly reasonable sentences like this one. Why?

• Because language is creative: new sentences are produced all the time, so we cannot rely on counting whole sentences or long histories.


Calculate the probability of a sentence
• We need more clever ways to estimate P(w|h), or the probability of an entire word sequence W.

• Let W = w1, w2, …, wn be a sentence.

• P(W) = P(w1, w2, …, wn)

• Now, the question is: how do we compute P(w1, w2, w3, …, wn)?

• P(x, y) = P(x) · P(y) if x and y are independent.
• P(x, y) = P(x) · P(y|x) otherwise.
Chain rule of probability
• The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words.

• We can estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities:

P(W) = P(w1) · P(w2|w1) · P(w3|w1 w2) · P(w4|w1 w2 w3) · … · P(wn|w1 w2 … wn-1)
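
For instance, applying the chain rule to the short sentence “I am Sam” (a sentence reused in the counting example later):

P(I am Sam) = P(I) · P(am | I) · P(Sam | I am)

Each factor conditions on the entire preceding history, so for long sentences the final factors condition on very long histories, which is exactly the problem the Markov assumption addresses next.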
The Markov assumption
• The assumption that the probability of a word depends only on the previous word is called a Markov assumption.
The Markov assumption (n-gram)
• The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words (written out after the list below).
• N=1 → Unigram

• N=2 → Bigram

• N=3 → Trigram

• N=4→ 4-gram
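
Written out, the bigram (N=2) version of the Markov assumption approximates the full history with just the preceding word, and the general n-gram version with the preceding N-1 words:

Bigram:  P(wn | w1 w2 … wn-1) ≈ P(wn | wn-1)
N-gram:  P(wn | w1 w2 … wn-1) ≈ P(wn | wn-N+1 … wn-1)

Under the bigram assumption, the chain-rule product becomes P(W) ≈ P(w1) · P(w2|w1) · P(w3|w2) · … · P(wn|wn-1).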
How to estimate probabilities
• An intuitive way to estimate these probabilities is called maximum likelihood estimation, or MLE.

• We get the MLE estimate for the parameters of an n-gram model by taking counts from a corpus and normalizing them so that they lie between 0 and 1.
How to estimate probabilities

• To compute the bigram probability of a word wn given a previous word wn-1:

1. Compute the count of the bigram C(wn-1 wn)

2. Normalize by the sum of all the bigram counts that share the same first word wn-1:
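
Written out (this is the standard MLE formula from the Jurafsky & Martin chapter cited in the Sources):

P(wn | wn-1) = C(wn-1 wn) / Σw C(wn-1 w) = C(wn-1 wn) / C(wn-1)

The simplification in the last step holds because the counts of all bigrams that start with wn-1 must sum to the unigram count of wn-1.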
How to estimate probabilities: an example
• Let’s work through an example using a mini-corpus of three sentences, augmenting each sentence with the markers <s> and </s>:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

• As before, we estimate each n-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix.
How to estimate probabilities: Unigram

Unigram Count Probability

I 3 3/|N|
Sam 2 2/|N|
am 2 2/|N|
do 1 1/|N|

(where |N| is the total number of word tokens in the corpus)

How to estimate probabilities: Bigram

Bigram Count Probability


<s> I 2 ?
I am 2 ?
am Sam 1 ?
Sam </s> 1 ?

How to estimate probabilities: Trigram

Trigram Count Probability


<s> <s> I 2 ?
<s> I am 1 ?
I am Sam 1 ?
am Sam </s> 1 ?
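
The counts in the tables above can be reproduced with a short Python sketch. This is a minimal illustration assuming the three-sentence mini-corpus given earlier, padding each sentence with one <s> and one </s> (the trigram table instead pads with two <s> markers); it fills in the bigram “?” cells:

from collections import Counter

# The three-sentence mini-corpus, each sentence padded with <s> and </s>.
sentences = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    tokens = ["<s>"] + s.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # MLE estimate: P(word | prev) = C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("<s>", "I"))     # 2/3
print(bigram_prob("I", "am"))      # 2/3
print(bigram_prob("am", "Sam"))    # 1/2
print(bigram_prob("Sam", "</s>"))  # 1/2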

Log probabilities
• Probabilities are always between 0 and 1.
• When we multiply many small probabilities, the result becomes even smaller and can approach zero (a problem called numerical underflow).
• To avoid this, we work in log space, where multiplication becomes addition:

log(p1 · p2 · p3 · p4) = log p1 + log p2 + log p3 + log p4

• We add log probabilities throughout, and only convert back (by exponentiating) at the end if we need a raw probability.
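
A small Python sketch of the underflow problem and the log-space fix; the probability values are made up purely for illustration:

import math

# 1,000 made-up word probabilities, each fairly small.
probs = [0.01] * 1000

# Multiplying them directly underflows to exactly 0.0 in floating point.
product = 1.0
for p in probs:
    product *= p
print(product)      # 0.0

# Summing log probabilities keeps the result representable.
log_product = sum(math.log(p) for p in probs)
print(log_product)  # about -4605.17, i.e. 1000 * log(0.01)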
Naïve Bayes Classifier

The Bayes Theorem
• Generally, we want the most probable hypothesis h given the training data D.

• Maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_h P(h|D) = argmax_h P(D|h) · P(h) / P(D) = argmax_h P(D|h) · P(h)

(P(D) can be dropped because it does not depend on h.)
Naïve Bayes Classifier
• Word counts in the training data:

Word     Not Spam (20 emails)   Spam (10 emails)
Dear     15                     10
Friend   8                      5
Money    1                      8
Bank     2                      7
Win      0                      6

• Priors: P(N) = ?  P(S) = ?
• Likelihoods: P(Vi|N) = ?  P(Vi|S) = ?
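
Below is a minimal Python sketch of how these counts could be used to score a message. The example message is invented, and add-1 (Laplace) smoothing is an assumption added here (the slide does not mention it) because Win never occurs in the not-spam emails:

import math

# Word counts from the slide: 20 not-spam emails and 10 spam emails.
counts = {
    "not_spam": {"Dear": 15, "Friend": 8, "Money": 1, "Bank": 2, "Win": 0},
    "spam":     {"Dear": 10, "Friend": 5, "Money": 8, "Bank": 7, "Win": 6},
}
n_emails = {"not_spam": 20, "spam": 10}
vocab = ["Dear", "Friend", "Money", "Bank", "Win"]

def log_score(words, label):
    # log P(label) + sum over words of log P(word | label), with add-1 smoothing.
    score = math.log(n_emails[label] / sum(n_emails.values()))
    total = sum(counts[label].values())
    for w in words:
        score += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return score

message = ["Dear", "Friend", "Win", "Money"]
for label in ("not_spam", "spam"):
    print(label, round(log_score(message, label), 3))
# The label with the higher (less negative) log score is the predicted class.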
Sources

• https://web.stanford.edu/~jurafsky/slp3/3.pdf

• https://web.stanford.edu/~jurafsky/slp3/4.pdf
