DeepLearning.AI
These slides are distributed under the Creative Commons License. DeepLearning.AI
makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or
distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides. For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
N-Grams:
Overview
deeplearning.ai
What you’ll be able to do!
Spelling correction
“He entered the ship to buy some groceries” - “ship” is a valid dictionary word, so a dictionary lookup alone will not flag the error
• P(entered the shop to buy) > P(entered the ship to buy)
Augmentative communication
Predict the most likely next word, offered on a menu, for people who are unable to physically talk or sign.
(Newell et al., 1998)
Learning objectives
● Process a text corpus into an N-gram language model
● Out of vocabulary words
● Smoothing for previously unseen N-grams
● Language model evaluation
Example application: sentence auto-complete
N-grams and Probabilities
deeplearning.ai
Outline
Unigram probability
Corpus: I am happy because I am learning
Size of corpus: m = 7
Probability of a unigram:
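The formula on this slide is a figure in the original deck; the standard estimate, with C(w) the count of word w and m the corpus size, is:
P(w) = C(w) / m,   e.g. P(I) = 2/7, P(happy) = 1/7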
Bigram probability
Corpus: I am happy because I am learning
“I happy” is not a bigram of this corpus: the two words are never adjacent, so C(I happy) = 0.
Probability of a bigram:
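The bigram formula is also a figure in the deck; the standard estimate conditions on the previous word:
P(y | x) = C(x y) / C(x),   e.g. P(am | I) = C(I am) / C(I) = 2/2 = 1, while P(happy | I) = C(I happy) / C(I) = 0/2 = 0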
Trigram Probability
Corpus: I am happy because I am learning
Probability of a trigram:
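The trigram estimate, reconstructed the same way:
P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2),   e.g. P(happy | I am) = C(I am happy) / C(I am) = 1/2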
N-gram probability
Probability of N-gram:
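And the general case, with w_1^{N-1} denoting the sequence w_1 … w_{N-1}:
P(w_N | w_1^{N-1}) = C(w_1^{N-1} w_N) / C(w_1^{N-1})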
Quiz
Question:
Corpus: “In every place of great resort the monster was the fashion. They sang of it in the cafes, ridiculed it in the papers, and represented it on
the stage. ” (Jules Verne, Twenty Thousand Leagues under the Sea)
In the context of our corpus, what is the probability of the word “papers” following the phrase “it in the”?
● Conditional probability and chain rule reminder
Probability of a sequence
Sentence not in corpus
● Problem: the corpus almost never contains the exact sentence we're interested in, or even its longer subsequences, so both of those counts are likely 0.
Approximation of sequence probability
the teacher drinks tea
Approximation of sequence probability
● Markov assumption: only last N words matter
● Bigram
● N-gram
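As a concrete illustration of the bigram approximation above, here is a minimal Python sketch; the corpus and test sentence are just the toy examples from these slides, and the function names are only illustrative.

from collections import Counter

corpus = "I am happy because I am learning".split()

# Count unigrams and bigrams in the toy corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = C(prev word) / C(prev); 0 if prev was never seen
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sentence):
    # Markov (bigram) approximation: product of P(w_i | w_{i-1})
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("I am learning"))  # P(am|I) * P(learning|am) = 1 * 0.5 = 0.5

Note that this product ignores the probability of the first word; the start token <s>, introduced next, addresses that.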
Question:
Given these conditional probabilities
P(Mary)=0.1; P(likes)=0.2; P(cats)=0.3
P(Mary|likes) =0.2; P(likes|Mary) =0.3; P(cats|likes)=0.1; P(likes|cats)=0.4
Approximate the probability of the following sentence with bigrams: “Mary likes cats”
Start of sentence token <s>
● Bigram: the teacher drinks tea => <s> the teacher drinks tea
● Trigram: the teacher drinks tea => <s> <s> the teacher drinks tea (prepend N-1 start tokens in general)
Corpus:
<s> Lyn drinks chocolate
<s> John drinks
End of sentence token </s> - motivation
Corpus:
<s> yes no
<s> yes yes
<s> no no
All possible sentences of length 2:
<s> yes yes
<s> yes no
<s> no yes
<s> no no
End of sentence token </s> - motivation
Corpus:
<s> yes no
<s> yes yes
<s> no no
All possible sentences of length 3:
<s> yes yes yes
<s> yes yes no
…
<s> no no no
End of sentence token </s> - motivation
Corpus
<s> yes no
<s> yes yes
<s> no no
End of sentence token </s> - solution
● Bigram
the teacher drinks tea => <s> the teacher drinks tea </s>
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
End of sentence token </s> for N-grams
E.g. Trigram:
the teacher drinks tea => <s> <s> the teacher drinks tea </s>
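A small Python sketch of the padding rule described above (prepend N-1 start tokens, append one end token); the helper name is only illustrative.

def add_sentence_tokens(sentence, n):
    # Prepend N-1 start tokens <s> and append a single end token </s>
    return ["<s>"] * (n - 1) + sentence.split() + ["</s>"]

print(add_sentence_tokens("the teacher drinks tea", 2))
# ['<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']
print(add_sentence_tokens("the teacher drinks tea", 3))
# ['<s>', '<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']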
Example - bigram
Corpus
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
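The computed values on this slide are figures in the original deck; recomputing a few bigram probabilities directly from the corpus above gives, for example:
P(Lyn | <s>) = C(<s> Lyn) / C(<s>) = 2/3
P(drinks | Lyn) = C(Lyn drinks) / C(Lyn) = 1/2
P(chocolate | drinks) = C(drinks chocolate) / C(drinks) = 1/2
P(</s> | chocolate) = C(chocolate </s>) / C(chocolate) = 2/2 = 1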
Quiz
Objective: Apply sequence probability approximation with bigrams after adding the start and end tokens.
Question:
Given these conditional probabilities
P(Mary)=0.1; P(likes)=0.2; P(cats)=0.3
P(Mary|<s>)=0.2; P(</s>|cats)=0.6
P(likes|Mary) =0.3; P(cats|likes)=0.1
Approximate the probability of the following sentence with bigrams: “<s> Mary likes cats </s>”
Type: Multiple Choice, single answer
Options and solution:
1. P(<s> Mary likes cats </s>) = 0
2. P(<s> Mary likes cats </s>) = 0.0036
3. P(<s> Mary likes cats </s>) = 0.003
4. P(<s> Mary likes cats </s>) = 1
The N-gram Language Model
deeplearning.ai
Outline
● Count matrix
● Probability matrix
● Language model
● Log probability to avoid underflow
● Generative language model
Count matrix
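The count-matrix, probability-matrix, and log-probability slides appear only as figures in the original deck; the sketch below is one plausible Python rendering of that pipeline (bigram count matrix, row-normalized probability matrix, log-probability scoring to avoid underflow), using the toy corpus from the earlier slides. The function and variable names are only illustrative.

import math
from collections import defaultdict

sentences = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]

# Count matrix: row = previous word, column = current word
counts = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        counts[prev][word] += 1

# Probability matrix: divide each row by its row sum
probs = {
    prev: {word: c / sum(row.values()) for word, c in row.items()}
    for prev, row in counts.items()
}

def log_prob(sentence):
    # Summing log probabilities avoids numerical underflow on long sentences
    words = sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        p = probs.get(prev, {}).get(word, 0.0)
        if p == 0.0:
            return float("-inf")  # unseen bigram; smoothing (covered later) addresses this
        total += math.log(p)
    return total

print(log_prob("<s> Lyn drinks tea </s>"))  # log(2/3 * 1/2 * 1/2 * 1) ≈ -1.79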
Question:
Given the logarithm of these conditional probabilities:
log(P(Mary|<s>))=-2; log(P(</s>|cats))=-1
log(P(likes|Mary)) =-10; log(P(cats|likes))=-100
Approximate the log probability of the following sentence with bigrams : “<s> Mary likes cats </s>”
1. log(P(<s> Mary likes cats </s>)) = -113
2. log(P(<s> Mary likes cats </s>)) = 2000
3. log(P(<s> Mary likes cats </s>)) = 113
4. log(P(<s> Mary likes cats </s>)) = -112
Language Model Evaluation
deeplearning.ai
Outline
● Train/Validation/Test split
● Perplexity
Test data
● Split the corpus into Training / Validation / Test datasets
Corpus => Training | Validation | Test
Evaluate on the Test dataset
Perplexity
E.g. m=100
[Figure from Speech and Language Processing by Dan Jurafsky et al.]
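The perplexity formula on this slide appears only as a figure; for a test set W with m words (conventionally not counting the <s> tokens), the standard definition, as in Jurafsky and Martin, is:
PP(W) = P(w_1, w_2, …, w_m)^(-1/m)
or, equivalently in log space for a bigram model,
log PP(W) = -(1/m) * Σ_{i=1..m} log_2 P(w_i | w_{i-1})
Lower perplexity means the model assigns higher probability to the test set, i.e. a better model.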
Quiz
Objective: Calculate log perplexity from log probabilities using sum and correct normalization coefficient (not
including <s>).
Question:
Given the logarithm of these conditional probabilities:
log(P(Mary|<s>))=-2; log(P(</s>|cats))=-1
log(P(likes|Mary)) =-10; log(P(cats|likes))=-100
Assuming our test set is W = “<s> Mary likes cats </s>”, what is the model’s perplexity?
Type: Multiple Choice, single answer
Options and solution:
Out of vocabulary words
● Unknown words
● Choosing vocabulary
● Closed vs. Open vocabularies
Vocabulary: Lyn, drinks, chocolate
Input query: <s> Adam drinks chocolate </s>
=> <s> <UNK> drinks chocolate </s>
How to create vocabulary V
● Criteria:
○ Min word frequency f
○ Max |V|, include words by frequency
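A minimal Python sketch of the first criterion above (minimum frequency; the max-|V| variant would instead keep the most frequent words up to the size limit). The corpus and function names here are only illustrative.

from collections import Counter

def build_vocabulary(words, min_freq=2):
    # Keep only words that occur at least min_freq times in the training corpus
    word_counts = Counter(words)
    return {w for w, c in word_counts.items() if c >= min_freq}

def replace_oov(words, vocabulary, unk="<UNK>"):
    # Map every out-of-vocabulary word to the <UNK> token
    return [w if w in vocabulary else unk for w in words]

corpus = "I am happy because I am learning".split()
vocab = build_vocabulary(corpus, min_freq=2)   # {'I', 'am'}
print(replace_oov(corpus, vocab))
# ['I', 'am', '<UNK>', '<UNK>', 'I', 'am', '<UNK>']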
Question:
Given the training corpus and a minimum word frequency of 2, what would the vocabulary look like after the corpus is preprocessed with <UNK>?
1. V = (I, am, happy)
2. V = (I, am, happy, learning, can, study)
3. V = (I, am, happy, I, am)
4. V = (I, am, happy, learning, can, study, <UNK>)
Smoothing
deeplearning.ai
Outline
● Missing N-grams in corpus
● Smoothing
Smoothing
● N-gram probabilities can be 0 for N-grams missing from the corpus
● Add-k smoothing
● Advanced methods: Kneser-Ney smoothing, Good-Turing smoothing
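The add-k formula itself is a figure in the slides; for a bigram, with vocabulary size |V|, the standard form is:
P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + k) / (C(w_{n-1}) + k * |V|)
With k = 1 this is add-one (Laplacian) smoothing; the k * |V| term in the denominator keeps each row of probabilities summing to 1.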
Backoff
● If N-gram missing => use (N-1)-gram, …
○ Probability discounting e.g. Katz backoff
○ “Stupid” backoff
Corpus
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Interpolation
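The interpolation formula is also a figure in the deck; the standard trigram version combines the trigram, bigram, and unigram estimates with weights that sum to 1:
P̂(w_n | w_{n-2} w_{n-1}) = λ1 * P(w_n | w_{n-2} w_{n-1}) + λ2 * P(w_n | w_{n-1}) + λ3 * P(w_n),   with λ1 + λ2 + λ3 = 1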
Quiz
Objective: Apply N-gram probability with add-k smoothing for a phrase not present in the corpus.
Question:
Corpus: “I am happy I am learning”
In the context of our corpus, what is the estimated probability of the word “can” following the word “I”, using the bigram model and add-k smoothing with k = 3?
1. P(can|I) = 0
2. P(can|I) = 1