
Copyright Notice

These slides are distributed under the Creative Commons License. DeepLearning.AI
makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or
distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides. For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
N-Grams:
Overview
deeplearning.ai
What you’ll be able to do!

● Create a language model (LM) from a text corpus to:
  ○ Estimate the probability of word sequences
  ○ Estimate the probability of a word following a sequence of words
● Apply this to autocomplete a sentence with the most likely suggestions

Text corpus => Language model => given “Lyn is eating …”, suggest: “chocolate”, “eggs”, “toast”
Other Applications
Speech recognition

P(I saw a van) > P(eyes awe of an)

Spelling correction
“He entered the ship to buy some groceries”: “ship” is a dictionary word, so a dictionary-based spell checker will not flag it
● P(entered the shop to buy) > P(entered the ship to buy)

Augmentative communication
Predict the most likely word from a menu for people who are unable to physically talk or sign.
(Newell et al., 1998)
Learning objectives
● Process a text corpus into an N-gram language model
● Handle out of vocabulary words
● Smoothing for previously unseen N-grams
● Language model evaluation
(Application: sentence auto-complete)
N-grams and
Probabilities
deeplearning.ai
Outline

● What are N-grams?

● N-grams and conditional probabilities from a corpus


N-gram
An N-gram is a sequence of N words

Corpus: I am happy because I am learning

Unigrams: { I , am , happy , because , learning }

Bigrams: { I am , am happy , happy because , … }   (note: “I happy” is not a bigram, because the two words are not consecutive in the corpus)

Trigrams: { I am happy , am happy because, … }
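A minimal sketch in plain Python (not the course's assignment code) of extracting N-grams from the example corpus; it returns every occurrence in order, rather than the unique sets shown above:

corpus = "I am happy because I am learning".split()

def ngrams(words, n):
    # All sequences of n consecutive words, as tuples
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams(corpus, 1))  # unigrams: ('I',), ('am',), ('happy',), ...
print(ngrams(corpus, 2))  # bigrams: ('I', 'am'), ('am', 'happy'), ...
print(ngrams(corpus, 3))  # trigrams: ('I', 'am', 'happy'), ...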


Sequence notation

Corpus: This is great … teacher drinks tea.

● w_1^m = w_1 w_2 … w_m denotes the whole sequence of m words
● E.g. w_1^3 = This is great (the first three words)
● E.g. w_(m-2)^m = teacher drinks tea (the last three words)

Unigram probability
Corpus: I am happy because I am learning

Size of corpus m = 7

Probability of a unigram: P(w) = C(w) / m, e.g. P(I) = C(I)/m = 2/7, P(happy) = C(happy)/m = 1/7
Bigram probability
Corpus: I am happy because I am learning

Probability of a bigram: P(y|x) = C(x y) / C(x)
E.g. P(am|I) = C(I am)/C(I) = 2/2 = 1, but P(happy|I) = C(I happy)/C(I) = 0/2 = 0 (“I happy” never occurs)
Trigram Probability
Corpus: I am happy because I am learning

Probability of a trigram: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2), e.g. P(happy | I am) = C(I am happy)/C(I am) = 1/2
N-gram probability

Probability of an N-gram: P(w_N | w_1^(N-1)) = C(w_1^(N-1) w_N) / C(w_1^(N-1))
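A minimal sketch in plain Python (not the course's assignment code) of estimating P(w_N | w_1^(N-1)) directly from counts in the toy corpus above:

from collections import Counter

corpus = "I am happy because I am learning".split()

def count_ngrams(words, n):
    # Count every sequence of n consecutive words
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def ngram_prob(words, context, word):
    # P(word | context) = C(context word) / C(context)
    n = len(context)
    numerator = count_ngrams(words, n + 1)[tuple(context) + (word,)]
    denominator = count_ngrams(words, n)[tuple(context)]
    return numerator / denominator if denominator else 0.0

print(ngram_prob(corpus, ["I"], "am"))           # C(I am)/C(I) = 2/2 = 1.0
print(ngram_prob(corpus, ["I", "am"], "happy"))  # C(I am happy)/C(I am) = 1/2 = 0.5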
Quiz

Objective: Apply n-gram probability calculation on sample corpus and 3-gram.

Question:
Corpus: “In every place of great resort the monster was the fashion. They sang of it in the cafes, ridiculed it in the papers, and represented it on
the stage. ” (Jules Verne, Twenty Thousand Leagues under the Sea)

In the context of our corpus, what is the probability of the word “papers” following the phrase “it in the”?

Type: Multiple Choice, single answer


Options and solution:

1. P(papers|it in the) = 0
2. P(papers|it in the) = 1
3. P(papers|it in the) = 2/3
4. P(papers|it in the) = 1/2 (correct: P(papers|it in the) = C(it in the papers)/C(it in the) = 1/2)
Sequence
Probabilities
deeplearning.ai
Outline
● Sequence probability

● Sequence probability shortcomings

● Approximation by N-gram probabilities


Probability of a sequence
● Given a sentence, what is its probability? P(the teacher drinks tea) = ?
● Conditional probability and chain rule reminder:
  P(B|A) = P(A, B) / P(A)  =>  P(A, B) = P(A) P(B|A)
  P(A, B, C, D) = P(A) P(B|A) P(C|A, B) P(D|A, B, C)
Probability of a sequence: sentence not in corpus
● Problem: the corpus almost never contains the exact sentence we’re interested in, or even its longer subsequences!

Input: the teacher drinks tea
P(tea | the teacher drinks) = C(the teacher drinks tea) / C(the teacher drinks): both counts are likely 0
Approximation of sequence probability
the teacher drinks tea: P(tea | the teacher drinks) ≈ P(tea | drinks)
● Markov assumption: only the last N words matter
● Bigram: P(w_n | w_1^(n-1)) ≈ P(w_n | w_(n-1))
● N-gram: P(w_n | w_1^(n-1)) ≈ P(w_n | w_(n-N+1)^(n-1))
● Entire sentence modeled with bigrams:
  P(w_1^n) ≈ P(w_1) P(w_2|w_1) … P(w_n|w_(n-1))
  e.g. P(the teacher drinks tea) ≈ P(the) P(teacher|the) P(drinks|teacher) P(tea|drinks)
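A minimal sketch in plain Python (not the course's assignment code) of the bigram approximation of a sentence probability, with each P(w_i | w_(i-1)) estimated from counts in a toy corpus:

from collections import Counter

corpus = "I am happy because I am learning".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    # Markov (bigram) approximation: product of P(w_i | w_{i-1});
    # P(w_1) is omitted here since there is no start token yet
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("I am learning"))  # P(am|I) * P(learning|am) = 1.0 * 0.5 = 0.5

The next lessons add <s> and </s> so that the first and last words of a sentence can also be conditioned on a context.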


Quiz

Objective: Apply sequence probability approximation with bigrams.

Question:
Given these conditional probabilities
P(Mary)=0.1; P(likes)=0.2; P(cats)=0.3
P(Mary|likes) =0.2; P(likes|Mary) =0.3; P(cats|likes)=0.1; P(likes|cats)=0.4

Approximate the probability of the following sentence with bigrams: “Mary likes cats”

Type: Multiple Choice, single answer


Options and solution:

1. P(Mary likes cats) = 0 2. P(Mary likes cats) =1

3. P(Mary likes cats) = 0.003 4. P(Mary likes cats) = 0.008


Starting and
Ending
Sentences
deeplearning.ai
Outline

● Start of sentence symbols <s>

● End of sentence symbol </s>


Start of sentence token <s>

the teacher drinks tea  =>  <s> the teacher drinks tea

● With <s>, the first word is also conditioned on a context: P(the | <s>) instead of P(the)
Start of sentence token <s> for N-grams
● Trigram:

the teacher drinks tea => <s> <s> the teacher drinks tea

● N-gram model: add N-1 start tokens <s>


End of sentence token </s> - motivation

Corpus:
<s> Lyn drinks chocolate
<s> John drinks
End of sentence token </s> - motivation
Corpus:                  All possible sentences of length 2:
<s> yes no               <s> yes yes    <s> yes no
<s> yes yes              <s> no no      <s> no yes
<s> no no
End of sentence token </s> - motivation
Corpus:                  All possible sentences of length 3:
<s> yes no               <s> yes yes yes    <s> yes yes no
<s> yes yes              <s> no no no       …
<s> no no
End of sentence token </s> - motivation
Corpus:
<s> yes no
<s> yes yes
<s> no no
Without </s>, the bigram model’s probabilities for all sentences of any fixed length sum to 1, so sentence probabilities across different lengths do not form a single distribution.
End of sentence token </s> - solution

● Bigram
<s> the teacher drinks tea => <s> the teacher drinks tea </s>

Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
End of sentence token </s> for N-grams

● N-gram => just one </s>

E.g. Trigram:
the teacher drinks tea => <s> <s> the teacher drinks tea </s>
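A minimal sketch in plain Python of adding N-1 start tokens and a single end token to each sentence before counting N-grams (the helper name is illustrative, not from the course):

def add_sentence_markers(sentence, n):
    # Prepend n-1 start tokens <s> and append a single end token </s>
    return ["<s>"] * (n - 1) + sentence.split() + ["</s>"]

print(add_sentence_markers("the teacher drinks tea", 2))
# ['<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']
print(add_sentence_markers("the teacher drinks tea", 3))
# ['<s>', '<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']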
Example - bigram
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
E.g. P(Lyn|<s>) = C(<s> Lyn)/C(<s>) = 2/3, P(drinks|Lyn) = 1/2, P(</s>|chocolate) = 2/2 = 1
Quiz

Objective: Apply sequence probability approximation with bigrams after adding start and end word.

Question:
Given these conditional probabilities
P(Mary)=0.1; P(likes)=0.2; P(cats)=0.3
P(Mary|<s>)=0.2; P(</s>|cats)=0.6
P(likes|Mary) =0.3; P(cats|likes)=0.1

Approximate the probability of the following sentence with bigrams: “<s> Mary likes cats </s>”
Type: Multiple Choice, single answer
Options and solution:

1. P(<s> Mary likes cats </s>) = 0
2. P(<s> Mary likes cats </s>) = 0.0036 (correct: P(Mary|<s>) P(likes|Mary) P(cats|likes) P(</s>|cats) = 0.2 × 0.3 × 0.1 × 0.6 = 0.0036)
3. P(<s> Mary likes cats </s>) = 0.003
4. P(<s> Mary likes cats </s>) = 1
The N-gram
Language
Model
deeplearning.ai
Outline
● Count matrix
● Probability matrix
● Language model
● Log probability to avoid underflow
● Generative language model
Count matrix

● Rows: unique corpus (N-1)-grams


● Columns: unique corpus words

Corpus: <s> I study I learn </s>

● Bigram count matrix:

         <s>   </s>   I   study   learn
<s>       0     0     1     0       0
</s>      0     0     0     0       0
I         0     0     0     1       1
study     0     0     1     0       0
learn     0     1     0     0       0

(e.g. the bigram “study I” gives the 1 in row “study”, column “I”)
Probability matrix
● Divide each cell by its row sum

Corpus: <s> I study I learn </s>

Count matrix (bigram), with row sums:
         <s>   </s>   I   study   learn   sum
<s>       0     0     1     0       0      1
</s>      0     0     0     0       0      0
I         0     0     0     1       1      2
study     0     0     1     0       0      1
learn     0     1     0     0       0      1

Probability matrix:
         <s>   </s>   I   study   learn
<s>       0     0     1     0       0
</s>      0     0     0     0       0
I         0     0     0    0.5     0.5
study     0     0     1     0       0
learn     0     1     0     0       0
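A minimal sketch in plain Python (not the course's assignment code) of building the bigram count matrix and turning it into a probability matrix by row normalization:

from collections import Counter, defaultdict

corpus = "<s> I study I learn </s>".split()
vocab = sorted(set(corpus))

# Count matrix: counts[prev][word] = C(prev word)
counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    counts[prev][word] += 1

# Probability matrix: divide each row by its row sum
probs = {}
for prev in vocab:
    row_sum = sum(counts[prev].values())
    probs[prev] = {w: (counts[prev][w] / row_sum if row_sum else 0.0)
                   for w in vocab}

print(probs["I"]["study"], probs["I"]["learn"])  # 0.5 0.5
print(probs["</s>"])  # all zeros: </s> never starts a bigram in this corpus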
Language model
● Probability matrix => language model
  ○ Sentence probability
  ○ Next word prediction

Probability matrix:
         <s>   </s>   I   study   learn
<s>       0     0     1     0       0
</s>      0     0     0     0       0
I         0     0     0    0.5     0.5
study     0     0     1     0       0
learn     0     1     0     0       0

Sentence probability:
P(<s> I learn </s>) = P(I|<s>) P(learn|I) P(</s>|learn) = 1 × 0.5 × 1 = 0.5
Log probability
● All probabilities in the calculation are <= 1, and multiplying many of them risks numerical underflow

● Logarithm properties reminder: log(a × b) = log(a) + log(b)

● Use the log of the probabilities in the probability matrix and sum them:
  log P(w_1^n) ≈ Σ_i log P(w_i | w_(i-1))

● Convert back from log space with the exponential: p = e^(log p)
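A minimal sketch in plain Python of summing log bigram probabilities instead of multiplying the probabilities, using the values from the probability matrix above:

import math

# Bigram probabilities taken from the probability matrix above
bigram_probs = {("<s>", "I"): 1.0, ("I", "learn"): 0.5, ("learn", "</s>"): 1.0}

def sentence_log_prob(words):
    # Sum log P(w_i | w_{i-1}) to avoid underflow
    return sum(math.log(bigram_probs[(prev, w)]) for prev, w in zip(words, words[1:]))

log_p = sentence_log_prob(["<s>", "I", "learn", "</s>"])
print(log_p)            # log(1) + log(0.5) + log(1) = -0.693...
print(math.exp(log_p))  # convert back: 0.5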


Generative Language model
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>

Example choices while generating:
1. (<s>, Lyn) or (<s>, John)?
2. (Lyn, eats) or (Lyn, drinks)?
3. (drinks, tea) or (drinks, chocolate)?
4. (tea, </s>) - always
Algorithm:
1. Choose sentence start
2. Choose next bigram starting with previous word
3. Continue until </s> is picked
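A minimal sketch in plain Python of this generation loop, sampling each next word in proportion to its bigram count in the corpus above (the helper names are illustrative, not from the course):

import random
from collections import Counter, defaultdict

sentences = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]

# For each previous word, count the possible next words
next_counts = defaultdict(Counter)
for s in sentences:
    words = s.split()
    for prev, word in zip(words, words[1:]):
        next_counts[prev][word] += 1

def generate(max_len=20):
    # Start at <s>, repeatedly sample the next word from P(. | previous word),
    # and stop when </s> is picked
    word, output = "<s>", []
    for _ in range(max_len):
        candidates = list(next_counts[word].keys())
        weights = list(next_counts[word].values())
        word = random.choices(candidates, weights=weights)[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

print(generate())  # e.g. "Lyn drinks tea" or "John drinks chocolate"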
Quiz

Objective: Apply sum when calculating log probability instead of product.

Question:
Given the logarithm of these conditional probabilities:
log(P(Mary|<s>))=-2; log(P(</s>|cats))=-1
log(P(likes|Mary)) =-10; log(P(cats|likes))=-100

Approximate the log probability of the following sentence with bigrams : “<s> Mary likes cats </s>”

Type: Multiple Choice, single answer


Options and solution:

1. log(P(<s> Mary likes cats </s>)) = -113 (correct: (-2) + (-10) + (-100) + (-1) = -113)
2. log(P(<s> Mary likes cats </s>)) = 2000
3. log(P(<s> Mary likes cats </s>)) = 113
4. log(P(<s> Mary likes cats </s>)) = -112
Language
Model
Evaluation
deeplearning.ai
Outline
● Train/Validation/Test split

● Perplexity
Test data
● Split the corpus into Train/Validation/Test sets; evaluate on held-out data, never on the training data

● For smaller corpora:
  ○ 80% Train
  ○ 10% Validation
  ○ 10% Test

● For large corpora (typical for text):
  ○ 98% Train
  ○ 1% Validation
  ○ 1% Test
Test data - split method
● Continuous text: take contiguous blocks of the corpus for Validation and Test
● Random short sequences: sample short sequences from across the corpus
(Figure: the corpus partitioned into Training, Validation, and Test portions)
Perplexity

PP(W) = P(s_1, s_2, …, s_m)^(-1/m)

W → test set containing the sentences s_i
s_i → i-th sentence in the test set, each ending with </s>
m → number of all words in the entire test set W, including </s> but not including <s>
Perplexity

E.g. with m = 100 words, PP(W) = P(W)^(-1/100)

● Smaller perplexity = better model

● Character-level models have lower perplexity (PP) than word-based models
Perplexity for bigram models

PP(W) = ( ∏_(i=1)^m ∏_(j=1)^(|s_i|) 1 / P(w_j^(i) | w_(j-1)^(i)) )^(1/m)

w_j^(i) → j-th word in the i-th sentence

● Concatenating all sentences in W into one sequence gives

PP(W) = ( ∏_(i=1)^m 1 / P(w_i | w_(i-1)) )^(1/m)

w_i → i-th word in the test set
Log perplexity

log PP(W) = -(1/m) Σ_(i=1)^m log P(w_i | w_(i-1))

Example: training on 38 million words and testing on 1.5 million words of the WSJ corpus gives perplexities of 962 (unigram), 170 (bigram), and 109 (trigram).

[Figure from Speech and Language Processing by Dan Jurafsky et al.]
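A minimal sketch in plain Python of computing log perplexity over a tiny test sequence from hypothetical bigram probabilities, normalizing by the word count m that includes </s> but not <s>:

import math

# Hypothetical bigram probabilities for a tiny test sequence
bigram_probs = {("<s>", "I"): 1.0, ("I", "learn"): 0.5, ("learn", "</s>"): 1.0}

def log_perplexity(words):
    # log PP(W) = -(1/m) * sum of log2 P(w_i | w_{i-1}),
    # where m counts </s> but not <s>; any log base works if used consistently
    m = sum(1 for w in words if w != "<s>")
    log_sum = sum(math.log2(bigram_probs[(prev, w)]) for prev, w in zip(words, words[1:]))
    return -log_sum / m

log_pp = log_perplexity(["<s>", "I", "learn", "</s>"])
print(log_pp)        # -(0 + (-1) + 0) / 3 = 0.333...
print(2 ** log_pp)   # perplexity ≈ 1.26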
Quiz

Objective: Calculate log perplexity from log probabilities using sum and correct normalization coefficient (not
including <s>).
Question:
Given the logarithm of these conditional probabilities:
log(P(Mary|<s>))=-2; log(P(</s>|cats))=-1
log(P(likes|Mary)) =-10; log(P(cats|likes))=-100

Assuming our test set is W = “<s> Mary likes cats </s>”, what is the model’s log perplexity?
Type: Multiple Choice, single answer
Options and solution:

1. log PP(W) = -113
2. log PP(W) = (-1/4) × (-113) (correct: m = 4 words, counting </s> but not <s>)
3. log PP(W) = (-1/5) × (-113)
4. log PP(W) = (-1/5) × 113


Out of
Vocabulary
Words
deeplearning.ai
Outline

● Unknown words

● Update corpus with <UNK>

● Choosing vocabulary
Out of vocabulary words
● Closed vs. open vocabularies: a closed vocabulary assumes every input word is known; an open vocabulary allows unseen words

● Unknown word = out of vocabulary (OOV) word

● Use the special tag <UNK> in the corpus and in the input


Using <UNK> in corpus
● Create vocabulary V

● Replace any word in the corpus that is not in V with <UNK>

● Estimate the probabilities treating <UNK> like any other word


Example
Min frequency f = 2  =>  Vocabulary V = {Lyn, drinks, chocolate}

Original corpus:                      Corpus with <UNK>:
<s> Lyn drinks chocolate </s>         <s> Lyn drinks chocolate </s>
<s> John drinks tea </s>              <s> <UNK> drinks <UNK> </s>
<s> Lyn eats chocolate </s>           <s> Lyn <UNK> chocolate </s>

Input query:
<s> Adam drinks chocolate </s>  =>  <s> <UNK> drinks chocolate </s>
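A minimal sketch in plain Python (not the course's assignment code) of building the vocabulary by minimum frequency and replacing out-of-vocabulary words with <UNK>:

from collections import Counter

sentences = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]

min_freq = 2
word_counts = Counter(w for s in sentences for w in s.split() if w not in ("<s>", "</s>"))
vocab = {w for w, c in word_counts.items() if c >= min_freq}  # {'Lyn', 'drinks', 'chocolate'}

def replace_oov(sentence):
    # Keep <s>, </s>, and in-vocabulary words; replace everything else with <UNK>
    return " ".join(w if w in vocab or w in ("<s>", "</s>") else "<UNK>"
                    for w in sentence.split())

print([replace_oov(s) for s in sentences])
print(replace_oov("<s> Adam drinks chocolate </s>"))  # <s> <UNK> drinks chocolate </s>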
How to create vocabulary V
● Criteria:
○ Min word frequency f
○ Max |V|, include words by frequency

● Use <UNK> sparingly

● Perplexity: only compare language models that use the same vocabulary V


Quiz

Objective: Create corpus vocabulary based on minimum frequency.

Question:
Given the training corpus and a minimum word frequency of 2, what would the vocabulary for the corpus
preprocessed with <UNK> look like?

“<s> I am happy I am learning </s> <s> I am happy I can study </s>”

Type: Multiple Choice, single answer


Options and solution:

1. V = (I, am, happy) (correct: only “I”, “am”, and “happy” appear at least twice)
2. V = (I, am, happy, learning, can, study)
3. V = (I, am, happy, I, am)
4. V = (I, am, happy, learning, can, study, <UNK>)
Smoothing
deeplearning.ai
Outline
● Missing N-grams in corpus

● Smoothing

● Backoff and interpolation


Missing N-grams in training corpus
● Problem: N-grams made of known words might still be missing from the training corpus, e.g. “John” and “eats” each appear in the corpus, but the bigram “John eats” does not
● Their counts cannot be used directly for probability estimation: C(w_(n-1), w_n) can be 0

Smoothing
● Add-one smoothing (Laplacian smoothing):
  P(w_n | w_(n-1)) = (C(w_(n-1), w_n) + 1) / (C(w_(n-1)) + |V|)
● Add-k smoothing:
  P(w_n | w_(n-1)) = (C(w_(n-1), w_n) + k) / (C(w_(n-1)) + k|V|)
● Advanced methods: Kneser-Ney smoothing, Good-Turing smoothing
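A minimal sketch in plain Python of add-k smoothing for bigram probabilities, matching the quiz at the end of this section:

from collections import Counter

corpus = "I am happy I am learning".split()
vocab = set(corpus)  # {'I', 'am', 'happy', 'learning'}, |V| = 4
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def add_k_prob(prev, word, k=1.0):
    # P(word | prev) = (C(prev word) + k) / (C(prev) + k * |V|)
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * len(vocab))

print(add_k_prob("I", "can", k=3))  # (0 + 3) / (2 + 3 * 4) = 3/14 ≈ 0.214
print(add_k_prob("I", "am", k=1))   # (2 + 1) / (2 + 1 * 4) = 0.5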
Backoff
● If the N-gram is missing => use the (N-1)-gram, then the (N-2)-gram, …
  ○ Probability discounting, e.g. Katz backoff
  ○ “Stupid” backoff

Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
E.g. the trigram “John drinks chocolate” does not occur, so back off to the bigram probability P(chocolate | drinks)
Interpolation
● Combine the probability estimates of different N-gram orders with weights λ_i that sum to 1, e.g. for a trigram model:
  P̂(w_n | w_(n-2) w_(n-1)) = λ_1 P(w_n | w_(n-2) w_(n-1)) + λ_2 P(w_n | w_(n-1)) + λ_3 P(w_n), with λ_1 + λ_2 + λ_3 = 1
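A minimal sketch in plain Python of linear interpolation; the component probabilities and the λ weights below are illustrative values, not taken from the course:

def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.7, 0.2, 0.1)):
    # Weighted combination of trigram, bigram, and unigram estimates; weights sum to 1
    l1, l2, l3 = lambdas
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# e.g. P(chocolate | John drinks) with assumed component estimates
print(interpolated_prob(p_trigram=0.0, p_bigram=0.5, p_unigram=0.4))  # 0.7*0 + 0.2*0.5 + 0.1*0.4 = 0.14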
Quiz

Objective: Apply n-gram probability with add-k smoothing for phrase not present in the corpus.

Question:
Corpus: “I am happy I am learning”

In the context of our corpus, what is the estimated probability of the word “can” following the word “I”, using the
bigram model and add-k smoothing with k = 3?

Type: Multiple Choice, single answer


Options and solution:

1. P(can|I) = 0
2. P(can|I) = 1
3. P(can|I) = 3/(2 + 3×4) (correct: (C(I can) + k) / (C(I) + k|V|) = (0 + 3) / (2 + 3×4) = 3/14)
4. P(can|I) = 3/(3×4)


Week
Summary
deeplearning.ai
Summary
● N-Grams and probabilities
● Approximate sentence probabilities with N-grams
● Build a language model from a corpus
● Handle missing information:
  ○ Out of vocabulary words with <UNK>
  ○ Missing N-grams in the corpus with smoothing, backoff, and interpolation
● Evaluate a language model with perplexity
● Coding assignment!
