Lecture 4: N-Grams
Introduction
- word prediction
- statistical model
- probability of occurrence
- language models (LMs)
Applications of N-grams
- part-of-speech tagging
- natural language generation
- word similarity
- authorship identification
- sentiment extraction
Counting Words in Corpora
- text corpus or speech
Simple (Unsmoothed) N-grams
- transparent rules
- bigram model, trigram model
Training and Test Sets
N-gram Sensitivity to the Training Corpus
Unknown Words: Open versus Closed Vocabulary Tasks
- out of vocabulary (OOV)
- estimate the probabilities
Evaluating N-grams: Perplexity
- intrinsic/extrinsic evaluation
Smoothing
- Laplace Smoothing, Interpolation
- Backoff
- Kneser-Ney Smoothing
1. Introduction to N-Grams
“Predicting words seems somewhat less fraught”
We formalize this idea of word prediction with probabilistic models called
N-gram models, which predict the next word from the previous N −1 words.
For example:
=> Finding the probability that the word "been" comes next after the words
"I have ...."
=> Please turn your homework ….
- Hopefully, most of you concluded that a very likely word is “in” or possibly
“over” but probably not “the”.
1. Introduction to N-Grams (Cont…)
=> Please turn your homework ….
An N-gram is an N-token sequence of words:
- a 2-gram (more commonly called a bigram) is a two-word sequence of
words like “please turn”, “turn your” or “your homework”.
- a 3-gram (more commonly called a trigram) is a three-word sequence of
words like “please turn your”, or “turn your homework”.
1. Introduction to N-Grams (Cont…)
Such statistical models of word sequences are also called language models
or LMs.
- computing the probability of the next word turns out to be closely related
to computing the probability of a sequence of words.
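As a reference point (the standard chain-rule decomposition, not shown in this text), the probability of a whole sequence w_1 ... w_n can be written as a product of next-word probabilities:

P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2)\cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})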
1. Introduction to N-Grams (Applications)
Estimators like N-grams that assign a conditional probability to possible
next words can be used to assign a joint probability to an entire sentence.
N-grams are essential in any task in which we have to identify words in
noisy, ambiguous input.
In speech recognition, for example, the input speech sounds are very
confusable and many words sound extremely similar; e.g., in the movie
"Take the Money and Run",
- "I have a gun" is far more probable than
- the non-word "I have a gub" or even "I have a gull".
NOTE: Since errors like these produce real words, we can't find them by just flagging
words that aren't in the dictionary.
Instead, we rely on the fact that "in about fifteen minuets" is a much less probable
sequence than "in about fifteen minutes".
1. Introduction to N-Grams (Applications) (Cont…)
Applications of N-grams
N-grams are also crucial in NLP tasks like;
Part-of-speech tagging:
- Determining the role of a word (noun, verb, etc.) in a sentence, given the previous words.
e.g., - This is a screw (noun).
- Please screw (verb) this nut.
Natural language generation:
- Predicting the next word in a sentence.
e.g., - This has to…..
Word similarity:
- Finding similarity between words.
e.g., - minutes and minuet.
as well as in applications from authorship identification and sentiment
extraction to predictive text input systems for cell phones.
2. Counting Words in Corpora
Probabilities are based on counting things.
2. Counting Words in Corpora (Cont…)
Brown corpus: it has 61,805 wordform types,
- tagged with just an 87-tag tagset.
3. Simple (Unsmoothed) N-grams
N-gram Models
3. Simple (Unsmoothed) N-grams (Cont…)
The bigram model, for example,
- approximates the probability of a word given all the previous words by the
conditional probability given only the preceding word.
- In other words, instead of computing the full history-based probability, we use
the approximation shown below.
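In standard N-gram notation (reconstructed here, since the slide's equation is not preserved in this text), the bigram approximation is:

P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})

and, applied to a whole sequence,

P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})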
3. Simple (Unsmoothed) N-grams (Cont…)
Finally, the simplest and most intuitive way to estimate probabilities is
called
- Maximum Likelihood Estimation, or MLE.
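For bigrams, the MLE estimate is simply the relative frequency of the bigram count over the count of the preceding word (standard formula, added here since the slide's equation is not in the text):

P_{\text{MLE}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\,w_n)}{\sum_{w} C(w_{n-1}\,w)} = \frac{C(w_{n-1}\,w_n)}{C(w_{n-1})}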
3. Simple (Unsmoothed) N-grams (Cont…)
For example: applying Maximum Likelihood Estimation, or MLE (worked examples follow in Example-1 and Example-2).
3. Simple (Unsmoothed) N-grams (Example-1)
Let’s work through an example using a mini-corpus of three sentences.
- We’ll first need to augment each sentence with a special symbol <s> at the
beginning of the sentence, to give us the bigram context of the first word. We’ll
also need a special end-symbol </s>.
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this
corpus.
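The slide's worked calculations are not preserved in this text, so here is a minimal Python sketch (my own illustration, not from the lecture) that computes the unsmoothed bigram MLE estimates for this mini-corpus; the values in the comments are what these three sentences yield:

from collections import Counter

# Mini-corpus from the slide, with <s> and </s> already added.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(prev, word):
    # Unsmoothed MLE estimate: P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("<s>", "I"))     # 2/3
print(p_bigram("I", "am"))      # 2/3
print(p_bigram("am", "Sam"))    # 1/2
print(p_bigram("Sam", "</s>"))  # 1/2
print(p_bigram("I", "do"))      # 1/3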
3. Simple (Unsmoothed) N-grams (Example-2)
Suppose the word Chinese occurs 400 times in a corpus of a million words
like the Brown corpus.
What is the probability that a random word selected from some other text of,
say, a million words will be the word Chinese?
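Using the MLE (relative-frequency) estimate, the answer works out as:

P(\text{Chinese}) = \frac{C(\text{Chinese})}{N} = \frac{400}{1{,}000{,}000} = 0.0004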
3. Simple (Unsmoothed) N-grams (Class Participation)
Calculate the simple N-gram probabilities;
(a) with white-space, (b) with punctuation
3. Simple (Unsmoothed) N-grams (Class Participation)
Calculate the simple N-gram probabilities;
3. Simple (Unsmoothed) N-grams (Class Participation)
Calculate the bigram probabilities using a mini-corpus built from the following two sets of sentences:

Phone conversation:
<s> Hello, are you fine </s>
<s> Hello, I am fine </s>
<s> are your fine </s>
<s> I am fine too </s>

Flight timing conversation:
<s> My flight time is 2:00pm </s>
<s> Good, my flight time is also 2:00pm </s>
<s> let's go same time 2:00pm </s>
<s> I will come 2:00pm sharply </s>
4. Training and Test Sets
The probabilities of an N-gram model come from the corpus it is trained
on.
The parameters of a statistical model are trained on some set of data (the
training corpus), and then
- we apply the model to some new data in some task (the test corpus), such
as speech recognition, and see how well it works.
There is a useful metric for how well a given statistical model matches a
test corpus, called perplexity.
- Perplexity is based on computing the probability of each sentence in
the test set.
- The better model is the one that assigns a higher probability to the test set (and therefore has lower perplexity).
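For reference, the standard definition (not spelled out on the slide): for a test set W = w_1 w_2 ... w_N, perplexity is the inverse probability of the test set, normalized by the number of words, so higher probability means lower perplexity:

PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}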
4.1 N-Gram Sensitivity to the Training Corpus
N-grams do a better and better job of modeling the training corpus;
- as we increase the value of N (i.e., unigram, bigram, trigram, ...).
We especially wouldn't choose training and test sets from different genres of
text, like:
- newspaper text, early English fiction, telephone conversations, and web
pages.
For example:
- To build N-grams for text prediction in SMS (Short Message Service), we
need a training corpus of SMS data.
- To build N-grams for business meetings, we would need corpora of
transcribed business meetings.
4.2 Unknown Words: Open versus closed vocabulary tasks
Closed Vocabulary is the assumption that we have a fixed lexicon known in advance, and
- the test set can only contain words from this lexicon.
- The closed vocabulary task thus assumes there are no unknown words.
Words the system has never seen in training (unseen events) are called
unknown words, or out of vocabulary (OOV) words.
- The percentage of OOV words that appear in the test set is called the
OOV rate.
An Open Vocabulary system is one where we model these potential unknown
words in the test set by adding a pseudo-word called <UNK>.
• We can train the probabilities of the unknown word model <UNK>
in the following way (see the steps on the next slide).
Example of an open vocabulary: subject-specific wording, e.g., "i.e." used instead of "For example".
4.2 Unknown Words: Open versus closed vocabulary tasks
(Estimate OOV probability)
1. Choose a vocabulary (word list) which is fixed in advance.
2. In the training set, convert any word that is not in this list (any
OOV word) to the unknown word token <UNK> in a text
normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any
other regular word in the training set.
For example, the fixed vocabulary might be:
- a list of fruits, i.e., apple, grapes, …
- the contents of a course, i.e., regular expressions, parsing, …
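A minimal Python sketch of steps 1-3 (my own illustration; the vocabulary below is a hypothetical fixed word list, not one from the lecture):

# Step 1: a vocabulary (word list) fixed in advance.
vocabulary = {"<s>", "</s>", "apple", "grapes", "i", "like"}

def replace_oov(tokens, vocab):
    # Step 2: map every out-of-vocabulary token to the pseudo-word <UNK>.
    return [tok if tok in vocab else "<UNK>" for tok in tokens]

sentence = "<s> i like mango </s>".split()
print(replace_oov(sentence, vocabulary))
# ['<s>', 'i', 'like', '<UNK>', '</s>']
# Step 3: count <UNK> like any other word when estimating N-gram probabilities.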
5. Evaluating N-grams: Perplexity
The correct way to evaluate the performance of a language model is
- to embed it in an application and measure the total performance of the
application (this end-to-end approach is called extrinsic evaluation);
- perplexity, by contrast, is an intrinsic evaluation metric, computed on a test
corpus without reference to any application.
6. Smoothing
Any corpus is limited, so some perfectly acceptable English word sequences
are bound to be missing from it:
- these unseen sequences are "zero probability N-grams".
What do we do with words that are in our vocabulary (they are not unknown words)
- but appear in the test set in an unseen context (for example, after a word
they never appeared after in training)?
Smoothing addresses this by shaving a little probability mass from seen events
and reassigning it to unseen ones.
6.1 Laplace Smoothing (add-1/ add-k smoothing)
One simple way to do smoothing might be just to take our matrix of
bigram counts,
- before we normalize them into probabilities, and add one to all the
counts.
This algorithm is called Laplace smoothing, Laplace's Law, or add-one
smoothing.
Laplace smoothing merely adds one to each count (hence its alternate
name, add-one smoothing).
For example, the probability of a previously unseen word "fax" will no longer be zero:
P(fax) = (0 + 1) / (N + V), where N is the total number of word tokens and V is the vocabulary size (number of word types).
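For bigram counts, the corresponding add-one estimate (standard formulation, added here for reference) is:

P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\,w_n) + 1}{C(w_{n-1}) + V}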
6.2 Advanced Smoothing Methods: Kneser-Ney Smoothing
A brief introduction to the most commonly used modern N-gram smoothing
method, the interpolated Kneser-Ney algorithm.
Kneser-Ney has its roots in a discounting method called Absolute discounting.
Absolute discounting is a much better method of computing a revised count c∗
than the Good-Turing discount.
Kneser-Ney discounting augments absolute discounting with a more
sophisticated way to handle the backoff distribution.
The Kneser-Ney intuition is to base our estimate on the number of different
contexts word w has appeared in.
For example, consider the word "store" in the sentence "I went to the store":
=> in a typical corpus, "store" has appeared after many different preceding words
("the store", "a store", "grocery store", ...), i.e., in many different contexts.
Words that have appeared in more contexts are more likely to appear in some
new context as well.
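One standard way to formalize this intuition, added here for reference (it is the continuation probability used in interpolated Kneser-Ney, not a formula from the slide), counts the distinct bigram types a word completes:

P_{\text{CONTINUATION}}(w) = \frac{\lvert \{\, w' : C(w'\,w) > 0 \,\} \rvert}{\lvert \{\, (w', w'') : C(w'\,w'') > 0 \,\} \rvert}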
@Copyrights: Natural Language Processing (NLP) Organized by Dr. Ahmad Jalal (http://portals.au.edu.pk/imc/)