N-grams
Introduction to NLP, Autumn 2005

N-grams: Motivation
An n-gram is a stretch of text n words long
Approximation of language: the information in n-grams tells us something about language, but doesn't capture its structure
Efficient: finding and using every, e.g., two-word collocation in a text is quick and easy to do
Word prediction: n-grams can be used to help predict the next word of an utterance, based on the previous n - 1 words
Useful for context-sensitive spelling correction, approximation of language, ...

Corpus-based NLP
Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations
We can use corpora to gather probabilities and other information about language use
We can say that a corpus used to gather prior information is training data
Testing data, by contrast, is the data one uses to test the accuracy of a method
type = a distinct word (e.g., like)
token = a distinct occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)
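The type/token distinction is easy to see in code. Here is a minimal sketch, using a made-up sentence and Python's collections.Counter, that counts tokens (every occurrence) and types (distinct words):

    # Minimal sketch: counting word types vs. tokens in a toy corpus.
    # The sentence below is invented purely for illustration.
    from collections import Counter

    text = "i like nlp and i like corpora"
    tokens = text.split()          # every occurrence of a word
    types = Counter(tokens)        # one entry per distinct word (type)

    print("tokens:", len(tokens))                     # 7
    print("types:", len(types))                       # 5 (i, like, nlp, and, corpora)
    print("tokens of type 'like':", types["like"])    # 2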
Simple n-grams
Let's assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in
What we want to find is the likelihood of w_7 being the next word, given that we've seen w_1, ..., w_6; in other words, P(w_7 | w_1, ..., w_6)
In general, for w_n, we are looking for:
(1) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
But these probabilities are impractical to calculate: such long histories hardly ever occur in a corpus, if at all (and it would be a lot of data to store, if we could calculate them)
So, we can approximate these probabilities using a particular n-gram, for a given n. What should n be?

Unigrams
Unigrams (n = 1) ignore the preceding context entirely:
(2) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n)
Easy to calculate, but we have no contextual information:
(3) The quick brown fox jumped
We would like to say that over has a higher probability in this context than lazy does

Bigrams
Bigrams (n = 2) are a better choice and still easy to calculate:
(4) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
(5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)
And thus, we obtain for the probability of a sentence:
(6) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})
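As a toy illustration of equation (6), the sketch below scores a sentence as a product of bigram probabilities; the probability table is invented for illustration rather than estimated from a real corpus:

    # Sketch of the bigram approximation: the probability of a sentence is the
    # product of P(w_n | w_{n-1}) terms. All numbers here are invented.
    bigram_prob = {
        ("<start>", "the"): 0.20, ("the", "quick"): 0.05, ("quick", "brown"): 0.30,
        ("brown", "fox"): 0.25, ("fox", "jumped"): 0.10, ("jumped", "over"): 0.40,
        ("over", "the"): 0.35, ("the", "lazy"): 0.02, ("lazy", "dog"): 0.50,
    }

    def sentence_prob(words):
        """P(w_1, ..., w_n) ~= P(w_1 | <start>) * P(w_2 | w_1) * ... * P(w_n | w_{n-1})."""
        prob = 1.0
        for prev, cur in zip(["<start>"] + words, words):
            prob *= bigram_prob.get((prev, cur), 0.0)   # unseen bigrams get 0 (see smoothing later)
        return prob

    print(sentence_prob("the quick brown fox jumped over the lazy dog".split()))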
Markov models
A bigram model is also called a first-order Markov model, because it has one element of memory (one token in the past)
Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities
The states in the FSA are words

Bigram example
What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?
(7) P(The quick brown fox jumped over the lazy dog) = P(The | <start>) P(quick | The) P(brown | quick) ... P(dog | lazy)

Trigrams
Trigrams (n = 3) condition on the two previous words: P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-2}, w_{n-1})
Wider context: P(know | did, he) vs. P(know | he)
Generally, trigrams are still short enough that we will have enough data to gather accurate probabilities
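To make the "wider context" point concrete, this sketch compares a bigram and a trigram estimate for the did he know example; the counts are invented, as if gathered from some training corpus:

    # Sketch contrasting bigram and trigram context with invented counts:
    # wider context can change the estimate noticeably.
    from collections import Counter

    bigram_counts = Counter({("he", "know"): 30, ("he", "said"): 120, ("he", "was"): 150})
    trigram_counts = Counter({("did", "he", "know"): 25, ("did", "he", "say"): 40})

    p_bigram = bigram_counts[("he", "know")] / sum(
        c for (w1, _), c in bigram_counts.items() if w1 == "he")
    p_trigram = trigram_counts[("did", "he", "know")] / sum(
        c for (w1, w2, _), c in trigram_counts.items() if (w1, w2) == ("did", "he"))

    print(f"P(know | he)      = {p_bigram:.2f}")    # 30/300 = 0.10
    print(f"P(know | did, he) = {p_trigram:.2f}")   # 25/65  = 0.38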
This technique of gathering probabilities from a training corpus is called maximum likelihood estimation (MLE)
We mentioned earlier the distinction between training data and testing data; it is important to remember what your training data is when applying your technology to new data
If you train your trigram model on Shakespeare, then you have learned the probabilities in Shakespeare, not the probabilities of English overall
What corpus you use depends on what you want to do later

Smoothing: Motivation
Let's assume that we have a good corpus and have trained a bigram model on it, i.e., learned MLE probabilities for bigrams
But we won't have seen every possible bigram: lickety split is a possible English bigram, but it may not be in the corpus
This is a problem of data sparseness: there are zero-probability bigrams which are actually possible bigrams of the language
To account for this sparseness, we turn to smoothing techniques: making zero probabilities non-zero, i.e., adjusting probabilities to account for unseen data
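Here is a minimal sketch of MLE bigram estimation on a made-up toy corpus; note how a perfectly reasonable bigram that happens not to occur in the training data gets probability zero, which is exactly the sparseness problem smoothing addresses:

    # Minimal sketch of maximum likelihood estimation for bigrams:
    # P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
    # The tiny "training corpus" is invented; a real one would be far larger.
    from collections import Counter

    train = "the cat sat on the mat . the cat ate .".split()
    unigram_counts = Counter(train)
    bigram_counts = Counter(zip(train, train[1:]))

    def mle(prev, word):
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(mle("the", "cat"))   # 2/3: "the" occurs 3 times, "the cat" twice
    print(mle("the", "dog"))   # 0.0: a possible bigram, but unseen (data sparseness)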
Add-One Smoothing
A simple smoothing technique is add-one (Laplace) smoothing: add one to the count of every bigram type
In order for the result to still be a probability, all probabilities need to sum to one, so we add the number of word types V to the denominator (i.e., we added one to every type of bigram, so we need to account for all our numerator additions)
(10) P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

Smoothing example
So, if treasure trove never occurred in the data, but treasure occurred twice, we have:
(11) P(trove | treasure) = (0 + 1) / (2 + V)
If all the surrounding probabilities are still high, then treasure trove could still be the best pick; if the probability were zero, there would be no chance of it appearing

Discounting
Smoothing works by lowering the non-zero counts to get the probability mass we need for the zero-count items
The discounting factor can be defined as the ratio of the smoothed count to the MLE count
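Below is a minimal sketch of equation (10), using counts that mirror the treasure/trove example; V is an assumed vocabulary size, and the second call shows how a bigram that was actually seen gets discounted relative to its MLE estimate:

    # Sketch of add-one (Laplace) smoothing for bigrams, following equation (10):
    # P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V).
    def add_one(bigram_count, prev_count, vocab_size):
        return (bigram_count + 1) / (prev_count + vocab_size)

    V = 10_000                  # assumed number of word types, for illustration
    print(add_one(0, 2, V))     # P(trove | treasure) = (0 + 1) / (2 + V): small but non-zero
    print(add_one(2, 2, V))     # a bigram seen twice is discounted well below its MLE of 2/2 = 1.0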
Witten-Bell Discounting
Main idea: instead of simply adding one to every n-gram, compute the probability of w_{i-1} w_i by seeing how likely w_{i-1} is to start any bigram
Words that begin lots of bigrams lead to higher unseen-bigram probabilities
Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams; Jurafsky and Martin show that they are only discounted by about a factor of one
T(w_{i-1}) = the number of bigram types starting with w_{i-1}; as the numerator, it determines how high the value will be
N(w_{i-1}) = the number of bigram tokens starting with w_{i-1}
N(w_{i-1}) + T(w_{i-1}) gives us the total number of events to divide by
Z(w_{i-1}) = the number of bigram types starting with w_{i-1} that have zero count; this just distributes the probability mass among all zero-count bigrams starting with w_{i-1}
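The quantities above combine into two estimates: one shared out among the unseen bigrams starting with w_{i-1}, and a discounted one for the bigrams that were actually seen. Here is a minimal sketch with invented values for T, N, and Z:

    # Sketch of Witten-Bell estimates for bigrams starting with a given word.
    # T = bigram types starting with the word, N = bigram tokens starting with it,
    # Z = zero-count bigram types starting with it. All values are invented.
    def unseen_bigram_prob(T, N, Z):
        """Probability assigned to ONE unseen bigram starting with the word."""
        return T / (Z * (N + T))

    def seen_bigram_prob(count, T, N):
        """Discounted probability of a bigram observed `count` times."""
        return count / (N + T)

    T, N, Z = 50, 200, 9_950    # e.g., 50 distinct continuations over 200 tokens
    print(unseen_bigram_prob(T, N, Z))   # tiny, but non-zero
    print(seen_bigram_prob(10, T, N))    # 10/250 = 0.04, slightly below the MLE of 10/200 = 0.05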
Backoff models
Let's say we're using a trigram model for predicting language, and we haven't seen a particular trigram before
But maybe we've seen the bigram, or if not, the unigram information would be useful
Backoff models allow one to try the most informative n-gram first and then back off to lower n-grams
Backoff equations
Roughly speaking, this is how a backoff model works, with α weights re-distributing the probability mass:
    P(w_n | w_{n-2}, w_{n-1})    if the trigram w_{n-2} w_{n-1} w_n was seen
    α_1 P(w_n | w_{n-1})         otherwise, if the bigram w_{n-1} w_n was seen
    α_2 P(w_n)                   otherwise
Say we have never seen the trigram maples want more, but we have seen the bigram want more, so we can use that bigram to calculate a probability estimate; we look at P(more | want)
But we are now assigning probability to P(more | maples, want), which was zero before, so we won't have a true probability model anymore
This is why α_1 was used in the previous equations: to assign less weight to this re-distributed probability

Deleted Interpolation
Deleted interpolation is similar to backing off, except that we always use the bigram and unigram information to calculate the probability estimate
Every trigram probability, then, is a composite of the focus word's trigram, bigram, and unigram probabilities
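As a minimal sketch of deleted interpolation, the function below mixes trigram, bigram, and unigram estimates with λ weights that sum to one; the weights and probabilities are invented here, and in practice the λs are typically tuned on held-out data:

    # Sketch of deleted interpolation: every trigram estimate is a weighted
    # combination of trigram, bigram, and unigram probabilities.
    def interpolated(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        l3, l2, l1 = lambdas
        return l3 * p_tri + l2 * p_bi + l1 * p_uni

    # P(more | maples, want): the trigram was unseen, but the bigram and
    # unigram estimates still contribute (all values invented).
    print(interpolated(p_tri=0.0, p_bi=0.25, p_uni=0.01))   # 0.076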
Information theory
Some very useful notions for n-gram work can be found in information theory; we'll just go over the basic ideas:
entropy = a measure of how much information is needed to encode something
perplexity = a measure of the amount of surprise of an outcome
mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information)

Spelling correction
Question: given the previous word, what is the probability of the current word?
e.g., given these, we have a 5% chance of seeing reports and a 0.001% chance of seeing report (these report cards); thus, we will change report to reports
Generally, we choose the correction which maximizes the probability of the whole sentence
As mentioned, we may hardly ever see these reports, so we won't know the probability of that bigram
Aside from smoothing techniques, another possible solution is to use bigrams of parts of speech
e.g., what is the probability of a noun given that the previous word was an adjective?
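As a closing sketch, here is the "choose the correction that maximizes the sentence probability" idea with invented bigram probabilities mirroring the these report(s) example:

    # Sketch of bigram-based context-sensitive spelling correction: among the
    # candidate words, pick the one that maximizes the surrounding bigram
    # probabilities. The probability values are invented for illustration.
    bigram_prob = {
        ("these", "reports"): 0.05, ("these", "report"): 0.00001,
        ("reports", "cards"): 0.001, ("report", "cards"): 0.002,
    }

    def best_candidate(prev_word, candidates, next_word):
        def score(w):
            return bigram_prob.get((prev_word, w), 0.0) * bigram_prob.get((w, next_word), 0.0)
        return max(candidates, key=score)

    print(best_candidate("these", ["report", "reports"], "cards"))   # 'reports'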