N-grams: Introduction to NLP

The document discusses n-grams and their use in natural language processing tasks. Key points: an n-gram is a sequence of n words from a text, and n-grams such as bigrams and trigrams are used to estimate probability distributions over word sequences; maximum likelihood estimation is used to calculate n-gram probabilities from a training corpus, but this leads to data sparsity, since not all n-grams will be seen; smoothing techniques such as add-one smoothing address sparsity by adjusting the probabilities of unseen n-grams, and backoff models allow backing off to lower-order n-grams when higher-order n-grams are not seen.

N-grams

N-grams: Motivation
An n-gram is a stretch of text n words long


Corpus-based NLP
Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations


Outline: Motivation, Simple n-grams, Smoothing, Backoff, Spelling


Introduction to NLP

Approximation of language: information in n-grams tells us something about language, but doesn't capture the structure. Efficient: finding and using every, e.g., two-word collocation in a text is quick and easy to do.

We can use corpora to gather probabilities and other information about language use

N-grams can help in a variety of NLP applications:

Autumn 2005

Word prediction = n-grams can be used to aid in predicting the next word of an utterance, based on the previous n-1 words. Useful for context-sensitive spelling correction, approximation of language, ...

We can say that a corpus used to gather prior information is training data. Testing data, by contrast, is the data one uses to test the accuracy of a method. type = distinct word (e.g., like); token = distinct occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)

We can distinguish types and tokens in a corpus
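As a quick illustration of the type/token distinction, here is a minimal Python sketch (not from the slides; the toy corpus string is made up for illustration):

    # Counting word types vs. word tokens in a toy corpus.
    from collections import Counter

    corpus = "the cat sat on the mat because the cat was tired"
    tokens = corpus.split()          # every occurrence counts as a token
    types = set(tokens)              # distinct words only

    print(len(tokens))               # 11 tokens
    print(len(types))                # 8 types
    print(Counter(tokens)["the"])    # the type "the" has 3 tokens here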



Simple n-grams


Unigrams


Bigrams


Let's assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in. What we want to find is the likelihood of w_7 being the next word, given that we've seen w_1, ..., w_6; in other words, P(w_1, ..., w_7). In general, for w_n, we are looking for:
(1) P(w_1, ..., w_n) = P(w_1) P(w_2|w_1) ... P(w_n|w_1, ..., w_{n-1})
But these probabilities are impractical to calculate: they hardly ever occur in a corpus, if at all. (And it would be a lot of data to store, if we could calculate them.)


So, we can approximate these probabilities to a particular n-gram, for a given n. What should n be?


Unigrams (n = 1): (2) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n)

bigrams (n = 2) are a better choice and still easy to calculate:
(4) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
(5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)
And thus, we obtain for the probability of a sentence:
(6) P(w_1, ..., w_n) = P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1})


Easy to calculate, but we have no contextual information: (3) The quick brown fox jumped

We would like to say that over has a higher probability in this context than lazy does.


Markov models


Bigram example


Trigrams


A bigram model is also called a first-order Markov model because it has one element of memory (one token in the past)


What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?
(7) P(The quick brown fox jumped over the lazy dog) = P(The | <start>) P(quick | The) P(brown | quick) ... P(dog | lazy)


If bigrams are good, then trigrams (n = 3) can be even better.



Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities. The states in the FSA are words.

Wider context: P(know | did, he) vs. P(know | he). Generally, trigrams are still short enough that we will have enough data to gather accurate probabilities.

Much more on Markov models when we hit POS tagging ...

Probabilities are generally small, so log probabilities are usually used

Does this favor shorter sentences?
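To make the bigram decomposition in (7) and the use of log probabilities concrete, here is a minimal Python sketch; the bigram probabilities below are invented for illustration, not estimated from any corpus:

    import math

    # Invented bigram probabilities, for illustration only.
    bigram_prob = {
        ("<start>", "The"): 0.2,   ("The", "quick"): 0.05,
        ("quick", "brown"): 0.3,   ("brown", "fox"): 0.4,
        ("fox", "jumped"): 0.25,   ("jumped", "over"): 0.3,
        ("over", "the"): 0.35,     ("the", "lazy"): 0.02,
        ("lazy", "dog"): 0.5,
    }

    def sentence_logprob(words):
        """Sum of log bigram probabilities, as in equation (7)."""
        logp = 0.0
        for prev, curr in zip(["<start>"] + words, words):
            logp += math.log(bigram_prob[(prev, curr)])
        return logp

    sent = "The quick brown fox jumped over the lazy dog".split()
    print(sentence_logprob(sent))   # more negative = less probable

Note that every extra word adds another (negative) log term, which is one way to see why a raw product of probabilities tends to favor shorter sentences.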


Training n-gram models


Know your corpus


Smoothing: Motivation
Let's assume that we have a good corpus and have trained a bigram model on it, i.e., learned MLE probabilities for bigrams


Go through corpus and calculate relative frequencies:
(8) P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})


We mentioned earlier the distinction between training data and testing data ... it's important to remember what your training data is when applying your technology to new data


(9) P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})

This technique of gathering probabilities from a training corpus is called maximum likelihood estimation (MLE)
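As a minimal sketch of MLE bigram estimation via relative frequencies, as in (8) (the toy training sentences are made up for illustration):

    from collections import defaultdict

    # Toy training corpus; each sentence is padded with <start>.
    sentences = [
        "<start> i dreamed i saw the knights".split(),
        "<start> i saw the quick brown fox".split(),
    ]

    bigram_count = defaultdict(int)
    history_count = defaultdict(int)
    for sent in sentences:
        for prev, curr in zip(sent, sent[1:]):
            bigram_count[(prev, curr)] += 1
            history_count[prev] += 1

    def p_mle(curr, prev):
        """P(curr | prev) = C(prev, curr) / C(prev); zero if prev was never a history."""
        if history_count[prev] == 0:
            return 0.0
        return bigram_count[(prev, curr)] / history_count[prev]

    print(p_mle("saw", "i"))           # C(i, saw) / C(i) = 2 / 3
    print(p_mle("split", "lickety"))   # 0.0: the data sparseness problem below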

If you train your trigram model on Shakespeare, then you have learned the probabilities in Shakespeare, not the probabilities of English overall. What corpus you use depends on what you want to do later.

But we won't have seen every possible bigram: lickety split is a possible English bigram, but it may not be in the corpus. This is a problem of data sparseness: there are zero-probability bigrams which are actual possible bigrams in the language.

To account for this sparseness, we turn to smoothing techniques: making zero probabilities non-zero, i.e., adjusting probabilities to account for unseen data.


Add-One Smoothing


Smoothing example


Discounting


One way to smooth is to add a count of one to every bigram:


In order to still be a probability, all probabilities need to sum to one. So, we add the number of word types to the denominator (i.e., we added one to every type of bigram, so we need to account for all our numerator additions).


So, if treasure trove never occurred in the data, but treasure occurred twice, we have:
(11) P(trove | treasure) = (0 + 1) / (2 + V)


An alternate way of viewing smoothing is as discounting



Lowering non-zero counts to get the probability mass we need for the zero count items. The discounting factor can be defined as the ratio of the smoothed count to the MLE count.

The probability won't be very high, but it will be better than 0.

If all the surrounding probabilities are still high, then treasure trove could still be the best pick. If the probability were zero, there would be no chance of it appearing.

Jurafsky and Martin show that add-one smoothing can discount probabilities by a factor of 8! That's way too much ...

(10) P(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)

V = total number of word types that we might see
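A minimal sketch of add-one smoothing as in (10) and (11); the toy counts are made up so that treasure occurs twice as a history and treasure trove is unseen:

    from collections import Counter

    # Toy counts, standing in for counts gathered from a real corpus.
    tokens = "the treasure was buried near the treasure chest".split()
    bigram_count = Counter(zip(tokens, tokens[1:]))
    history_count = Counter(tokens[:-1])   # how often each word starts a bigram
    V = len(set(tokens))                   # number of word types

    def p_add_one(curr, prev):
        """(C(prev, curr) + 1) / (C(prev) + V), as in equation (10)."""
        return (bigram_count[(prev, curr)] + 1) / (history_count[prev] + V)

    print(p_add_one("trove", "treasure"))  # (0 + 1) / (2 + V), as in (11)
    print(p_add_one("was", "treasure"))    # a seen bigram, discounted below its MLE value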


Witten-Bell Discounting


Witten-Bell Discounting formula


(12) zero count bigrams: p*(w_i | w_{i-1}) = T(w_{i-1}) / (Z(w_{i-1}) (N(w_{i-1}) + T(w_{i-1})))


Backoff models: Basic idea


Main idea: Instead of simply adding one to every n-gram, compute the probability of w_{i-1}, w_i by seeing how likely w_{i-1} is at starting any bigram.


Words that begin lots of bigrams lead to higher unseen bigram probabilities. Non-zero bigrams are discounted in essentially the same manner as zero count bigrams; Jurafsky and Martin show that they are only discounted by about a factor of one.

T(w_{i-1}) = the number of bigram types starting with w_{i-1}; as the numerator, it determines how high the value will be.
N(w_{i-1}) = the number of bigram tokens starting with w_{i-1}.
N(w_{i-1}) + T(w_{i-1}) gives us the total number of events to divide by.
Z(w_{i-1}) = the number of bigram types starting with w_{i-1} and having zero count; this just distributes the probability mass between all zero-count bigrams starting with w_{i-1}.
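A rough sketch of how the zero-count case in (12) could be computed for a single history word; the toy token sequence is made up, and the vocabulary is taken to be just the words seen in it:

    from collections import Counter

    tokens = "i saw the dog and the cat and the dog ran".split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)

    def witten_bell_zero(curr, prev):
        """p*(curr | prev) for an unseen bigram, as in equation (12)."""
        assert (prev, curr) not in bigrams       # formula applies to unseen bigrams only
        seen = {w2 for (w1, w2) in bigrams if w1 == prev}
        T = len(seen)                            # bigram types starting with prev
        N = sum(c for (w1, _), c in bigrams.items() if w1 == prev)  # bigram tokens
        Z = len(vocab - seen)                    # zero-count bigram types for prev
        return T / (Z * (N + T))

    print(witten_bell_zero("ran", "the"))   # "the ran" is unseen: T=2, N=3, Z=5 here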


Let's say we're using a trigram model for predicting language, and we haven't seen a particular trigram before.


But maybe we've seen the bigram, or if not, the unigram information would be useful. Backoff models allow one to try the most informative n-gram first and then back off to lower n-grams.


Backoff equations
Roughly speaking, this is how a backoff model works:


Backoff models: example


Let's say we've never seen the trigram maples want more before


Deleted Interpolation


If this trigram has a non-zero count, we use that information

(13) P̂(w_i | w_{i-2} w_{i-1}) = P(w_i | w_{i-2} w_{i-1})

But we have seen want more, so we can use that bigram to calculate a probability estimate. So, we look at P(more | want) ... But we're now assigning probability to P(more | maples, want), which was zero before, so we won't have a true probability model anymore. This is why α1 was used in the previous equations, to assign less weight to the probability.

Deleted interpolation is similar to backing off, except that we always use the bigram and unigram information to calculate the probability estimate


else, if the bigram count is non-zero, we use that bigram information:

(16) P̂(w_i | w_{i-2} w_{i-1}) = λ1 P(w_i | w_{i-2} w_{i-1}) + λ2 P(w_i | w_{i-1}) + λ3 P(w_i)


where the lambdas (λ) all sum to one

(14) P̂(w_i | w_{i-2} w_{i-1}) = α1 P(w_i | w_{i-1})

and in all other cases we just take the unigram information:

Every trigram probability, then, is a composite of the focus word's trigram, bigram, and unigram.

(15) P̂(w_i | w_{i-2} w_{i-1}) = α2 P(w_i)

In general, backoff models have to be combined with discounting models
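A minimal sketch of deleted interpolation as in (16); the probability tables and lambda weights below are invented for illustration (in practice the lambdas are estimated on held-out data):

    # Invented probability tables, for illustration only.
    trigram_p = {}                          # empty: "maples want more" was never seen
    bigram_p = {("want", "more"): 0.1}      # P(more | want)
    unigram_p = {"more": 0.01}              # P(more)

    LAMBDAS = (0.6, 0.3, 0.1)               # lambda1, lambda2, lambda3; must sum to one

    def interpolated(w, u, v):
        """lambda1*P(w | u v) + lambda2*P(w | v) + lambda3*P(w), as in equation (16)."""
        l1, l2, l3 = LAMBDAS
        return (l1 * trigram_p.get((u, v, w), 0.0)
                + l2 * bigram_p.get((v, w), 0.0)
                + l3 * unigram_p.get(w, 0.0))

    # The trigram "maples want more" was never seen, but the bigram and
    # unigram terms still give a non-zero estimate: 0.3*0.1 + 0.1*0.01 = 0.031
    print(interpolated("more", "maples", "want"))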


A note on information theory


Context-sensitive spelling correction


Getting back to the task of spelling correction, we can look at bigrams of words to correct a particular misspelling.


Some very useful notions for n-gram work can be found in information theory. We'll just go over the basic ideas:


Question: Given the previous word, what is the probability of the current word?

entropy = a measure of how much information is needed to encode something
perplexity = a measure of the amount of surprise of an outcome
mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information)
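As a small worked illustration of entropy and perplexity (the next-word distribution is made up, and perplexity is computed here simply as 2 raised to the entropy):

    import math

    # A made-up distribution over the next word, given some context.
    next_word_p = {"reports": 0.5, "cards": 0.25, "dogs": 0.125, "maples": 0.125}

    entropy = -sum(p * math.log2(p) for p in next_word_p.values())
    perplexity = 2 ** entropy

    print(entropy)      # 1.75 bits needed on average to encode the outcome
    print(perplexity)   # ~3.36: roughly the effective number of choices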

e.g., given these, we have a 5% chance of seeing reports and a 0.001% chance of seeing report (these report cards). Thus, we will change report to reports

Generally, we choose the correction which maximizes the probability of the whole sentence. As mentioned, we may hardly ever see these reports, so we won't know the probability of that bigram. Aside from smoothing techniques, another possible solution is to use bigrams of parts of speech.

e.g., What is the probability of a noun given that the previous word was an adjective?
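A minimal sketch of choosing between candidate corrections by word-bigram probability, in the spirit of the these report / these reports example above; the candidate phrases and all probabilities are invented, and a real system would also need a model of the error itself:

    # Invented bigram probabilities, for illustration only.
    bigram_p = {
        ("these", "reports"): 0.05,
        ("these", "report"): 0.00001,
        ("reports", "cards"): 0.02,
        ("report", "cards"): 0.03,
    }

    def phrase_prob(words):
        """Product of bigram probabilities over the phrase."""
        p = 1.0
        for prev, curr in zip(words, words[1:]):
            p *= bigram_p.get((prev, curr), 0.0)
        return p

    candidates = ["these report cards".split(), "these reports cards".split()]
    best = max(candidates, key=phrase_prob)
    print(" ".join(best))   # picks the candidate whose whole phrase is most probable

The same scheme could be run over part-of-speech bigrams instead: replace the word pairs above with tag pairs such as (adjective, noun) and score candidate tag sequences.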