N-grams
Introduction to NLP, Autumn 2005

N-grams: Motivation
An n-gram is a stretch of text n words long
Approximation of language: the information in n-grams tells us something about language, but doesn't capture its structure
Efficient: finding and using every, e.g., two-word collocation in a text is quick and easy to do
Word prediction: n-grams can be used to help predict the next word of an utterance, based on the previous n - 1 words
Useful for context-sensitive spelling correction, approximation of language, ...

Corpus-based NLP
Corpus (pl. corpora) = a computer-readable collection of text and/or speech, often with annotations
We can use corpora to gather probabilities and other information about language use
We can say that a corpus used to gather prior information is training data
Testing data, by contrast, is the data one uses to test the accuracy of a method
type = a distinct word (e.g., like)
token = a distinct occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)
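The type/token distinction is easy to see in code. Here is a minimal sketch, using a made-up sentence and Python's collections.Counter, that counts tokens (every occurrence) and types (distinct words):

    # Minimal sketch: counting word types vs. tokens in a toy corpus.
    # The sentence below is invented purely for illustration.
    from collections import Counter

    text = "i like nlp and i like corpora"
    tokens = text.split()          # every occurrence of a word
    types = Counter(tokens)        # one entry per distinct word (type)

    print("tokens:", len(tokens))                     # 7
    print("types:", len(types))                       # 5 (i, like, nlp, and, corpora)
    print("tokens of type 'like':", types["like"])    # 2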
Simple n-grams
Let's assume we want to predict the next word, based on the previous context of I dreamed I saw the knights in
What we want to find is the likelihood of w_7 being the next word, given that we've seen w_1, ..., w_6; in other words, P(w_7 | w_1, ..., w_6)
In general, for w_n, we are looking for:
(1) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
But these probabilities are impractical to calculate: such long histories hardly ever occur in a corpus, if at all (and it would be a lot of data to store, if we could calculate them)
So, we can approximate these probabilities using a particular n-gram, for a given n. What should n be?

Unigrams
Unigrams (n = 1) ignore the preceding context entirely:
(2) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n)
Easy to calculate, but we have no contextual information:
(3) The quick brown fox jumped
We would like to say that over has a higher probability in this context than lazy does

Bigrams
Bigrams (n = 2) are a better choice and still easy to calculate:
(4) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
(5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)
And thus, we obtain for the probability of a sentence:
(6) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})
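As a toy illustration of equation (6), the sketch below scores a sentence as a product of bigram probabilities; the probability table is invented for illustration rather than estimated from a real corpus:

    # Sketch of the bigram approximation: the probability of a sentence is the
    # product of P(w_n | w_{n-1}) terms. All numbers here are invented.
    bigram_prob = {
        ("<start>", "the"): 0.20, ("the", "quick"): 0.05, ("quick", "brown"): 0.30,
        ("brown", "fox"): 0.25, ("fox", "jumped"): 0.10, ("jumped", "over"): 0.40,
        ("over", "the"): 0.35, ("the", "lazy"): 0.02, ("lazy", "dog"): 0.50,
    }

    def sentence_prob(words):
        """P(w_1, ..., w_n) ~= P(w_1 | <start>) * P(w_2 | w_1) * ... * P(w_n | w_{n-1})."""
        prob = 1.0
        for prev, cur in zip(["<start>"] + words, words):
            prob *= bigram_prob.get((prev, cur), 0.0)   # unseen bigrams get 0 (see smoothing later)
        return prob

    print(sentence_prob("the quick brown fox jumped over the lazy dog".split()))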
Markov models
A bigram model is also called a first-order Markov model, because it has one element of memory (one token in the past)
Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities
The states in the FSA are words

Bigram example
What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?
(7) P(The quick brown fox jumped over the lazy dog) = P(The | <start>) P(quick | The) P(brown | quick) ... P(dog | lazy)

Trigrams
Trigrams (n = 3) condition on the two previous words: P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-2}, w_{n-1})
Wider context: P(know | did, he) vs. P(know | he)
Generally, trigrams are still short enough that we will have enough data to gather accurate probabilities
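To make the "wider context" point concrete, this sketch compares a bigram and a trigram estimate for the did he know example; the counts are invented, as if gathered from some training corpus:

    # Sketch contrasting bigram and trigram context with invented counts:
    # wider context can change the estimate noticeably.
    from collections import Counter

    bigram_counts = Counter({("he", "know"): 30, ("he", "said"): 120, ("he", "was"): 150})
    trigram_counts = Counter({("did", "he", "know"): 25, ("did", "he", "say"): 40})

    p_bigram = bigram_counts[("he", "know")] / sum(
        c for (w1, _), c in bigram_counts.items() if w1 == "he")
    p_trigram = trigram_counts[("did", "he", "know")] / sum(
        c for (w1, w2, _), c in trigram_counts.items() if (w1, w2) == ("did", "he"))

    print(f"P(know | he)      = {p_bigram:.2f}")    # 30/300 = 0.10
    print(f"P(know | did, he) = {p_trigram:.2f}")   # 25/65  = 0.38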
This technique of gathering probabilities from a training corpus is called maximum likelihood estimation (MLE)
We mentioned earlier the distinction between training data and testing data; it is important to remember what your training data is when applying your technology to new data
If you train your trigram model on Shakespeare, then you have learned the probabilities in Shakespeare, not the probabilities of English overall
What corpus you use depends on what you want to do later

Smoothing: Motivation
Let's assume that we have a good corpus and have trained a bigram model on it, i.e., learned MLE probabilities for bigrams
But we won't have seen every possible bigram: lickety split is a possible English bigram, but it may not be in the corpus
This is a problem of data sparseness: there are zero-probability bigrams which are actually possible bigrams of the language
To account for this sparseness, we turn to smoothing techniques: making zero probabilities non-zero, i.e., adjusting probabilities to account for unseen data
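Here is a minimal sketch of MLE bigram estimation on a made-up toy corpus; note how a perfectly reasonable bigram that happens not to occur in the training data gets probability zero, which is exactly the sparseness problem smoothing addresses:

    # Minimal sketch of maximum likelihood estimation for bigrams:
    # P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
    # The tiny "training corpus" is invented; a real one would be far larger.
    from collections import Counter

    train = "the cat sat on the mat . the cat ate .".split()
    unigram_counts = Counter(train)
    bigram_counts = Counter(zip(train, train[1:]))

    def mle(prev, word):
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(mle("the", "cat"))   # 2/3: "the" occurs 3 times, "the cat" twice
    print(mle("the", "dog"))   # 0.0: a possible bigram, but unseen (data sparseness)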
Add-One Smoothing
A simple smoothing technique is add-one (Laplace) smoothing: add one to the count of every bigram type
In order for the result to still be a probability, all probabilities need to sum to one, so we add the number of word types V to the denominator (i.e., we added one to every type of bigram, so we need to account for all our numerator additions)
(10) P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

Smoothing example
So, if treasure trove never occurred in the data, but treasure occurred twice, we have:
(11) P(trove | treasure) = (0 + 1) / (2 + V)
If all the surrounding probabilities are still high, then treasure trove could still be the best pick; if the probability were zero, there would be no chance of it appearing

Discounting
Smoothing works by lowering the non-zero counts to get the probability mass we need for the zero-count items
The discounting factor can be defined as the ratio of the smoothed count to the MLE count
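Below is a minimal sketch of equation (10), using counts that mirror the treasure/trove example; V is an assumed vocabulary size, and the second call shows how a bigram that was actually seen gets discounted relative to its MLE estimate:

    # Sketch of add-one (Laplace) smoothing for bigrams, following equation (10):
    # P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V).
    def add_one(bigram_count, prev_count, vocab_size):
        return (bigram_count + 1) / (prev_count + vocab_size)

    V = 10_000                  # assumed number of word types, for illustration
    print(add_one(0, 2, V))     # P(trove | treasure) = (0 + 1) / (2 + V): small but non-zero
    print(add_one(2, 2, V))     # a bigram seen twice is discounted well below its MLE of 2/2 = 1.0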
Witten-Bell Discounting
Main idea: instead of simply adding one to every n-gram, compute the probability of w_{i-1} w_i by seeing how likely w_{i-1} is to start any bigram
Words that begin lots of bigrams lead to higher unseen-bigram probabilities
Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams; Jurafsky and Martin show that they are only discounted by about a factor of one
T(w_{i-1}) = the number of bigram types starting with w_{i-1}; as the numerator, it determines how high the value will be
N(w_{i-1}) = the number of bigram tokens starting with w_{i-1}
N(w_{i-1}) + T(w_{i-1}) gives us the total number of events to divide by
Z(w_{i-1}) = the number of bigram types starting with w_{i-1} that have zero count; this just distributes the probability mass among all zero-count bigrams starting with w_{i-1}
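The quantities above combine into two estimates: one shared out among the unseen bigrams starting with w_{i-1}, and a discounted one for the bigrams that were actually seen. Here is a minimal sketch with invented values for T, N, and Z:

    # Sketch of Witten-Bell estimates for bigrams starting with a given word.
    # T = bigram types starting with the word, N = bigram tokens starting with it,
    # Z = zero-count bigram types starting with it. All values are invented.
    def unseen_bigram_prob(T, N, Z):
        """Probability assigned to ONE unseen bigram starting with the word."""
        return T / (Z * (N + T))

    def seen_bigram_prob(count, T, N):
        """Discounted probability of a bigram observed `count` times."""
        return count / (N + T)

    T, N, Z = 50, 200, 9_950    # e.g., 50 distinct continuations over 200 tokens
    print(unseen_bigram_prob(T, N, Z))   # tiny, but non-zero
    print(seen_bigram_prob(10, T, N))    # 10/250 = 0.04, slightly below the MLE of 10/200 = 0.05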
Backoff models
Let's say we're using a trigram model for predicting language, and we haven't seen a particular trigram before
But maybe we've seen the bigram, or if not, the unigram information would be useful
Backoff models allow one to try the most informative n-gram first and then back off to lower n-grams
Backoff equations
Roughly speaking, this is how a backoff model works, with α weights re-distributing the probability mass:
    P(w_n | w_{n-2}, w_{n-1})    if the trigram w_{n-2} w_{n-1} w_n was seen
    α_1 P(w_n | w_{n-1})         otherwise, if the bigram w_{n-1} w_n was seen
    α_2 P(w_n)                   otherwise
Say we have never seen the trigram maples want more, but we have seen the bigram want more, so we can use that bigram to calculate a probability estimate; we look at P(more | want)
But we are now assigning probability to P(more | maples, want), which was zero before, so we won't have a true probability model anymore
This is why α_1 was used in the previous equations: to assign less weight to this re-distributed probability

Deleted Interpolation
Deleted interpolation is similar to backing off, except that we always use the bigram and unigram information to calculate the probability estimate
Every trigram probability, then, is a composite of the focus word's trigram, bigram, and unigram probabilities
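As a minimal sketch of deleted interpolation, the function below mixes trigram, bigram, and unigram estimates with λ weights that sum to one; the weights and probabilities are invented here, and in practice the λs are typically tuned on held-out data:

    # Sketch of deleted interpolation: every trigram estimate is a weighted
    # combination of trigram, bigram, and unigram probabilities.
    def interpolated(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        l3, l2, l1 = lambdas
        return l3 * p_tri + l2 * p_bi + l1 * p_uni

    # P(more | maples, want): the trigram was unseen, but the bigram and
    # unigram estimates still contribute (all values invented).
    print(interpolated(p_tri=0.0, p_bi=0.25, p_uni=0.01))   # 0.076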
Information theory
Some very useful notions for n-gram work can be found in information theory; we'll just go over the basic ideas:
entropy = a measure of how much information is needed to encode something
perplexity = a measure of the amount of surprise of an outcome
mutual information = the amount of information one item has about another item (e.g., collocations have high mutual information)

Spelling correction
Question: given the previous word, what is the probability of the current word?
e.g., given these, we have a 5% chance of seeing reports and a 0.001% chance of seeing report (these report cards); thus, we will change report to reports
Generally, we choose the correction which maximizes the probability of the whole sentence
As mentioned, we may hardly ever see these reports, so we won't know the probability of that bigram
Aside from smoothing techniques, another possible solution is to use bigrams of parts of speech
e.g., what is the probability of a noun given that the previous word was an adjective?
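As a closing sketch, here is the "choose the correction that maximizes the sentence probability" idea with invented bigram probabilities mirroring the these report(s) example:

    # Sketch of bigram-based context-sensitive spelling correction: among the
    # candidate words, pick the one that maximizes the surrounding bigram
    # probabilities. The probability values are invented for illustration.
    bigram_prob = {
        ("these", "reports"): 0.05, ("these", "report"): 0.00001,
        ("reports", "cards"): 0.001, ("report", "cards"): 0.002,
    }

    def best_candidate(prev_word, candidates, next_word):
        def score(w):
            return bigram_prob.get((prev_word, w), 0.0) * bigram_prob.get((w, next_word), 0.0)
        return max(candidates, key=score)

    print(best_candidate("these", ["report", "reports"], "cards"))   # 'reports'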