
L3 Language Models

This document provides an introduction to probabilities in natural language processing (NLP). It discusses key probability concepts like sample spaces, events, joint probabilities, and conditional probabilities. It also explains how probabilities are used in NLP tasks like speech recognition, machine translation, and language modeling. Probabilities allow evidence from multiple sources to be combined systematically. N-gram language models predict word probabilities based on the previous n words.


NLP

Introduction to NLP

123.
Probabilities
Probabilities in NLP
• Very important for language processing
– “Let’s meet in the conference …”
• Example in speech recognition:
– “recognize speech” vs “wreck a nice beach”
• Example in machine translation:
– “l’avocat general”: “the attorney general” vs. “the general avocado”
• Example in information retrieval:
– If a document includes three occurrences of “stir” and one of “rice”, what is the
probability that it is a recipe?
• Probabilities make it possible to combine evidence from multiple
sources in a systematic way
Probabilities
• Probability theory
– predicting how likely it is that something will happen
• Experiment (trial)
– e.g., throwing a coin
• Possible outcomes
– heads or tails
• Sample spaces
– discrete (number of occurrences of “rice”) or continuous (e.g., temperature)
• Events
– Ω is the certain event
– ∅ is the impossible event
– event space - all possible events
Sample Space
• Random experiment: an experiment with uncertain outcome
– e.g., flipping a coin, picking a word from text
• Sample space: all possible outcomes, e.g.,
– Tossing 2 fair coins, Ω = {HH, HT, TH, TT}
Events
• Event: a subspace of the sample space
– E ⊆ Ω; E happens iff the outcome is in E, e.g.,
• E={HH} (all heads)
• E={HH,TT} (same face)
– Impossible event (∅)
– Certain event (Ω)
• Probability of an event E: 0 ≤ P(E) ≤ 1, s.t.
– P(Ω)=1 (the outcome is always in Ω)
– P(A ∪ B) = P(A) + P(B), if A ∩ B = ∅ (e.g., A=same face, B=different face)
Example: Tossing a Die
• Sample space: Ω = {1,2,3,4,5,6}
• Fair die:
– p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die example: p(1) = 0.3, p(2) = 0.2, ...
• N-dimensional die:
– Ω = {1, 2, 3, 4, …, N}
• Example in modeling text:
– Toss a die to decide which word to write in the next position
– Ω = {cat, dog, tiger, …}
Example: Flipping a Coin
• Ω: {Head, Tail}
• Fair coin:
– p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
– p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
– Sample space: {HH, HT, TH, TT}
• Example in modeling text:
– Flip a coin to decide whether or not to include a word in a document
– Sample space = {appear, absence}
Probabilities
• Probabilities
– numbers between 0 and 1
• Probability distribution
– distributes a probability mass of 1 throughout the sample space Ω.
• Example:
– A fair coin is tossed three times.
– What is the probability of 3 heads?
– What is the probability of 2 heads?
Meaning of probabilities

• Frequentist
– I threw the coin 10 times and it turned up heads 5 times
• Subjective
– I am willing to bet 50 cents on heads
Probabilities
• Joint probability: P(A ∩ B), also written as P(A, B)
• Conditional probability:
P(B|A) = P(A ∩ B)/P(A)
P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule)
For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)

• Total probability:
If A1, …, An form a partition of S, then
P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?)
So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1) + … + P(B|An)P(An)]
This allows us to compute P(Ai|B) based on P(B|Ai)
Conditional Probability
• Prior and posterior probability
• Conditional probability of A given B:
P(A|B) = P(A ∩ B) / P(B)
Properties of Probabilities

• P(∅) = 0
• P(certain event) = 1
• P(X) ≤ P(Y), if X ⊆ Y
• P(X ∪ Y) = P(X) + P(Y), if X ∩ Y = ∅
Conditional Probability

• Six-sided fair die


P(D even)=?
P(D>=4)=?
P(D even|D>=4)=?
P(D odd|D>=4)=?
• Multiple conditions
P(D odd|D>=4, D<=5)=?
Answers

• Six-sided fair die


P(D even)=3/6=1/2
P(D>=4)=3/6=1/2
P(D even|D>=4)=2/3
P(D odd|D>=4)=1/3
• Multiple conditions
P(D odd|D>=4, D<=5)=1/2
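These values can be checked by brute-force enumeration over the six equally likely outcomes; a minimal sketch (function names are illustrative):

```python
from fractions import Fraction

outcomes = range(1, 7)  # the six equally likely faces of a fair die

def prob(event, given=lambda d: True):
    """P(event | given) by counting equally likely outcomes."""
    conditioned = [d for d in outcomes if given(d)]
    favorable = [d for d in conditioned if event(d)]
    return Fraction(len(favorable), len(conditioned))

print(prob(lambda d: d % 2 == 0))                              # P(D even) = 1/2
print(prob(lambda d: d >= 4))                                  # P(D >= 4) = 1/2
print(prob(lambda d: d % 2 == 0, lambda d: d >= 4))            # P(D even | D >= 4) = 2/3
print(prob(lambda d: d % 2 == 1, lambda d: 4 <= d <= 5))       # P(D odd | D >= 4, D <= 5) = 1/2
```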
The Chain Rule
• P(w1,w2,w3…wn) = ?
• Using the chain rule:
– P(w1,w2,w3…wn) =P(w1) P(w2|w1) P(w3|w1,w2)… P(wn|w1,w2…wn-1)
• This rule is used in many ways in statistical NLP,
more specifically in Markov Models
Probability Review

[slide from Brendan O’Connor]


Probability Review 2/2

[slide from Brendan O’Connor]


NLP
Introduction to NLP

125.
Bayes Theorem
Bayes Theorem
• Formula for joint probability
p(A,B) = p(B|A)p(A)
p(A,B) = p(A|B)p(B)
• Therefore
p(B|A) = p(A|B)p(B)/p(A)
• Bayes’ theorem is used to calculate P(A|B) given
P(B|A)
Example
• Diagnostic test
• Test accuracy
p(positive | disease) = 0.05 – false positive
p(negative | disease) = 0.05 – false negative
So: p(positive | disease) = 1-0.05 = 0.95
Same for p(negative | disease)
In general the rates of false positives and false negatives
can be different
Example

• Diagnostic test with errors

P(A|B)                A = TEST
                      Positive   Negative
B = DISEASE   Yes     0.95       0.05
              No      0.05       0.95
Example
• What is p(disease | positive)?
– P(disease|positive) = P(positive|disease) × P(disease) / P(positive)
– P(¬disease|positive) = P(positive|¬disease) × P(¬disease) / P(positive)
– P(disease|positive) / P(¬disease|positive) = ?

• We don’t really care about p(positive)
– as long as it is not zero, we can divide both sides by this quantity
Example
• P(disease|positive) / P(¬disease|positive) =
(P(positive|disease) × P(disease)) / (P(positive|¬disease) × P(¬disease))
• Suppose P(disease) = 0.001
– so P(¬disease) = 0.999
• P(disease|positive) / P(¬disease|positive) = (0.95 × 0.001)/(0.05 × 0.999) ≈ 0.019
• P(disease|positive) + P(¬disease|positive) = 1
• P(disease|positive) ≈ 0.02
• Notes
– P(disease) is called the prior probability
– P(disease|positive) is called the posterior probability
– In this example the posterior is 20 times larger than the prior
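The same arithmetic as a short sketch (variable names are illustrative):

```python
# Bayes' rule for the diagnostic-test example above.
p_disease = 0.001                  # prior P(disease)
p_pos_given_disease = 0.95         # P(positive | disease)
p_pos_given_no_disease = 0.05      # P(positive | no disease), the false-positive rate

# Odds form used on the slide: P(disease|positive) / P(no disease|positive)
odds = (p_pos_given_disease * p_disease) / (p_pos_given_no_disease * (1 - p_disease))
print(round(odds, 3))              # ~0.019

# Direct form, using the total probability of a positive test
p_positive = p_pos_given_disease * p_disease + p_pos_given_no_disease * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 4))   # ~0.0187, i.e. roughly 0.02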
Example: An Unfair Die
• It’s more likely to get a 6 and less likely to get a 1
– p(6) > p(1)
– How likely?
• What if you toss the die 1000 times,
and observe “6” 501 times, “1” 108 times?
– p(6) = 501/1000 = 0.501
– p(1) = 108/1000 = 0.108
– As simple as counting, but principled – maximum likelihood estimate

[slide from Qiaozhu Mei]


What if the Die has More Faces?
• Suitable to represent documents
• Every face corresponds to a word in vocabulary
• The author tosses a die to write a word
• Apparently, an unfair die

[slide from Qiaozhu Mei]


NLP
Introduction to NLP

211.
Language Models
Probabilistic Language Models
• Assign a probability to a sentence
– P(S) = P(w1,w2,w3,...,wn)
• The sum of the probabilities of all possible sentences must be 1.
• Predicting the next word
– Let’s meet in Times …
– General Electric has lost some market …
• Formula
– P(wn|w1,w2,...,wn-1)
Predicting the Next Word
• What word follows “your”? http://norvig.com/ngrams/count_2w.txt

your abilities 160848


your active 140797
your ability 1116122
your activities 226183
your ablum 112926
your activity 156213
your academic 274761
your actual 302488
your acceptance 783544
your ad 1450485
your access 492555
your address 1611337
your accommodation 320408
your admin 117943
your account 8149940
your ads 264771
your accounting 128409
your advantage 242238
your accounts 257118
your adventure 109658
your action 121057
your advert 101178
your actions 492448
your advertisement 172783
your activation 459379
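As a sketch, a handful of the counts above can be turned into conditional probabilities P(w | "your"); the probabilities here are relative to this small subset only, for illustration:

```python
# A few of the "your <w>" counts listed above (subset only).
counts = {
    "account": 8149940,
    "address": 1611337,
    "ad": 1450485,
    "ability": 1116122,
    "acceptance": 783544,
}
total = sum(counts.values())
p_next = {w: c / total for w, c in counts.items()}   # P(w | "your") within this subset
print(max(p_next, key=p_next.get), round(p_next["account"], 3))  # "account" is the most likely continuation
```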
Uses of Language Models
• Speech recognition
– P(“recognize speech”) > P(“wreck a nice beach”)
• Text generation
– P(“three houses”) > P(“three house”)
• Spelling correction
– P(“my cat eats fish”) > P(“my xat eats fish”)
• Machine translation
– P(“the blue house”) > P(“the house blue”)
• Other uses
– OCR
– Summarization
– Document classification
• Usually coupled with a translation model (later)
Probability of a Sentence
• How to compute the probability of a sentence?
– What if the sentence is novel?
• What we need to estimate:
• P(S)=P(w1,w2,w3…wn)
• Using the chain rule:
• P(S)= P(w1) P(w2|w1) P(w3|w1,w2)… P(wn|w1,w2…wn-1)
• Example:
• P(“I would like the pepperoni and spinach pizza”)=?
N-gram Models
• Predict the probability of a word based on the words before:
– P(square|Let’s meet in Times)
• Markov assumption
– Only look at limited history
• N-gram models
– Unigram – no context: P(square)
– Bigram: P(square|Times)
– Trigram: P(square|in Times)
• It is possible to go to 3,4,5-grams
– Longer n-grams suffer from sparseness
• Used for predicting the next word and also for random text generation
Approximating Shakespeare
Approximating the Wall Street Journal
N-Grams
• Shakespeare unigrams
– 29,524 types, approx. 900K tokens
• Bigrams
– 346,097 types, approx. 900K tokens
– How many bigrams are never seen in the data?
• Notice!
– very sparse data!
Google 1-T Corpus
• 1 trillion word tokens
• Number of tokens
– 1,024,908,267,229
• Number of sentences
– 95,119,665,584
• Number of unigrams
– 13,588,391
• Number of bigrams
– 314,843,401
• Number of trigrams
– 977,069,902
• Number of fourgrams
– 1,313,818,354
• Number of fivegrams
– 1,176,470,663
https://catalog.ldc.upenn.edu/ldc2006t13
Parameter Estimation
• Can we compute the conditional probabilities directly?
– No, because the data is sparse
• Markov assumption
• P(“musical” | “I would like two tickets for the”) = P(“musical” | “the”)
or
• P(“musical” | “I would like two tickets for the”) = P(“musical” | “for the”)
Maximum Likelihood Estimates
• Use training data
– Count how many times a given context appears in it.

• Unigram example:
– The word “pizza” appears 700 times in a corpus of 10,000,000 words.
– Therefore the MLE for its probability is P’(“pizza”) = 700/10,000,000 = 0.00007

• Bigram example:
– The word “with” appears 1,000 times in the corpus.
– The phrase “with spinach” appears 6 times
– Therefore the MLE is P’(spinach|with) = 6/1,000 = 0.006

• These estimates may not be good for corpora from other genres
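A minimal sketch of these count-based estimates; the toy corpus and function names below are illustrative, not from the slides:

```python
from collections import Counter

def mle_bigram_model(tokens):
    """Maximum likelihood estimates: P(w) = c(w)/N and P(w2|w1) = c(w1, w2)/c(w1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    p_unigram = lambda w: unigrams[w] / total
    p_bigram = lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return p_unigram, p_bigram

corpus = "pasta with spinach and pizza with spinach".split()
p_uni, p_bi = mle_bigram_model(corpus)
print(p_uni("spinach"))         # 2/7
print(p_bi("with", "spinach"))  # 2/2 = 1.0
```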
Probability of a Sentence
P(“<S> I will see you on Monday</S>”) =
P(I|<S>)
x P(will|I)
x P(see|will)
x P(you|see)
x P(on|you)
x P(Monday|on)
x P(</S>|Monday)
Probability of a Sentence
Example from Jane Austen
• P(“Elizabeth looked at Darcy”)
• Use maximum likelihood estimates for the n-gram probabilities
– unigram: P(wi) = c(wi)/N, where N is the number of tokens in the corpus
– bigram: P(wi|wi-1) = c(wi-1,wi)/c(wi-1)

• Values
– P(“Elizabeth”) = 474/617091 = .000768120
– P(“looked|Elizabeth”) = 5/474 = .010548523
– P(“at|looked”) = 74/337 = .219584569
– P(“Darcy|at”) = 3/4055 = .000739827

• Bigram probability
– P(“Elizabeth looked at Darcy”) = .000000001316 ≈ 1.3 × 10⁻⁹

• Unigram probability
– P(“Elizabeth looked at Darcy”) = 474/617091 × 337/617091 × 4055/617091 × 304/617091 = .000000000001357 ≈ 1.36 × 10⁻¹²

• P(“looked Darcy Elizabeth at”) = ?


Generative Models

• Unigram:
– generate a word, then generate the next one, until you generate </S>.

W1 W2 W3 … Wn </S>

• Bigram:
– generate <S>, generate a word, then generate the next one based on
the previous one, etc., until you generate </S>.
<S> W1 W2 W3 … Wn </S>
Engineering Trick

• The MLE values are often on the order of 10⁻⁶ or less
– Multiplying 20 such values gives a number on the order of 10⁻¹²⁰
– This leads to underflow
• Use logarithms instead
– 10⁻⁶ becomes −6 in log base 10
– Use sums instead of products
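A small sketch of the log-space trick; the probability values are invented:

```python
import math

# Bigram probabilities for one sentence (made-up values on the order of 1e-2 to 1e-6)
probs = [1e-3, 2e-4, 5e-2, 7e-6, 3e-3]

# The naive product can underflow for long sentences; the sum of logs does not.
log_prob = sum(math.log10(p) for p in probs)
print(log_prob)        # total log10 probability (about -15.7 for these values)
print(10 ** log_prob)  # the underlying probability, recoverable when needed
```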
Language Modeling Tools

• http://www.speech.cs.cmu.edu/SLM_info.html
• http://www.speech.sri.com/projects/srilm/
• https://kheafield.com/code/kenlm/
• http://htk.eng.cam.ac.uk/
NLP
Introduction to NLP

212.
Smoothing and Interpolation
Smoothing
• If the vocabulary size is |V|=1M
– Too many parameters to estimate even a unigram model
– MLE assigns values of 0 to unseen (yet not impossible) data
– Let alone bigram or trigram models
• Smoothing (regularization)
– Reassigning some probability mass to unseen data
Smoothing
• How to model novel words?
– Or novel bigrams?
– Distributing some of the probability mass to allow for novel events
• Add-one (Laplace) smoothing:
– Bigrams: P(wi|wi-1) = (c(wi-1,wi)+1)/(c(wi-1)+V)
– This method reassigns too much probability mass to unseen events
• Possible to use add-k instead of add-one
– Neither works well in practice
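A minimal sketch of add-k estimation (k = 1 gives add-one/Laplace smoothing); the counts and vocabulary size below are placeholders:

```python
def add_k_bigram_prob(w1, w2, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """P(w2|w1) = (c(w1,w2) + k) / (c(w1) + k*|V|); unseen bigrams get a small non-zero mass."""
    return (bigram_counts.get((w1, w2), 0) + k) / (unigram_counts.get(w1, 0) + k * vocab_size)

unigram_counts = {"with": 1000}
bigram_counts = {("with", "spinach"): 6}
V = 10000
print(add_k_bigram_prob("with", "spinach", bigram_counts, unigram_counts, V))  # (6+1)/(1000+10000)
print(add_k_bigram_prob("with", "quasar", bigram_counts, unigram_counts, V))   # (0+1)/(1000+10000)
```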
Dealing with Sparse Data

• Two main techniques used


– Backoff
– Interpolation
Backoff
• Going back to the lower-order n-gram model if the
higher-order model is sparse (e.g., frequency <= 1)
• Learning the parameters
– From a development data set
Interpolation
• If P’(wi|wi-1,wi-2) is sparse:
– Use λ1P’(wi|wi-1,wi-2) +λ2P’(wi|wi-1)+λ3P’(wi)
– Ensure that λ1+λ2+λ3=1, λ1,λ2,λ3≤1, λ1,λ2,λ3≥0
– Better than backoff
– Estimate the hyper-parameters λ1,λ2,λ3 from held-out data (or using
EM), e.g., using 5-fold cross-validation
• See [Chen and Goodman 1998] for more details
• Software:
– http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
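A minimal sketch of linear interpolation with fixed λ values; in practice the λs are tuned on held-out data as noted above, and all numbers here are placeholders:

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation: lambda1*P(w|w-2,w-1) + lambda2*P(w|w-1) + lambda3*P(w)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

# A trigram that was never observed still gets some probability mass
print(interpolated_prob(p_trigram=0.0, p_bigram=0.004, p_unigram=0.0001))
```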
NLP
Introduction to NLP

213.
Evaluation of Language Models
Evaluation of LM
• Extrinsic
– Use in an application
• Intrinsic
– Cheaper
– Based on information theory
• Correlate the two for validation purposes
Information Theory

• It is concerned with data transmission, data compression, and measuring the amount of information.
• Applies to statistical physics, economics, linguistics.
Information and Uncertainty
• The decrease in uncertainty is called information
• Example
– we know that a certain event will happen next week
– then we learn that it is more likely to happen on a workday
– the new information reduces the uncertainty
– the more new information we get, the smaller the remaining
uncertainty
Entropy

• Entropy tells us how informative a random variable is.
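For reference, the standard definition for a discrete random variable X (in bits when the logarithm is base 2):

```latex
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
```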


Examples
• One symbol (a)
– uncertainty is 0
• Two symbols (a,b)
– uncertainty is 1
– we can reduce it to 0 by using one bit of information (a=0,b=1)
• Four symbols (a,b,c,d)
– we need two bits of information (e.g., a=00,b=01,c=10,d=11)
• In general we need
– log2k bits, where k is the number of symbols
– note: this only holds if all symbols are equiprobable
Amount of Surprise
• Amount of surprise (given a general prob. distribution)
-log2 p(x) - for a specific outcome x
• If the distribution is uniform:
p(x) = 1/k
k = 1/p(x)
log2k = log2 (1/p(x)) = -log2 p(x)
• Average surprise

• Information need
H(X) = 0 means that we have all the information that we need
H(X) = 1 means that we need one bit of information, etc.
Entropy Example
• Sample 8-character language: A E I O U F G H

• Three bits per character if the characters are equiprobable
Simplified Polynesian
• Six characters: P T K A I U, not equiprobable

• This number (2.5) can lead one to believe that 3 bits per character are needed
– e.g. 000, 001, 010, 100, 101, 111
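The 2.5 figure comes from the distribution usually used in this textbook example (an assumption here, since the probability table is not reproduced above): p(a) = p(t) = 1/4 and p(p) = p(k) = p(i) = p(u) = 1/8.

```latex
H(X) = -\sum_x p(x)\log_2 p(x)
     = 2\cdot\tfrac{1}{4}\log_2 4 + 4\cdot\tfrac{1}{8}\log_2 8
     = 1 + 1.5 = 2.5 \text{ bits per character}
```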
Simplified Polynesian
• More efficient encoding

• Longer codes for less frequent characters


• This can lower the average number of bits per
character to the theoretical estimate of 2.5
• Under what assumption, though?
Joint Entropy
• Amount of information to specify both x and y.

• Measures the amount of surprise of seeing a specific tag bigram.
Conditional Entropy
• If we know x, how much additional information is
needed to know y.
Chain Rule for Entropy
• Chain rule for entropy
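For reference, the standard definitions behind the three preceding slides (joint entropy, conditional entropy, and the chain rule):

```latex
H(X, Y) = -\sum_{x}\sum_{y} p(x, y)\,\log_2 p(x, y)
H(Y \mid X) = -\sum_{x}\sum_{y} p(x, y)\,\log_2 p(y \mid x)
H(X, Y) = H(X) + H(Y \mid X)
```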
Probabilities of Syllables
• P(C,·) and P(·,V) - marginal probabilities

• P(C,V)
Surprise in Syllables

• Example
Polynesian Syllables (cont’d)
Polynesian Syllables (cont’d)
Pointwise Mutual Information
• Measured between two points (not two distributions)

• If p(x,y) = p(x)p(y), then I(x;y) = log 1 = 0 (independence)
Mutual Information

• Same, but for two distributions (not points)


• How much information does one of the distributions
contain about the other one.
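For reference, the standard formulas: pointwise mutual information between two outcomes, and mutual information between two random variables:

```latex
I(x; y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
\qquad
I(X; Y) = \sum_{x}\sum_{y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)}
```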
Kullback-Leibler (KL) Divergence
• Measures how far two distributions are from one another

• It measures the extra bits needed to encode data drawn from p when using a code optimized for q.


• Always non-negative
• D(p||q) = 0, iff p=q
• D(p||q) = ∞, iff ∃x∈X such that p(x)>0 and q(x)=0
• Not symmetric
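For reference, the standard definition, consistent with the properties listed above:

```latex
D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x)\,\log_2 \frac{p(x)}{q(x)}
```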
Divergence as Mutual Information
Perplexity
• Does the model fit the data?
– A good model will give a high probability to a real sentence
• Perplexity
– Average branching factor in predicting the next word
– Lower is better (lower perplexity -> higher probability)
– N = number of words

Per = P(w1 w2 … wN)^(−1/N), i.e., the N-th root of 1 / P(w1 w2 … wN)
Perplexity
• Example:
– A sentence consisting of N equiprobable words: p(wi) = 1/k

Per = P(w1 w2 … wN)^(−1/N)
– Per = ((1/k)^N)^(−1/N) = k
• Perplexity is like a (weighted) branching factor
• Logarithmic version
– the exponent is = #bits to encode each word

Per = 2^(−(1/N) Σ log2 P(wi))
The Shannon Game
• Consider the Shannon game:
– Connecticut governor Ned Lamont said ...
• What is the perplexity of guessing a digit if all digits are equally
likely? Do the math.
– 10
• How about a letter?
– 26

• How about guessing A (“operator”) with a probability of 1/4, B (“sales”) with a probability of 1/4, and 10,000 other cases with a total probability of 1/2?
– example modified from Joshua Goodman
Perplexity Across Distributions
• What if the actual distribution is very different from the
expected one?
• Example:
– All of the 10,000 other cases are equally likely but P(A) = P(B) = 0.
• Cross-entropy = log (perplexity), measured in bits
Sample Values for Perplexity
• Wall Street Journal (WSJ) corpus
– 38 M words (tokens)
– 20 K types
• Perplexity
– Evaluated on a separate 1.5M sample of WSJ documents
– Unigram 962
– Bigram 170
– Trigram 109
– More recent results – 47.7 (Yang et al. 2017, using AWD-LSTM)
Word Error Rate
• Another evaluation metric
– Number of insertions, deletions, and substitutions
– Normalized by sentence length
– Same as Levenshtein Edit Distance
• Example:
– governor Ned Lamont met with the mayor
– the governor met the senator
– 3 deletions + 1 insertion + 1 substitution = 5 edits (WER = 5/7 ≈ 0.71, normalizing by the 7-word sentence)
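A sketch of WER via word-level Levenshtein distance, normalizing by the reference length; the two sentences are the ones above:

```python
def word_error_rate(reference, hypothesis):
    """Minimum insertions + deletions + substitutions, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("governor Ned Lamont met with the mayor",
                      "the governor met the senator"))   # 5 edits / 7 words ≈ 0.71
```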
Issues

• Out of vocabulary words (OOV)


– Split the training set into two parts
– Label all words in part 2 that were not in part 1 as <UNK>
• Clustering
– e.g., dates, monetary amounts, organizations, years
Long Distance Dependencies
• This is where n-gram language models fail by definition
• Missing syntactic information
– The students who participated in the game are tired
– The student who participated in the game is tired
• Missing semantic information
– The pizza that I had last night was tasty
– The class that I had last night was interesting
Other Ideas in LM
• Skip-gram models
– Ms. Jane Doe, Ms. Mary Doe
• Syntactic models
– Condition words on other words that appear in a specific
syntactic relation with them
• Caching models
– Take advantage of the fact that words appear in bursts
Limitations
• Still no general solution for long-distance dependencies
• Cannot handle linear combinations, e.g.,
– Cats eat mice
– People eat broccoli
– Cats eat broccoli
– People eat mice
• Possible solution – use phrases or n-grams as features
(combinatorial explosion)
External Resources
• Google n-gram corpus
– http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
• Google book n-grams
– http://ngrams.googlelabs.com/
N-gram External Links
• http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
• http://norvig.com/mayzner.html
• http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
• https://books.google.com/ngrams/
• http://www.elsewhere.org/pomo/
• http://pdos.csail.mit.edu/scigen/
• http://www.magliery.com/Band/
• http://www.magliery.com/Country/
• http://johno.jsmf.net/knowhow/ngrams/index.php
• http://www.decontextualize.com/teaching/rwet/n-grams-and-markov-chains/
• http://gregstevens.com/2012/08/16/simulating-h-p-lovecraft
• http://kingjamesprogramming.tumblr.com/
NLP
Introduction to NLP

214.
The Noisy Channel Model
The Noisy Channel Model
• Example:
– Input: Written English (X)
– Encoder: garbles the input (X->Y)
– Output: Spoken English (Y)
• More examples:
– Grammatical English to English with mistakes
– English to bitmaps (characters)
• P(X,Y) = P(X)P(Y|X)
Encoding and Decoding

• Given f, guess e

e → encoder (E→F) → f → decoder (F→E) → e’

e’ = argmax_e P(e|f) = argmax_e P(f|e) P(e)

where P(f|e) is the translation model and P(e) is the language model
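A toy sketch of the decoder’s argmax over candidate translations; the candidates and their P(f|e), P(e) scores are invented for illustration:

```python
# Candidate English outputs for the French input "la maison blanche",
# with invented P(f|e) (translation model) and P(e) (language model) scores.
candidates = {
    "the house white": (0.20, 0.001),
    "the white house": (0.20, 0.010),
    "the red house":   (0.001, 0.008),
}

# e' = argmax_e P(f|e) * P(e)
best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
print(best)   # "the white house"
```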


Example

• Translate “la maison blanche”

Candidate e        P(f|e)   P(e)
cat rat piano
house white the
the house white
the red house
the small cat
the white house
Example

• Translate “la maison blanche”

Candidate e        P(f|e)   P(e)
cat rat piano      -        -
house white the    +        -
the house white
the red house
the small cat
the white house
Example

• Translate “la maison blanche”

Candidate e        P(f|e)   P(e)
cat rat piano      -        -
house white the    +        -
the house white    +        -
the red house      -        +
the small cat      -        +
the white house    +        +
Uses of the Noisy Channel Model

• Handwriting recognition
• Text generation
• Text summarization
• Machine translation
• Spelling correction
– See separate lecture on text similarity and edit distance
Spelling Correction

From Peter Norvig: http://norvig.com/ngrams/ch14.pdf


Features

• For each “e”:


– P(e)
– P(f|e)
– what else?
• What about some other task, e.g., POS tagging?
NLP
Introduction to NLP

215.
Part of Speech Tagging
The POS task
• Example
– Bahrainis vote in second round of parliamentary election
• Jabberwocky (by Lewis Carroll, 1872)
`Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
Penn Treebank tagset (1/2)
Tag Description Example
CC coordinating conjunction and
CD cardinal number 1
DT determiner the
EX existential there there is
FW foreign word d‘oeuvre
IN preposition/subordinating conjunction in, of, like
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal could, will
NN noun, singular or mass table
NNS noun plural tables
NNP proper noun, singular John
NNPS proper noun, plural Vikings
PDT predeterminer both the boys
POS possessive ending friend's
Penn Treebank tagset (2/2)
Tag Description Example
PRP personal pronoun I, he, it
PRP$ possessive pronoun my, his
RB adverb however, usually, naturally, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to to go, to him
UH interjection uhhuhhuhh
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, non-3rd person singular present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
Universal POS

http://universaldependencies.org/u/pos/
Universal Features

http://universaldependencies.org/u/feat/
Some Observations
• Ambiguity
– count (noun) vs. count (verb)
– 11% of all types but 40% of all tokens in the Brown
corpus are ambiguous.
– Examples
• like can be tagged as ADP VERB ADJ ADV NOUN
• present can be tagged as ADJ NOUN VERB ADV
POS Ambiguity

Example from J&M


Some Observations
• More examples:
– transport, object, discount, address
– content
• French pronunciation:
– est, président, fils
• Three main techniques:
– rule-based
– machine learning (e.g., conditional random fields, maximum entropy Markov models,
neural networks)
– transformation-based
• Useful for parsing, translation, text to speech, word sense
disambiguation, etc.
Example

• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/NNS


• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/VBZ
Rule-based POS tagging

• Use dictionary or finite-state transducers to find all possible parts of speech
• Use disambiguation rules
– e.g., ART+V
• Hundreds of constraints need to be designed manually
Example in French
<S> ^ beginning of sentence
La rf b nms u article
teneur nfs nms noun feminine singular
moyenne jfs nfs v1s v2s v3s adjective feminine singular
en p a b preposition
uranium nms noun masculine singular
des p r preposition
rivières nfp noun feminine plural
, x punctuation
bien_que cs subordinating conjunction
délicate jfs adjective feminine singular
à p preposition
calculer v verb
Sample Rules
• BS3 BI1
– A BS3 (3rd person subject personal pronoun) cannot be followed by a BI1 (1st person indirect personal
pronoun).
– In the example: “il nous faut” (= “we need”) – “il” has the tag BS3MS and “nous” has the tags [BD1P
BI1P BJ1P BR1P BS1P].
– The negative constraint “BS3 BI1” rules out “BI1P'', and thus leaves only 4 alternatives for the word
“nous”.
• NK
– The tag N (noun) cannot be followed by a tag K (interrogative pronoun); an example in the test corpus
would be: “... fleuve qui ...” (...river that...).
– Since “qui” can be tagged both as an “E” (relative pronoun) and a “K” (interrogative pronoun), the “E”
will be chosen by the tagger since an interrogative pronoun cannot follow a noun (“N”).
• RV
– A word tagged with R (article) cannot be followed by a word tagged with V (verb): for example “l'
appelle” (calls him/her).
– The word “appelle” can only be a verb, but “l'” can be either an article or a personal pronoun.
– Thus, the rule will eliminate the article tag, giving preference to the pronoun.
Classifier-based POS Tagging

• A baseline method would be to use a classifier to map each individual word into a likely POS tag
– Why is this method unlikely to work well?
Sources of Information

• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/NNS


• Bethlehem/NNP Steel/NNP Corp./NNP ,/, hammered/VBN by/IN higher/JJR costs/VBZ

• Knowledge about individual words


– lexical information
– spelling (-or)
– capitalization (IBM)
• Knowledge about neighboring words
Evaluation
• Baseline
– tag each word with its most likely tag
– tag each OOV word as a noun.
– around 90%
• Current accuracy
– around 97% for English
– compared to 98% human performance