NLP Unit-4

Language models assign probabilities to sentences, aiding various NLP applications like speech recognition and machine translation. N-gram models, which estimate probabilities based on prior context, can suffer from sparse data issues, necessitating smoothing techniques to adjust estimates. Syntactic parsing, using context-free grammars, helps in understanding sentence structure and relationships among words.

Language Models

• Formal grammars (e.g. regular, context-free) give a hard “binary” model of the legal sentences in a language.
• For NLP, a probabilistic model of a language that gives the probability that a string is a member of the language is more useful.
• To specify a correct probability distribution, the probabilities of all sentences in a language must sum to 1.
Uses of Language Models
• Speech recognition
– “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
• OCR & Handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context sensitive spelling correction
– “Their are problems wit this sentence.”
Completion Prediction

• A language model also supports predicting the completion of a sentence.
– Please turn off your cell _____
– Your program does not ______
• Predictive text input systems can guess what you are typing and give choices on how to complete it.
N-Gram Models
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)
• Number of parameters required grows exponentially with
the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
N-Gram Model Formulas

• Word sequences

    w_1^n = w_1 ... w_n

• Chain rule of probability

    P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1}) = \prod_{k=1}^{n} P(w_k | w_1^{k-1})

• Bigram approximation

    P(w_1^n) \approx \prod_{k=1}^{n} P(w_k | w_{k-1})

• N-gram approximation

    P(w_1^n) \approx \prod_{k=1}^{n} P(w_k | w_{k-N+1}^{k-1})
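To make the bigram approximation concrete, here is a minimal Python sketch that multiplies bigram probabilities over a sentence padded with <s> and </s>; the probability values are the ones quoted in the textbook example later in this unit, and the dict and function names are invented for illustration.

# Minimal sketch of the bigram approximation (all names are illustrative).
bigram_prob = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "english"): 0.0011,
    ("english", "food"): 0.5, ("food", "</s>"): 0.68,
}

def sentence_prob(words, probs):
    """P(w_1^n) ~= product over k of P(w_k | w_{k-1}), with <s>/</s> padding."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)   # unseen bigrams get probability 0 without smoothing
    return p

print(sentence_prob("i want english food".split(), bigram_prob))   # ~3.1e-05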
Estimating Probabilities
• N-gram conditional probabilities can be estimated
from raw text based on the relative frequency of
word sequences.
    Bigram:  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

    N-gram:  P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
• To have a consistent probabilistic model, append a
unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.
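A minimal sketch of this relative-frequency estimate in code, assuming sentences arrive as plain strings; it appends <s> and </s> as described above, and the toy corpus is the one shown on the "Given corpus" slide a few slides below. Names are illustrative.

from collections import Counter

def train_bigrams(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]   # append start/end symbols
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    # P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = train_bigrams(["I AM SAM", "SAM I AM", "I DO NOT LIKE GREEN VEG"])
print(probs[("<s>", "I")], probs[("I", "AM")])   # 2/3 and 2/3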
Generative Model & MLE
• An N-gram model can be seen as a probabilistic automaton for generating sentences.
Initialize sentence with N−1 <s> symbols
Until </s> is generated do:
Stochastically pick the next word based on the conditional
probability of each word given the previous N −1 words.

• Relative frequency estimates can be proven to be maximum likelihood estimates (MLE) since they maximize the probability that the model M will generate the training corpus T.

    \hat{\theta} = argmax_{\theta} P(T | M(\theta))
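A minimal sketch of the sampling loop above for the bigram (N = 2) case; the probability table is hand-built from the two sentences <s> I AM SAM </s> and <s> SAM I AM </s> (which also appear in the toy corpus a few slides below), and all names are illustrative.

import random

def generate(bigram_prob, max_len=20):
    sentence, prev = [], "<s>"          # initialize with N-1 = 1 start symbol
    for _ in range(max_len):
        # conditional distribution over next words given the previous word
        candidates = {w2: p for (w1, w2), p in bigram_prob.items() if w1 == prev}
        if not candidates:
            break
        words, weights = zip(*candidates.items())
        nxt = random.choices(words, weights=weights)[0]   # stochastically pick the next word
        if nxt == "</s>":                                  # stop once </s> is generated
            break
        sentence.append(nxt)
        prev = nxt
    return " ".join(sentence)

toy = {("<s>", "I"): 0.5, ("<s>", "SAM"): 0.5, ("I", "AM"): 1.0,
       ("AM", "SAM"): 0.5, ("AM", "</s>"): 0.5, ("SAM", "I"): 0.5, ("SAM", "</s>"): 0.5}
print(generate(toy))   # e.g. "I AM SAM" or "SAM I AM"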

Example from Textbook

• P(<s> i want english food </s>)
  = P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
  = .25 x .33 x .0011 x .5 x .68 = .000031
• P(<s> i want chinese food </s>)
  = P(i | <s>) P(want | i) P(chinese | want) P(food | chinese) P(</s> | food)
  = .25 x .33 x .0065 x .52 x .68 = .00019
The Chain Rule Applied to Compute the Joint Probability of Words in a Sentence

    P(w_1 w_2 ... w_n) = \prod_i P(w_i | w_1 w_2 ... w_{i-1})

P(“its water is so transparent”) =
    P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
How to Estimate These Probabilities
• Could we just count and divide?

    P(the | its water is so transparent that) =
        Count(its water is so transparent that the) / Count(its water is so transparent that)

• No! Too many possible sentences!
• We’ll never see enough data for estimating these.
Given corpus (training corpus):
<s> I AM SAM </s>
<s> SAM I AM </s>
<s> I DO NOT LIKE GREEN VEG </s>
Bigram probability estimates:
P(I | <s>) = 2/3 = 0.67
P(SAM | <s>) = 1/3 = 0.33
P(AM | I) = 2/3 = 0.67
P(</s> | SAM) = 1/2 = 0.50
P(SAM | AM) = ?
P(DO | I) = ?
Train and Test Corpora
• A language model must be trained on a large
corpus of text to estimate good parameter values.
• Model can be evaluated based on its ability to
predict a high probability for a disjoint (held-out)
test corpus (testing on the training corpus would
give an optimistically biased estimate).
• Ideally, the training (and test) corpus should be
representative of the actual application data.
• May need to adapt a general model to a small
amount of new (in-domain) data by adding highly
weighted small corpus to original training data.
Evaluation of Language Models
• Ideally, evaluate use of model in end application
(extrinsic)
– Realistic
– Expensive
• Evaluate on ability to model test corpus
(intrinsic).
– Less realistic
– Cheaper
• Verify at least once that intrinsic evaluation
correlates with an extrinsic one.
Perplexity
• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the
test corpus.
• Normalizes for the number of words in the test
corpus and takes the inverse.
    PP(W) = P(w_1 w_2 ... w_N)^{-1/N}

• Measures the weighted average branching factor in predicting the next word (lower is better).
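A minimal sketch of this computation, done in log space to avoid numerical underflow on long test corpora; the input format (one model probability per test word) and the function name are assumptions for illustration.

import math

def perplexity(word_probs):
    """PP(W) = P(w_1 ... w_N)^(-1/N), given the model's probability for each test word."""
    N = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / N)

# The digit-recognition mini-language of the next slide: each of 10 digits has P = 1/10.
print(perplexity([0.1] * 5))   # 10.0, regardless of the sequence length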
Sample Perplexity Evaluation

• The branching factor of a language is the


number of possible next words that can
follow any word. Consider the task of
recognizing the digits in English (zero, one,
two,..., nine), given that each of the 10
digits occurs with equal probability P = 1/
10 . The perplexity of this mini-language is
in fact 10.
• PP(W) = P(w1w2 ...wN) − 1/ N
• = ( (1 /10) N ) − 1/ N = 10
Smoothing
• Since there are a combinatorial number of possible
word sequences, many rare (but not impossible)
combinations never occur in training, so MLE
incorrectly assigns zero to many parameters (a.k.a.
sparse data).
• If a new combination occurs during testing, it is
given a probability of zero and the entire sequence
gets a probability of zero (i.e. infinite perplexity).
• In practice, parameters are smoothed (a.k.a.
regularized) to reassign some probability mass to
unseen events.
– Adding probability mass to unseen events requires
removing it from seen ones (discounting) in order to
maintain a joint distribution that sums to 1.
Laplace (Add-One) Smoothing
• “Hallucinate” additional training data in which each
possible N-gram occurs exactly once and adjust
estimates accordingly.
    Bigram:  P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

    N-gram:  P(w_n | w_{n-N+1}^{n-1}) = (C(w_{n-N+1}^{n-1} w_n) + 1) / (C(w_{n-N+1}^{n-1}) + V)
where V is the total number of possible (N−1)-grams
(i.e. the vocabulary size for a bigram model).
• Tends to reassign too much mass to unseen events, so it can be adjusted to add a fractional count δ with 0 < δ < 1 (normalized by δV instead of V).
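A minimal sketch of the add-one estimate, reusing the counting style of the earlier bigram sketch; the tiny corpus and all names are illustrative.

from collections import Counter

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, V):
    # P(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

sents = [["<s>", "i", "am", "sam", "</s>"], ["<s>", "sam", "i", "am", "</s>"]]
unigram_counts = Counter(w for s in sents for w in s)
bigram_counts = Counter(b for s in sents for b in zip(s, s[1:]))
V = len(unigram_counts)   # vocabulary size used for the bigram model
print(laplace_bigram_prob("i", "do", bigram_counts, unigram_counts, V))   # unseen bigram, yet > 0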
Advanced Smoothing

• Many advanced techniques have been developed to improve smoothing for language models.
– Good-Turing
– Interpolation
– Backoff
– Kneser-Ney
– Class-based (cluster) N-grams
A Problem for N-Grams:
Long Distance Dependencies
• Many times local context does not provide the
most useful predictive clues, which instead are
provided by long-distance dependencies.
– Syntactic dependencies
• “The man next to the large oak tree near the grocery store on
the corner is tall.”
• “The men next to the large oak tree near the grocery store on
the corner are tall.”
– Semantic dependencies
• “The bird next to the large oak tree near the grocery store on
the corner flies rapidly.”
• “The man next to the large oak tree near the grocery store on
the corner talks rapidly.”
• More complex models of language are needed to
handle such dependencies.
Summary
• Language models assign a probability that a
sentence is a legal string in a language.
• They are useful as a component of many NLP
systems, such as ASR, OCR, and MT.
• Simple N-gram models are easy to train on
unsupervised corpora and can provide useful
estimates of sentence likelihood.
• MLE gives inaccurate parameters for models
trained on sparse data.
• Smoothing techniques adjust parameter estimates
to account for unseen (but not impossible) events.
Syntactic Parsing
• Syntax: the way words are arranged together
• Main ideas of syntax:
– Constituency
• Groups of words may behave as a single unit or phrase, called a constituent.
• CFG, a formalism allowing us to model the
constituency facts.
– Grammatical relations
• A formalization of ideas from traditional grammar
about SUBJECT and OBJECT
– Subcategorization and dependencies
• Referring to certain kinds of relations between words and phrases, e.g., the verb want can be followed by an infinitive, as in I want to fly to Detroit.
Background
• All of these kinds of syntactic knowledge can be modeled by various kinds of CFG-based grammars.
• CFGs are thus the backbone of many models of the syntax of natural language.
– They are integral to most models of NLU, of grammar checking, and more recently of speech understanding.
• They are powerful enough to express sophisticated relations among the words in a sentence, yet computationally tractable enough that efficient algorithms exist for parsing sentences with them.
• There is also a probabilistic version of CFG.

9.1 Constituency
• NP:
– A sequence of words surrounding at least one noun, e.g.,
• three parties from Brooklyn arrive
• a high-class spot such as Mindy’s attracts
• the Broadway coppers love
• They sit
• Harry the Horse
• the reason he comes into the Hot Box
• Evidence of constituency
– The above NPs can all appear in similar syntactic environments, e.g., before a verb.
– Preposed or postposed constructions: e.g., the PP on September seventeenth can be placed in a number of different locations:
• On September seventeenth, I’d like to fly from Atlanta to Denver.
• I’d like to fly on September seventeenth from Atlanta to Denver.
• I’d like to fly from Atlanta to Denver on September seventeenth.

9.2 Context-Free Rules and Trees
• CFG (or Phrase-Structure Grammar)
– The most commonly used mathematical system for modeling constituent structure in English and other natural languages
– Terminals and non-terminals
– Derivation
– Parse tree
– Start symbol
• Example parse tree (figure): [NP [Det a] [Nom [Noun flight]]]


9.2 Context-Free Rules and Trees

The lexicon for L0:
Noun → flight | breeze | trip | morning | …
Verb → is | prefer | like | need | want | fly | …
Adjective → cheapest | non-stop | first | latest | other | direct | …
Pronoun → me | I | you | it | …
Proper-Noun → Alaska | Baltimore | Los Angeles | Chicago | United | American | …
Determiner → the | a | an | this | these | that | …
Preposition → from | to | on | near | …
Conjunction → and | or | but | …

The grammar for L0 (with example phrases):
S → NP VP                 (I + want a morning flight)
NP → Pronoun              (I)
   | Proper-Noun          (Los Angeles)
   | Det Nominal          (a + flight)
Nominal → Noun Nominal    (morning + flight)
   | Noun                 (flights)
VP → Verb                 (do)
   | Verb NP              (want + a flight)
   | Verb NP PP           (leave + Boston + in the morning)
   | Verb PP              (leaving + on Thursday)
PP → Preposition NP       (from + Los Angeles)
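As one way to experiment with this grammar, the sketch below encodes a fragment of L0 with NLTK's CFG tools (assuming NLTK is installed); hyphenated category names are written without hyphens, Determiner is abbreviated Det, and only a handful of lexical entries are included.

import nltk

# A fragment of the L0 grammar and lexicon; terminals are quoted.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | ProperNoun | Det Nominal
Nominal -> Noun Nominal | Noun
VP -> Verb | Verb NP | Verb NP PP | Verb PP
PP -> Preposition NP
Pronoun -> 'I'
ProperNoun -> 'Boston'
Det -> 'a' | 'the'
Noun -> 'morning' | 'flight'
Verb -> 'want' | 'leave'
Preposition -> 'from' | 'on' | 'in'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I want a morning flight".split()):
    print(tree)   # (S (NP (Pronoun I)) (VP (Verb want) (NP (Det a) (Nominal ...))))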
Sentence-Level Constructions
• There are a great number of possible overall sentence structures, but four are particularly common and important:
– Declarative structure, imperative structure, yes-no-question structure, and wh-question structure.
• Sentences with declarative structure
– A subject NP followed by a VP
• The flight should be eleven a.m. tomorrow.
• I need a flight to Seattle leaving from Baltimore
making a stop in Minneapolis.
• The return flight should leave at around seven p.m.
• I would like to find out the flight number for the
United flight that arrives in San Jose around ten p.m.
• I’d like to fly the coach discount class.
• I want a flight from Ontario to Chicago.
9.3 Sentence-Level Constructions

• Sentences with imperative structure
– Begin with a VP and have no subject.
– Always used for commands and suggestions
• Show the lowest fare.
• Show me the cheapest fare that has lunch.
• Give me Sunday’s flight arriving in Las Vegas from
Memphis and New York City.
• List all flights between five and seven p.m.
• List all flights from Burbank to Denver.
• Show me all flights that depart before ten a.m. and
have first class fares.
• Show me all the flights leaving Baltimore.
– S → VP
9.3 Sentence-Level Constructions
• Sentences with yes-no-question structure
– Begin with auxiliary, followed by a subject NP, followed by a VP.
• Do any of these flights have stops?
• Does American’s flight eighteen twenty five serve
dinner?
• Can you give me the same information for United?
– S → Aux NP VP

Background
• Syntactic parsing
– The task of recognizing a sentence and assigning a syntactic structure to it
• Since CFGs are a declarative formalism, they do not
specify how the parse tree for a given sentence should be
computed.
• Parse trees are useful in applications such as
– Grammar checking
– Semantic analysis
– Machine translation
– Question answering
– Information extraction



Parsing as Search

• The parser can be viewed as searching through the space of all possible parse trees to find the correct parse tree for the sentence.
• How can we use the grammar to produce the parse tree?



Parsing as Search
• Top-down parsing (example search-tree figure omitted)

Parsing as Search
• Bottom-up parsing (example search-tree figure omitted)


10.2 A Basic Top-Down Parser
• Uses a depth-first strategy
• A top-down, depth-first, left-to-right derivation (the worked derivation figures are omitted); a sketch of such a parser follows below.
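A minimal sketch of such a top-down, depth-first, left-to-right parser, written as recursive descent with backtracking over a small L0-style fragment; the grammar dictionary and names are illustrative, and a left-recursive rule (e.g. Nominal → Nominal PP) would make it recurse without bound, which anticipates the problems listed next.

# Top-down, depth-first, left-to-right parsing sketch (illustrative grammar fragment).
GRAMMAR = {
    "S":       [["NP", "VP"]],
    "NP":      [["Pronoun"], ["Det", "Nominal"]],
    "Nominal": [["Noun", "Nominal"], ["Noun"]],
    "VP":      [["Verb", "NP"], ["Verb"]],
    "Pronoun": [["I"]], "Det": [["a"]],
    "Noun":    [["morning"], ["flight"]], "Verb": [["want"]],
}

def parse(symbol, words, i):
    """Try to expand `symbol` starting at position i; yield (tree, next_position) pairs."""
    if symbol not in GRAMMAR:                       # terminal: must match the next input word
        if i < len(words) and words[i] == symbol:
            yield symbol, i + 1
        return
    for rhs in GRAMMAR[symbol]:                     # try each production, depth-first
        def expand(children, j, k):
            if k == len(rhs):                       # whole right-hand side matched
                yield (symbol, children), j
                return
            for child, j2 in parse(rhs[k], words, j):    # expand symbols left to right
                yield from expand(children + [child], j2, k + 1)
        yield from expand([], i, 0)

words = "I want a morning flight".split()
for tree, end in parse("S", words, 0):
    if end == len(words):                           # accept only parses covering the whole input
        print(tree)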


10.3 Problems with the Basic Top-Down Parser
• Problems with the top-down parser
– Left-recursion
– Ambiguity
– Inefficient reparsing of subtrees



10.3 Problems with the Basic Top-Down Parser
Ambiguity
• Parsers which do not incorporate disambiguators may
simply return all the possible parse trees for a given input.
• We do not want all possible parses from the robust, highly
ambiguous, wide-coverage grammars used in practical
applications.
• Reason:
– Potentially exponential number of parses that are possible for certain inputs
– Given the ATIS example:
• Show me the meal on Flight UA 386 from San
Francisco to Denver.
– The three PP’s at the end of this sentence yield a total of 14 parse trees for
this sentence.



Statistical Parsing
• Statistical parsing uses a probabilistic model
of syntax in order to assign probabilities to
each parse tree.
• Provides principled approach to resolving
syntactic ambiguity.
• Allows supervised learning of parsers from
tree-banks of parse trees provided by human
linguists.
• Also allows unsupervised learning of parsers
from unannotated text, but the accuracy of
such parsers has been limited.
SCFG (stochastic context-free grammars; the slide content was presented as figures and is omitted here, see the PCFG definition below)
Probabilistic Context Free Grammar
(PCFG)
• A PCFG is a probabilistic version of a CFG where each
production has a probability.
• Probabilities of all productions rewriting a given non-
terminal must add to 1, defining a distribution for each
non-terminal.
• String generation is now probabilistic where production
probabilities are used to non-deterministically select a
production for rewriting a given non-terminal.

Simple PCFG for ATIS English

Grammar (with rule probabilities):
S → NP VP              0.8
S → Aux NP VP          0.1
S → VP                 0.1     (S rules sum to 1.0)
NP → Pronoun           0.2
NP → Proper-Noun       0.2
NP → Det Nominal       0.6     (NP rules sum to 1.0)
Nominal → Noun             0.3
Nominal → Nominal Noun     0.2
Nominal → Nominal PP       0.5     (Nominal rules sum to 1.0)
VP → Verb              0.2
VP → Verb NP           0.5
VP → VP PP             0.3     (VP rules sum to 1.0)
PP → Prep NP           1.0

Lexicon (with word probabilities):
Det → the 0.6 | a 0.2 | that 0.1 | this 0.1
Noun → book 0.1 | flight 0.5 | meal 0.2 | money 0.2
Verb → book 0.5 | include 0.2 | prefer 0.3
Pronoun → I 0.5 | he 0.1 | she 0.1 | me 0.3
Proper-Noun → Houston 0.8 | NWA 0.2
Aux → does 1.0
Prep → from 0.25 | to 0.25 | on 0.1 | near 0.2 | through 0.2
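A minimal sketch encoding this PCFG as a Python structure and checking that each non-terminal's production probabilities sum to 1, as required above; the representation (right-hand sides paired with probabilities) is just one reasonable choice.

PCFG = {
    "S":       [(("NP", "VP"), 0.8), (("Aux", "NP", "VP"), 0.1), (("VP",), 0.1)],
    "NP":      [(("Pronoun",), 0.2), (("Proper-Noun",), 0.2), (("Det", "Nominal"), 0.6)],
    "Nominal": [(("Noun",), 0.3), (("Nominal", "Noun"), 0.2), (("Nominal", "PP"), 0.5)],
    "VP":      [(("Verb",), 0.2), (("Verb", "NP"), 0.5), (("VP", "PP"), 0.3)],
    "PP":      [(("Prep", "NP"), 1.0)],
    "Det":     [(("the",), 0.6), (("a",), 0.2), (("that",), 0.1), (("this",), 0.1)],
    "Noun":    [(("book",), 0.1), (("flight",), 0.5), (("meal",), 0.2), (("money",), 0.2)],
    "Verb":    [(("book",), 0.5), (("include",), 0.2), (("prefer",), 0.3)],
    "Pronoun": [(("I",), 0.5), (("he",), 0.1), (("she",), 0.1), (("me",), 0.3)],
    "Proper-Noun": [(("Houston",), 0.8), (("NWA",), 0.2)],
    "Aux":     [(("does",), 1.0)],
    "Prep":    [(("from",), 0.25), (("to",), 0.25), (("on",), 0.1), (("near",), 0.2), (("through",), 0.2)],
}

# Each non-terminal's productions must define a probability distribution.
for lhs, productions in PCFG.items():
    total = sum(p for _, p in productions)
    assert abs(total - 1.0) < 1e-9, f"{lhs} productions sum to {total}, not 1"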
Sentence Probability
• Assume productions for each node are chosen independently.
• Probability of a derivation is the product of the probabilities of its productions.

Derivation D1 for "book the flight through Houston" (the PP attached to the Nominal):
[S [VP [Verb book] [NP [Det the] [Nominal [Nominal [Noun flight]] [PP [Prep through] [NP [Proper-Noun Houston]]]]]]]

P(D1) = 0.1 x 0.5 x 0.5 x 0.6 x 0.6 x 0.5 x 0.3 x 1.0 x 0.2 x 0.2 x 0.5 x 0.8 = 0.0000216
Syntactic Disambiguation
• Resolve ambiguity by picking the most probable parse tree.

Derivation D2 for "book the flight through Houston" (the PP attached to the VP):
[S [VP [VP [Verb book] [NP [Det the] [Nominal [Noun flight]]]] [PP [Prep through] [NP [Proper-Noun Houston]]]]]

P(D2) = 0.1 x 0.3 x 0.5 x 0.6 x 0.5 x 0.6 x 0.3 x 1.0 x 0.5 x 0.2 x 0.2 x 0.8 = 0.00001296
Sentence Probability
• Probability of a sentence is the sum of the probabilities of
all of its derivations.

P(“book the flight through Houston”) = P(D1) + P(D2) = 0.0000216 + 0.00001296 = 0.00003456
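A minimal sketch that recomputes P(D1), P(D2), and their sum from the rule probabilities above; the two derivations are written out as explicit rule lists, and all names are illustrative.

from math import prod   # Python 3.8+

RULE_P = {
    ("S", ("VP",)): 0.1, ("VP", ("Verb", "NP")): 0.5, ("VP", ("VP", "PP")): 0.3,
    ("NP", ("Det", "Nominal")): 0.6, ("NP", ("Proper-Noun",)): 0.2,
    ("Nominal", ("Noun",)): 0.3, ("Nominal", ("Nominal", "PP")): 0.5,
    ("PP", ("Prep", "NP")): 1.0, ("Verb", ("book",)): 0.5, ("Det", ("the",)): 0.6,
    ("Noun", ("flight",)): 0.5, ("Prep", ("through",)): 0.2, ("Proper-Noun", ("Houston",)): 0.8,
}

# D1 attaches the PP inside the NP (Nominal -> Nominal PP);
# D2 attaches the PP to the VP (VP -> VP PP).
D1 = [("S", ("VP",)), ("VP", ("Verb", "NP")), ("Verb", ("book",)),
      ("NP", ("Det", "Nominal")), ("Det", ("the",)), ("Nominal", ("Nominal", "PP")),
      ("Nominal", ("Noun",)), ("Noun", ("flight",)), ("PP", ("Prep", "NP")),
      ("Prep", ("through",)), ("NP", ("Proper-Noun",)), ("Proper-Noun", ("Houston",))]
D2 = [("S", ("VP",)), ("VP", ("VP", "PP")), ("VP", ("Verb", "NP")), ("Verb", ("book",)),
      ("NP", ("Det", "Nominal")), ("Det", ("the",)), ("Nominal", ("Noun",)),
      ("Noun", ("flight",)), ("PP", ("Prep", "NP")), ("Prep", ("through",)),
      ("NP", ("Proper-Noun",)), ("Proper-Noun", ("Houston",))]

p1 = prod(RULE_P[r] for r in D1)   # ~0.0000216
p2 = prod(RULE_P[r] for r in D2)   # ~0.00001296
print(p1, p2, p1 + p2)             # sentence probability ~0.00003456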
Three Useful PCFG Tasks
• Observation likelihood: To classify and order sentences.
• Most likely derivation: To determine the most likely parse
tree for a sentence.
• Maximum likelihood training: To train a PCFG to fit
empirical training data.

Treebanks
• A treebank is a parsed text corpus that annotates syntactic or semantic sentence structure.
• English Penn Treebank: the standard corpus for testing syntactic parsing; consists of 1.2 M words of text from the Wall Street Journal (WSJ).
• Typical to train on about 40,000 parsed sentences and test on an additional standard disjoint test set of 2,416 sentences.
• Chinese Penn Treebank: 100K words from the Xinhua news service.
Parsing Evaluation Metrics
• PARSEVAL metrics measure the fraction of the
constituents that match between the computed
and human parse trees. If P is the system’s parse
tree and T is the human parse tree:
– Recall = (# correct constituents in P) / (# constituents in T)
– Precision = (# correct constituents in P) / (# constituents in P)
• Labeled Precision and labeled recall require
getting the non-terminal label on the constituent
node correct to count as correct.
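A minimal sketch of PARSEVAL-style labeled precision and recall over constituent spans, each represented as a (label, start, end) triple; the example span sets loosely score the VP-attachment parse of "book the flight through Houston" against the Nominal-attachment parse as gold, with pre-terminal nodes omitted, and are purely illustrative.

def parseval(system_spans, gold_spans):
    P, T = set(system_spans), set(gold_spans)
    correct = len(P & T)                           # labeled constituents present in both trees
    precision = correct / len(P) if P else 0.0     # correct / constituents in system parse P
    recall = correct / len(T) if T else 0.0        # correct / constituents in gold parse T
    return precision, recall

# Hypothetical (label, start, end) spans over "book the flight through Houston".
gold = {("S", 0, 5), ("VP", 0, 5), ("NP", 1, 5), ("Nominal", 2, 5),
        ("Nominal", 2, 3), ("PP", 3, 5), ("NP", 4, 5)}
system = {("S", 0, 5), ("VP", 0, 5), ("VP", 0, 3), ("NP", 1, 3),
          ("Nominal", 2, 3), ("PP", 3, 5), ("NP", 4, 5)}
print(parseval(system, gold))   # 5/7 precision, 5/7 recall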

