
Natural Language Processing
Lecture 5: Sequence Labeling with Hidden Markov Models. Part-of-Speech Tagging.

9/20/2024

COMS W4705
Daniel Bauer
Garden-Path Sentences
• The horse raced past the barn.

• The horse raced past the barn fell.

• The old dog the footsteps of the young.

• The cotton clothing is made of grows in Egypt.


Garden-Path Sentences
• Why does this happen?

  The horse raced/VBD (past tense verb) past the barn fell/???

• raced can be a past tense verb or a past participle (indicating passive voice).

• The verb interpretation is more likely before fell is read.


Garden-Path Sentences
• Why does this happen?

  [The horse raced/VBN (past participle) past the barn]NP fell/VBD

• raced can be a past tense verb or a past participle (indicating passive voice).

• Once fell is read, the past-tense-verb interpretation of raced becomes impossible.


Garden-Path Sentences
• Why does this happen?

  [The old/JJ (adjective) dog/NN]NP [the footsteps of the young]NP

• dog can be a noun or a verb (plural, present tense)


Garden-Path Sentences
• Why does this happen?

  [The old/NNS]NP dog/VB [the footsteps of the young]NP

• dog can be a noun or a verb (plural, present tense)


Parts-of-Speech
• Classes of words that behave alike:
• Appear in similar contexts.
• Perform a similar grammatical function in the sentence.
• Undergo similar morphological transformations.
• Have similar meaning.

• ~9 traditional parts-of-speech:
• noun, pronoun, determiner, adjective, verb, adverb,
preposition, conjunction, interjection
Syntactic Ambiguities and
Parts-of-Speech
• Time flies like an arrow.
    (Time: N or V?   flies: N or V?   like: V or Preposition?)
Syntactic Ambiguities and
Parts-of-Speech
• [Time/N flies/N]NP like/V an arrow.
Why do we need P.O.S.?
• Interacts with most levels of linguistic representation.

• Speech processing (pronunciation/stress depends on the P.O.S.):
  • lead (V) vs. lead (N)
  • insult, insult
  • object, object
  • content, content

• Syntactic parsing

• …

• P.O.S. tag-set should contain morphological and maybe syntactic information.
Penn Treebank Tagset
CC   Coordinating conjunction                 | PRP$ Possessive pronoun
CD   Cardinal number                          | RB   Adverb
DT   Determiner                               | RBR  Adverb, comparative
EX   Existential there                        | RBS  Adverb, superlative
FW   Foreign word                             | RP   Particle
IN   Preposition or subordinating conjunction | SYM  Symbol
JJ   Adjective                                | TO   to
JJR  Adjective, comparative                   | UH   Interjection
JJS  Adjective, superlative                   | VB   Verb, base form
LS   List item marker                         | VBD  Verb, past tense
MD   Modal                                    | VBG  Verb, gerund or present participle
NN   Noun, singular or mass                   | VBN  Verb, past participle
NNS  Noun, plural                             | VBP  Verb, non-3rd person singular present
NNP  Proper noun, singular                    | VBZ  Verb, 3rd person singular present
NNPS Proper noun, plural                      | WP   Wh-pronoun
PDT  Predeterminer                            | WP$  Possessive wh-pronoun
POS  Possessive ending                        | WRB  Wh-adverb
PRP  Personal pronoun                         | plus punctuation symbols
P.O.S. Tagsets
• Tagset is language specific.

• Some languages capture more morphological information, which should be reflected in the tag set.

• “Universal Part Of Speech Tags?”
  • Petrov et al. 2011: Mapping of 25 language-specific tag-sets to a common set of 12 universal tags.
  • The "Universal Dependencies" framework uses 17 tags.
    https://universaldependencies.org/u/pos/
Part-of-Speech Tagging
• Goal: Assign a part-of-speech label to each word in a
sentence.

DT NN VBD DT NNS IN DT NN .
the koala put the keys on the table .

• This is an example of a sequence labeling task.

• Think of this as a translation task from a sequence of words (w1, w2, …, wn) ∈ V*, to a sequence of tags (t1, t2, …, tn) ∈ T*.
Part-of-Speech Tagging
• Goal: Translate from a sequence of words
(w1, w2, …, wn) ∈ V*, to a sequence of tags
( t1, t2, …, tn ) ∈ T*.

• NLP is full of translation problems from one structure to another. Basic solution:

• For each translation step:

  1. Construct a search space of possible translations.

  2. Find best paths through this space (decoding) according to some performance measure.
Bayesian Inference for
Sequence Labeling
• Recall Bayesian Inference (Generative Models): Given
some observation, infer the value of some hidden
variable. (see Naive Bayes’)

• We can apply this approach to sequence labeling:

  • Assume each word wi in the observed sequence (w1, w2, …, wn) ∈ V* was generated by some hidden variable ti.

  • Infer the most likely sequence of hidden variables given the sequence of observed words.
Noisy Channel Model
  source:   “NN VBZ IN DT NN”              generated with P(tags)
  channel:  P(words | tags)
  observed: “time flies like an arrow”

• Goal: figure out what the original input to the channel was. Use Bayes’ rule:

  argmax over t1…tn of P(t1…tn | w1…wn)  =  argmax over t1…tn of P(w1…wn | t1…tn) · P(t1…tn)

• This model is used widely (speech recognition, MT).
Hidden Markov Models (HMMs)
• Generative (Bayesian) probability model.
  Observations: sequences of words.
  Hidden states: sequences of part-of-speech labels.

  hidden:    START  NN    VBZ    IN    DT   NN
  observed:         time  flies  like  an   arrow

• The hidden sequence is generated by an n-gram language model (typically a bi-gram model), with t0 = START.
Markov Chains
  [Figure: Markov chain over the tag states DT, NN, IN, VBZ plus a start state, with weighted transitions.]

• A Markov chain is a sequence of random variables X1, X2, …

• The domain of these variables is a set of states.

• Markov assumption: The next state depends only on the current state.

• This is a special case of a weighted finite state automaton (WFSA).
Hidden Markov Models (HMMs)
• There are two types of probabilities:
Transition probabilities and Emission Probabilities.

  transition probabilities:  start → t1 → t2 → t3
  emission probabilities:    t1 → w1,  t2 → w2,  t3 → w3
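To make the generative story concrete, the sketch below samples a tag sequence from the transition distribution and then emits a word from each tag. This is a minimal illustration only; the probability tables are made-up toy values, not numbers from the lecture.

```python
import random

# Toy transition and emission tables (assumed for illustration only).
transitions = {
    "START": {"DT": 0.7, "NN": 0.3},
    "DT":    {"NN": 1.0},
    "NN":    {"VBZ": 0.6, "IN": 0.2, "END": 0.2},
    "VBZ":   {"IN": 0.5, "END": 0.5},
    "IN":    {"DT": 1.0},
}
emissions = {
    "DT":  {"the": 0.6, "an": 0.4},
    "NN":  {"time": 0.4, "arrow": 0.3, "horse": 0.3},
    "VBZ": {"flies": 0.6, "races": 0.4},
    "IN":  {"like": 0.7, "past": 0.3},
}

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    r, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if r < total:
            return value
    return value  # numerical safety net

def generate():
    """Sample a (tags, words) pair following the HMM's generative story."""
    tag, tags, words = "START", [], []
    while True:
        tag = sample(transitions[tag])        # transition: next tag given current tag
        if tag == "END":
            return tags, words
        tags.append(tag)
        words.append(sample(emissions[tag]))  # emission: word given tag

print(generate())
```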
Important Tasks on HMMs
• Decoding: Given a sequence of words, find the most likely tag sequence.
  (Bayesian inference using the Viterbi algorithm)

• Evaluation: Given a sequence of words, find the total probability for this word sequence given an HMM. Note that we can view the HMM as another type of language model. (Forward algorithm)

• Training: Estimate emission and transition probabilities from training data. (MLE, Forward-Backward a.k.a. Baum-Welch algorithm)
Decoding HMMs
  [Trellis: one column per word of “time flies like an arrow”, each column containing the candidate tags VBZ, IN, NN, DT.]

Goal: Find the path with the highest total probability (given the words).

There are d^n paths for n words and d tags.
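To see the d^n blow-up concretely, the sketch below enumerates every tag sequence for the example sentence with itertools.product; the scoring step is only described in a comment. This brute-force approach is feasible only for tiny examples, which is what motivates the Viterbi algorithm.

```python
from itertools import product

words = ["time", "flies", "like", "an", "arrow"]
tags = ["VBZ", "IN", "NN", "DT"]

# Brute force: enumerate all d^n tag sequences (4^5 = 1024 here).
all_paths = list(product(tags, repeat=len(words)))
print(len(all_paths))   # 1024
print(all_paths[0])     # ('VBZ', 'VBZ', 'VBZ', 'VBZ', 'VBZ')

# Scoring each path with transition and emission probabilities and taking the
# argmax would find the best tagging, but the number of paths grows
# exponentially in the sentence length -- hence the Viterbi algorithm.
```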
Emission Probabilities
• P(time | VBZ) = 0.2
  P(flies | VBZ) = 0.3
  P(like | VBZ) = 0.5

• P(time | NN) = 0.3
  P(flies | NN) = 0.2
  P(arrow | NN) = 0.5

• P(like | IN) = 1.0

• P(an | DT) = 1.0

(these are used in the worked example below)
Viterbi Algorithm (worked example)

  Trellis for “time flies like an arrow” over the tags VBZ, IN, NN, DT
  (lattice figures omitted; only the computed scores are shown):

  time:   π[1,VBZ] = .1 × .2 = .02        π[1,NN] = .2 × .3 = .06
          π[1,IN]  = 0                    π[1,DT] = 0

  flies:  π[2,VBZ] = .06 × .6 × .3 = .0108                        (from NN)
          π[2,NN]  = max(.02 × .4 × .2, .06 × .2 × .2)
                   = max(.0016, .0024) = .0024                    (from NN)

  like:   π[3,VBZ] = .0024 × .6 × .5 = .00072                     (from NN)
          π[3,IN]  = max(.0108 × .2 × 1, .0024 × .2 × 1)
                   = max(.00216, .00048) = .00216                 (from VBZ)

  an:     π[4,DT]  = max(.00072 × .4 × 1, .00216 × .7 × 1)
                   = max(.000288, .001512) = .001512              (from IN)

  arrow:  π[5,NN]  = .001512 × 1.0 × .5 = .000756                 (from DT)

  Best path: time/NN flies/VBZ like/IN an/DT arrow/NN, with probability .000756.

• Idea: Because of the Markov assumption, we only need the probabilities for Xn to compute the probabilities for Xn+1.
  This suggests a dynamic programming algorithm.
Viterbi Algorithm
• Input: Sequence of observed words w1, …, wn

• Create a table π, such that each entry π[k,t] contains the score of the highest-probability sequence ending in tag t at time k.

• initialize π[0,START] = 1.0 and π[0,t] = 0.0 for all tags t ∈ T.

• for k = 1 to n:
  • for t ∈ T:
      π[k,t] = max over t' ∈ T of  π[k-1,t'] · P(t | t') · P(wk | t)
                                   (transition probability)  (emission probability)

• return max over t ∈ T of π[n,t]
  (follow backpointers to recover the most likely tag sequence)
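The pseudocode above translates almost line by line into Python. The sketch below uses the emission probabilities from the earlier slide and the transition probabilities read off the worked example (collecting them into one table is an assumption; the slides only show them on the lattice edges). It reproduces the best path and its score of about .000756.

```python
# Toy HMM from the lecture example. The emission table is from the
# "Emission Probabilities" slide; the transition table is read off the
# worked Viterbi example (an assumption -- the slides never list it in full).
transitions = {
    "START": {"VBZ": 0.1, "NN": 0.2},
    "NN":    {"VBZ": 0.6, "NN": 0.2, "IN": 0.2},
    "VBZ":   {"NN": 0.4, "IN": 0.2, "DT": 0.4},
    "IN":    {"DT": 0.7},
    "DT":    {"NN": 1.0},
}
emissions = {
    "VBZ": {"time": 0.2, "flies": 0.3, "like": 0.5},
    "NN":  {"time": 0.3, "flies": 0.2, "arrow": 0.5},
    "IN":  {"like": 1.0},
    "DT":  {"an": 1.0},
}
TAGS = ["VBZ", "IN", "NN", "DT"]

def viterbi(words):
    """Return (best_prob, best_tag_sequence) for the word sequence."""
    # pi[k][t] = score of the best tag sequence ending in tag t after k words
    pi = [{"START": 1.0}]
    backpointers = []
    for k, word in enumerate(words, start=1):
        pi.append({})
        backpointers.append({})
        for t in TAGS:
            emit = emissions[t].get(word, 0.0)
            best_prev, best_score = None, 0.0
            for t_prev, prev_score in pi[k - 1].items():
                score = prev_score * transitions[t_prev].get(t, 0.0) * emit
                if score > best_score:
                    best_prev, best_score = t_prev, score
            pi[k][t] = best_score
            backpointers[k - 1][t] = best_prev
    # Pick the best final tag and follow backpointers to recover the path.
    best_tag = max(pi[-1], key=pi[-1].get)
    path = [best_tag]
    for bp in reversed(backpointers[1:]):
        path.append(bp[path[-1]])
    path.reverse()
    return pi[-1][best_tag], path

print(viterbi("time flies like an arrow".split()))
# -> best probability ~ 0.000756 with tags ['NN', 'VBZ', 'IN', 'DT', 'NN']
```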
Trigram Language Model
• Instead of using a unigram context P(ti | ti-1), use a bigram context P(ti | ti-2, ti-1).

• Think of this as having states that represent pairs of tags.

• So the HMM probability for a given tag and word sequence is (with t-1 = t0 = START):

  P(t1, …, tn, w1, …, wn) = ∏ i=1..n  P(ti | ti-2, ti-1) · P(wi | ti)

• Need to handle data sparseness when estimating transition probabilities (for example using backoff or linear interpolation; a sketch follows below).
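As one possible way to handle the sparseness of trigram tag transitions, the sketch below estimates P(ti | ti-2, ti-1) by linear interpolation of trigram, bigram, and unigram relative frequencies. The interpolation weights and the toy counts are illustrative assumptions, not values from the lecture.

```python
from collections import Counter

# Counts over tag n-grams from a tagged training corpus (toy example here).
unigram_counts = Counter()
bigram_counts = Counter()
trigram_counts = Counter()
total_tags = 0

def count_sentence(tags):
    """Update n-gram counts for one tag sequence (START padding on the left)."""
    global total_tags
    padded = ["START", "START"] + tags
    total_tags += len(tags)
    for i in range(2, len(padded)):
        unigram_counts[padded[i]] += 1
        bigram_counts[(padded[i - 1], padded[i])] += 1
        trigram_counts[(padded[i - 2], padded[i - 1], padded[i])] += 1

def trigram_prob(t, t_prev2, t_prev1, lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolated P(t | t_prev2, t_prev1); the lambdas are assumptions."""
    l3, l2, l1 = lambdas
    tri_ctx = bigram_counts[(t_prev2, t_prev1)]
    bi_ctx = unigram_counts[t_prev1]
    p_tri = trigram_counts[(t_prev2, t_prev1, t)] / tri_ctx if tri_ctx else 0.0
    p_bi = bigram_counts[(t_prev1, t)] / bi_ctx if bi_ctx else 0.0
    p_uni = unigram_counts[t] / total_tags if total_tags else 0.0
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

count_sentence(["NN", "VBZ", "IN", "DT", "NN"])
print(trigram_prob("IN", "NN", "VBZ"))
```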
HMMs as Language Models
• We can also use an HMM as a language model (language generation, MT, …), i.e. evaluate P(w1, …, wn) for a given sentence.
  What is the advantage over a plain word n-gram model?

• Problem: There are many tag sequences that could have generated w1, …, wn.

• This is an example of spurious ambiguity.

• Need to compute:

  P(w1, …, wn) = Σ over all tag sequences t1, …, tn of  P(w1, …, wn, t1, …, tn)
Forward Algorithm
• Input: Sequence of observed words w1, …, wn

• Create a table π, such that each entry π[k,t] contains the sum of the probabilities of all tag/word sequences ending in tag t at time k.

• initialize π[0,START] = 1.0 and π[0,t] = 0.0 for all tags t ∈ T.

• for k = 1 to n:
  • for t ∈ T:
      π[k,t] = Σ over t' ∈ T of  π[k-1,t'] · P(t | t') · P(wk | t)

• return Σ over t ∈ T of π[n,t]
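The forward algorithm differs from Viterbi only in replacing the max by a sum. The sketch below uses the same toy tables as the Viterbi sketch above (again, the transition table is an assumption read off the worked example) and computes the total probability of "time flies like an arrow" under that model.

```python
# Same toy model as the Viterbi sketch (transition probabilities read off the
# worked example; this is an assumption, not a table given in the slides).
transitions = {
    "START": {"VBZ": 0.1, "NN": 0.2},
    "NN":    {"VBZ": 0.6, "NN": 0.2, "IN": 0.2},
    "VBZ":   {"NN": 0.4, "IN": 0.2, "DT": 0.4},
    "IN":    {"DT": 0.7},
    "DT":    {"NN": 1.0},
}
emissions = {
    "VBZ": {"time": 0.2, "flies": 0.3, "like": 0.5},
    "NN":  {"time": 0.3, "flies": 0.2, "arrow": 0.5},
    "IN":  {"like": 1.0},
    "DT":  {"an": 1.0},
}
TAGS = ["VBZ", "IN", "NN", "DT"]

def forward(words):
    """Total probability P(w1, ..., wn): sum over all tag sequences."""
    pi = {"START": 1.0}                       # pi[t] after 0 words
    for word in words:
        new_pi = {}
        for t in TAGS:
            emit = emissions[t].get(word, 0.0)
            # Sum (instead of max, as in Viterbi) over all previous tags.
            new_pi[t] = sum(prev * transitions[t_prev].get(t, 0.0) * emit
                            for t_prev, prev in pi.items())
        pi = new_pi
    return sum(pi.values())

print(forward("time flies like an arrow".split()))   # ~ 0.00128 under this toy model
```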
Named Entity Recognition
as Sequence Labeling
• Use 3 tags:
  • O - outside of a named entity
  • I - inside a named entity
  • B - first word (beginning) of a named entity

     O              O  B        I     O
  … identification of tetronic acid in …

• Other encodings are possible (for example, NE-type specific); a decoding sketch follows below.

• This can also be used for other tasks such as phrase chunking and semantic role labeling.
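To illustrate how B/I/O labels are read back into entity spans, here is a small decoding sketch. The tokens and tags come from the slide's example; the helper function itself is illustrative and not code from the lecture.

```python
def bio_to_spans(tokens, tags):
    """Collect named-entity spans (start, end, text) from B/I/O tags."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":                # a new entity begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":              # continue the current entity
            if start is None:         # tolerate an I without a preceding B
                start = i
        else:                         # "O": outside any entity
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return [(s, e, " ".join(tokens[s:e])) for (s, e) in spans]

tokens = "identification of tetronic acid in".split()
tags = ["O", "O", "B", "I", "O"]
print(bio_to_spans(tokens, tags))     # -> [(2, 4, 'tetronic acid')]
```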
