
Lecture 7: Part of Speech Tagging

Instructor: Jackie CK Cheung & David Adelani


COMP-550
J&M Ch. 8.1–8.3 (1st ed); J&M Ch. 5.1–5.3
(2nd ed); J&M Ch. 8.1–8.4 (3rd ed)
Lecture cancellation
October 2 is cancelled due to NLP Workshop at MILA
• Register to attend online

https://mila.quebec/en/event/workshop-nlp-in-the-era-of-generative-ai-cognitive-sciences-and-societal-transformation
So Far In the Course
Making a single prediction from a sequence
→ text classification
Predicting the sequence itself
→ language modelling

Today:
Making a series of predictions from a sequence, one per token in the sequence
→ sequence labelling
• particular application: part-of-speech tagging
Outline
Parts of speech in English
POS tagging as a sequence labelling problem
Markov chains revisited
Hidden Markov models

Parts of Speech in English
Nouns: restaurant, me, dinner
Verbs: find, eat, is
Adjectives: good, vegetarian
Prepositions: in, of, up, above
Adverbs: quickly, well, very
Determiners: the, a, an
What is a Part of Speech?
A kind of syntactic category that tells you some of the
grammatical properties of a word.

The __________ was delicious.
• Only a noun fits here.

This hamburger is ___________ than that one.
• Only a comparative adjective fits.

The cat ate. (OK – grammatical)
*The cat enjoyed. (Ungrammatical. Note the *.)
Important Note
You may have learned in grade school that nouns =
things, verbs = actions. This is wrong!

Nouns that can be actions or events:
• Examination, wedding, construction, opening

Verbs that are not necessarily actions:
• Be, have, want, enjoy, remember, realize
Penn Treebank Tagset
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition; subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
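To see these tags in practice, here is a minimal sketch using NLTK's off-the-shelf tagger, which outputs Penn Treebank tags. It assumes NLTK is installed and its tokenizer and tagger models have been downloaded (exact model names vary across NLTK versions):

# Sketch: tagging a sentence with Penn Treebank tags via NLTK.
# Assumes prior downloads, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

tokens = nltk.word_tokenize("The cat ate the sad hamburger quickly.")
print(nltk.pos_tag(tokens))
# Roughly: [('The', 'DT'), ('cat', 'NN'), ('ate', 'VBD'), ('the', 'DT'),
#           ('sad', 'JJ'), ('hamburger', 'NN'), ('quickly', 'RB'), ('.', '.')]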
Other Parts of Speech
Modals and auxiliary verbs
• The police can and will catch the fugitives.
• Did the chicken cross the road?
In English, these play an important role in question
formation, and in specifying tense, aspect and mood.
Conjunctions
• and, or, but, yet
They connect and relate elements.
Particles
• look up, turn down
Can be parts of particle verbs. May have other functions
(depending on what you consider a particle).
Classifying Parts of Speech: Open Class
Open classes are parts of speech for which new words
are readily added to the language (neologisms).
• Nouns Twitter, Kleenex, turducken
• Verbs google, photoshop
• Adjectives Pastafarian, sick
• Adverbs automagically
• Interjections D’oh!
• More at https://neologisms.rice.edu/word/browse
Open class words usually convey most of the content.
They tend to be content words.

Closed Class
Closed classes are parts of speech for which new words
tend not to be added.
• Pronouns I, he, she, them, their
• Determiners a, the
• Quantifiers some, all, every
• Conjunctions and, or, but
• Modals and auxiliaries might, should, ought
• Prepositions to, of, from
Closed classes tend to convey grammatical information.
They tend to be function words.

Universal Dependencies Tagset
Open classes:
ADJ    Adjective
ADV    Adverb
INTJ   Interjection
NOUN   Noun
PROPN  Proper noun
VERB   Verb

Closed classes:
ADP    Adposition
AUX    Auxiliary
CCONJ  Coordinating conjunction
DET    Determiner
NUM    Numeral
PART   Particle
PRON   Pronoun
SCONJ  Subordinating conjunction

Other:
PUNCT  Punctuation
SYM    Symbol
X      Other

https://universaldependencies.org/u/pos/index.html
Corpus Differences
How fine-grained do you want your tags to be?
e.g., PTB tagset distinguishes singular from plural nouns
• NN cat, water
• NNS cats

e.g., PTB doesn’t distinguish between intransitive and transitive verbs
• VBD listened (intransitive)
• VBD heard (transitive)

Brown corpus (87 tags) vs. PTB (45)
Language Differences
Languages differ widely in which parts of speech they
have, and in their specific functions and behaviours.
• In Japanese, there is no great distinction between nouns and
pronouns. Pronouns are open class. On the other hand, true verbs
are a closed class.
• I in Japanese: watashi, watakushi, ore, boku, atashi, …
• In Wolof (Niger-Congo language spoken in West Africa),
verbs are not conjugated for person and tense. Instead,
pronouns are.
• maa ngi (1st person, singular, present continuous perfect)
• naa (1st person, singular, past perfect)
• In Salishan languages (in the Pacific Northwest), the
distinction between nouns and verbs is subtle or possibly
non-existent (disputed) (Kinkade, 1983).

Exercise
Give coarse POS tag labels to the following passage:

A Canadian geography nerd has become a bit of a TikTok sensation in
Iceland after he wowed a social media influencer with his detailed
knowledge of the country.
POS Tagging
Assume we have a tagset and a corpus with words
labelled with POS tags. What kind of problem is this?
Supervised or unsupervised?
Classification or regression?

Difference from classification that we saw last class: context matters!
• I saw the …
• The team won the match …
• Several cats …
Sequence Labelling
Predict labels for an entire sequence of inputs:
?      ?      ? ?  ?     ?   ? ?    ?    ?   ?
Pierre Vinken , 61 years old , will join the board …

NNP    NNP    , CD NNS   JJ  , MD   VB   DT  NN
Pierre Vinken , 61 years old , will join the board …

Must consider:
• Current word
• Previous context
Markov Chains
Our model will assume an underlying Markov process
that generates the POS tags and words.
You’ve already seen Markov processes:
• Morphology: transitions between morphemes that make
up a word
• N-gram models: transitions between words that make up
a sentence
In other words, they are highly related to finite state
automata

Observable Markov Model
• N states that represent unique observations about the world.
• Transitions between states are weighted; the weights of all
outgoing edges from a state sum to 1.
• e.g., this is a bigram model of the text "the car of ants ran":
[Diagram: a chain of word states (the, car, of, ants, ran) with
weighted transition edges]
• What would a trigram model look like?
Unrolling the Timesteps
A walk along the states in the Markov chain generates
the text that is observed:

the car of ants ran

The probability of the observation is the product of all the edge
weights (i.e., transition probabilities).
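As a small sketch of this computation (the transition probabilities below are invented for illustration only):

# Probability of "the car of ants ran" under a bigram Markov chain,
# with hypothetical transition probabilities.
transition = {
    ("<s>", "the"): 0.1, ("the", "car"): 0.02, ("car", "of"): 0.05,
    ("of", "ants"): 0.001, ("ants", "ran"): 0.01,
}
words = ["<s>", "the", "car", "of", "ants", "ran"]
prob = 1.0
for prev, curr in zip(words, words[1:]):
    prob *= transition[(prev, curr)]  # multiply edge weights along the walk
print(prob)  # ≈ 1e-09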

Hidden Variables
The POS tags to be predicted are hidden variables. We
don’t see them during test time (and sometimes not
during training either).
It is very common to have hidden phenomena:
• Encrypted symbols are outputs of hidden messages
• Genes are outputs of functional relationships
• Weather is the output of hidden climate conditions
• Stock prices are the output of market conditions
• …

Markov Process w/ Hidden Variables
Model transitions between POS tags, and outputs
(“emits”) a word which is observed at each timestep.

[Diagram: HMM fragment with hidden tag states VB, NN, DT, JJ,
weighted transitions between them (e.g., 0.7, 0.27, 0.04), and
per-state emission probabilities:
VB → be 0.15, have 0.07, do 0.04, …
NN → thing 0.03, stuff 0.015, market 0.006, …
DT → the 0.55, a 0.35, an 0.05, …
JJ → good 0.06, bad 0.35, …]
Unrolling the Timesteps
Now, the sample looks something like this:

DT  NN  IN NNS  VBD
the car of ants ran
Probability of a Sequence
Suppose we know both the sequence of POS tags and
words generated by them:
P(The/DT car/NN of/IN ants/NNS ran/VBD)
= P(DT)                               initial
× P(DT → The)                         emission
× P(DT → NN) × P(NN → car)            transition, emission
× P(NN → IN) × P(IN → of)             transition, emission
× P(IN → NNS) × P(NNS → ants)         transition, emission
× P(NNS → VBD) × P(VBD → ran)         transition, emission

• Product of hidden state transitions and observation emissions
• Note independence assumptions
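As a quick numeric sketch of this product (every probability below is made up for illustration, not estimated from any corpus):

# Hypothetical values for P(The/DT car/NN of/IN ants/NNS ran/VBD)
p = (0.4            # P(DT)          initial probability
     * 0.6          # P(DT -> The)   emission
     * 0.5 * 0.01   # P(DT -> NN)    transition; P(NN -> car)   emission
     * 0.2 * 0.3    # P(NN -> IN)    transition; P(IN -> of)    emission
     * 0.3 * 0.001  # P(IN -> NNS)   transition; P(NNS -> ants) emission
     * 0.4 * 0.02)  # P(NNS -> VBD)  transition; P(VBD -> ran)  emission
print(p)  # ≈ 1.7e-10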

Graphical Models
Since we now have many random variables, it helps to
visualize them graphically. Graphical models precisely
tell us:
• Latent or hidden random variables (clear):
Q_t, with P(Q_t = VB): probability that the t-th tag is VB
• Observed random variables (filled):
O_t, with P(O_t = ants): probability that the t-th word is ants
• Conditional independence assumptions (the edges)
Hidden Markov Models
Graphical representation

Q_1 → Q_2 → Q_3 → Q_4 → Q_5
 ↓     ↓     ↓     ↓     ↓
O_1   O_2   O_3   O_4   O_5

Denote the entire sequence of tags as Q, and the entire sequence of
words as O.
Decomposing the Joint Probability
Graph specifies how the joint probability decomposes:

Q_1 → Q_2 → Q_3 → Q_4 → Q_5
 ↓     ↓     ↓     ↓     ↓
O_1   O_2   O_3   O_4   O_5

P(O, Q) = P(Q_1) × ∏_{t=1}^{T−1} P(Q_{t+1} | Q_t) × ∏_{t=1}^{T} P(O_t | Q_t)

• P(Q_1): initial state probability
• P(Q_{t+1} | Q_t): state transition probabilities
• P(O_t | Q_t): emission probabilities
Model Parameters
Let there be N possible tags and W possible words.
The parameters θ have three components:
1. Initial probabilities for Q_1:
Π = {π_1, π_2, …, π_N} (categorical)
2. Transition probabilities for Q_t to Q_{t+1}:
A = {a_ij}, i, j ∈ [1, N] (categorical)
3. Emission probabilities for Q_t to O_t:
B = {b_i(w_k)}, i ∈ [1, N], k ∈ [1, W] (categorical)

How many distributions and values of each type are there?
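A minimal sketch of the parameter shapes, with hypothetical sizes, which also answers the question: one initial distribution with N values, N transition distributions with N values each, and N emission distributions with W values each.

import numpy as np

N, W = 45, 10000      # hypothetical: PTB-sized tagset, 10k-word vocabulary
pi = np.zeros(N)      # Π: 1 initial distribution over tags (N values)
A = np.zeros((N, N))  # A: N transition distributions, N values each (N*N)
B = np.zeros((N, W))  # B: N emission distributions, W values each (N*W)
# pi and each row of A and B must sum to 1 to be valid categoricals.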

Training an HMM POS Tagger
Suppose that we have a labelled corpus of words with
their POS tags.
Supervised training possible using techniques that we
learned for N-gram language models!
• Initial probability distribution: look at the POS tags in the
first word of each sentence
• Transition probability distributions: look at transitions of
POS tags that are seen in the training corpus
• Emission probability distributions: look at emissions of
words from each POS tag in the training corpus

Supervised Estimation of Parameters
Recall the MLE for categorical distributions:

P(outcome i) = #(outcome i) / #(all events)

For our parameters:

π_i = P(Q_1 = i) = #(Q_1 = i) / #(sentences)

a_ij = P(Q_{t+1} = j | Q_t = i) = #(i, j) / #(i)

b_ik = P(O_t = k | Q_t = i) = #(word k, tag i) / #(i)

Previous discussions about smoothing and OOV items also apply here.
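A minimal sketch of this supervised estimator (count-based MLE with no smoothing or OOV handling, so not what you would deploy as-is):

from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    # MLE estimates from sentences of (word, tag) pairs; no smoothing.
    init = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        init[tags[0]] += 1                      # POS tag of the first word
        for prev, curr in zip(tags, tags[1:]):  # observed tag bigrams
            trans[prev][curr] += 1
        for word, tag in sent:                  # observed emissions
            emit[tag][word] += 1
    pi = {t: c / len(tagged_sentences) for t, c in init.items()}
    A = {i: {j: c / sum(row.values()) for j, c in row.items()}
         for i, row in trans.items()}
    B = {i: {w: c / sum(row.values()) for w, c in row.items()}
         for i, row in emit.items()}
    return pi, A, B

# Toy usage (a made-up one-sentence corpus):
pi, A, B = train_hmm([[("the", "DT"), ("cat", "NN"), ("sat", "VBD")]])
print(pi["DT"], A["DT"]["NN"], B["NN"]["cat"])  # 1.0 1.0 1.0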
Exercise in Supervised Training
What are the MLE for the following training corpus?
• Give the initial probability distribution, and the transition
and emission distributions from the DT and VBD tags.
DT  NN  VBD IN DT  NN
the cat sat on the mat

DT  NN  VBD JJ
the cat was sad

RB VBD DT  NN
so was the mat

DT  JJ  NN  VBD IN DT  JJ  NN
the sad cat was on the sad mat
Inference with HMMs
Now that we have a model, how do we actually tag a
new sentence?
• Suppose that for each word, we just found the most likely
POS tag that emitted it. What is the problem with this?
• Need a way to find the best POS tag sequence (and we
need to define what best means).

Other questions: What about unsupervised and semi-supervised
learning?
Questions for an HMM
1. Compute the likelihood of a sequence of observations, P(O | θ)
→ forward algorithm, backward algorithm

2. What state sequence best explains a sequence of observations?
argmax_Q P(Q, O | θ)
→ Viterbi algorithm

3. Given an observation sequence (without labels), what is the best
model for it?
→ forward-backward algorithm,
a.k.a. Baum-Welch algorithm,
a.k.a. Expectation Maximization
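For question 2, here is a minimal Viterbi sketch over dense numpy parameters. It assumes words and tags have already been mapped to integer indices (an assumption, not part of the lecture); a practical tagger would add smoothing and OOV handling. Working in log space avoids underflow from multiplying many small probabilities.

import numpy as np

def viterbi(obs, pi, A, B):
    # Most likely tag sequence: argmax_Q P(Q, O | theta).
    # obs: word indices; pi: (N,); A: (N, N) transitions; B: (N, W) emissions.
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best log-prob of a path ending in state j at time t
    back = np.zeros((T, N), dtype=int)  # backpointers for path recovery
    with np.errstate(divide="ignore"):  # log(0) -> -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)         # best predecessor for each state j
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]            # best final state
    for t in range(T - 1, 0, -1):               # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]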
