lecture7-pos-tagging
https://mila.quebec/en/event/workshop-nlp-in-the-era-of-generative-ai-cognitive-sciences-and-societal-transformation
So Far In the Course
Making a single prediction from a sequence
→ text classification
Predicting the sequence itself
→ language modelling
Today:
Making a series of predictions from a sequence, one
per token in the sequence
→ sequence labelling
Particular application: part-of-speech tagging
Outline
Parts of speech in English
POS tagging as a sequence labelling problem
Markov chains revisited
Hidden Markov models
Parts of Speech in English
Nouns: restaurant, me, dinner
Verbs: find, eat, is
Adjectives: good, vegetarian
Prepositions: in, of, up, above
Adverbs: quickly, well, very
Determiners: the, a, an
What is a Part of Speech?
A kind of syntactic category that tells you some of the
grammatical properties of a word.
Important Note
You may have learned in grade school that nouns =
things and verbs = actions. This is wrong! Parts of speech
are defined by a word's grammatical behaviour, not by its
meaning: destruction names an action, but it is a noun.
Penn Treebank Tagset
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
Other Parts of Speech
Modals and auxiliary verbs
• The police can and will catch the fugitives.
• Did the chicken cross the road?
In English, these play an important role in question
formation, and in specifying tense, aspect and mood.
Conjunctions
• and, or, but, yet
They connect and relate elements.
Particles
• look up, turn down
Can be parts of particle verbs. May have other functions
(depending on what you consider a particle).
Classifying Parts of Speech: Open Class
Open classes are parts of speech for which new words
are readily added to the language (neologisms).
• Nouns Twitter, Kleenex, turducken
• Verbs google, photoshop
• Adjectives Pastafarian, sick
• Adverbs automagically
• Interjections D’oh!
• More at https://neologisms.rice.edu/word/browse
Open class words usually convey most of the content.
They tend to be content words.
Closed Class
Closed classes are parts of speech for which new words
tend not to be added.
• Pronouns I, he, she, them, their
• Determiners a, the
• Quantifiers some, all, every
• Conjunctions and, or, but
• Modals and auxiliaries might, should, ought
• Prepositions to, of, from
Closed classes tend to convey grammatical information.
They tend to be function words.
Universal Dependencies Tagset

Open classes:
ADJ Adjective
ADV Adverb
INTJ Interjection
NOUN Noun
PROPN Proper noun
VERB Verb

Closed classes:
ADP Adposition
AUX Auxiliary
CCONJ Coordinating conjunction
DET Determiner
NUM Numeral
PART Particle
PRON Pronoun
SCONJ Subordinating conjunction

Other:
PUNCT Punctuation
SYM Symbol
X Other

https://universaldependencies.org/u/pos/index.html
Corpus Differences
How fine-grained do you want your tags to be?
e.g., PTB tagset distinguishes singular from plural nouns
• NN cat, water
• NNS cats
Language Differences
Languages differ widely in which parts of speech they
have, and in their specific functions and behaviours.
• In Japanese, there is no great distinction between nouns
and pronouns. Pronouns are an open class; on the other
hand, true verbs are a closed class.
• I in Japanese: watashi, watakushi, ore, boku, atashi, …
• In Wolof (Niger-Congo language spoken in West Africa),
verbs are not conjugated for person and tense. Instead,
pronouns are.
• maa ngi (1st person, singular, present continuous perfect)
• naa (1st person, singular, past perfect)
• In Salishan languages (in the Pacific Northwest), the
distinction between nouns and verbs is subtle or possibly
non-existent (disputed) (Kinkade, 1983).
Exercise
Give coarse POS tag labels to the following passage:
[passage shown on the slide]
POS Tagging
Assume we have a tagset and a corpus with words
labelled with POS tags. What kind of problem is this?
Supervised or unsupervised?
Classification or regression?
Sequence Labelling
Predict labels for an entire sequence of inputs:
? ? ? ? ? ? ? ? ? ? ?
Pierre Vinken , 61 years old , will join the board …
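(For reference, the Penn Treebank labels for these eleven tokens are
NNP NNP , CD NNS JJ , MD VB DT NN.)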
Markov Chains
Our model will assume an underlying Markov process
that generates the POS tags and words.
You’ve already seen Markov processes:
• Morphology: transitions between morphemes that make
up a word
• N-gram models: transitions between words that make up
a sentence
In other words, they are closely related to finite-state
automata.
Observable Markov Model
• N states that represent unique observations about the world
[diagram: a state-transition graph over such states, e.g., a "car" state]
Unrolling the Timesteps
A walk along the states in the Markov chain generates
the text that is observed:
[diagram: the chain unrolled over timesteps, emitting one word per step]
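To make this concrete, here is a minimal Python sketch of sampling a word
sequence by walking a Markov chain. The states and transition probabilities
are invented for illustration; they are not from the lecture.

```python
import random

# Hypothetical word-to-word transition probabilities (a toy Markov chain).
transitions = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"car": 0.5, "cat": 0.5},
    "a":   {"car": 0.6, "cat": 0.4},
    "car": {"ran": 0.8, "the": 0.2},
    "cat": {"ran": 1.0},
    "ran": {"</s>": 1.0},
}

def sample_walk(transitions, start="<s>", end="</s>"):
    """Walk the chain from start to end, collecting the observed words."""
    state, words = start, []
    while state != end:
        state = random.choices(list(transitions[state]),
                               weights=transitions[state].values())[0]
        if state != end:
            words.append(state)
    return " ".join(words)

print(sample_walk(transitions))  # e.g., "the car ran"
```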
Hidden Variables
The POS tags to be predicted are hidden variables. We
don’t see them during test time (and sometimes not
during training either).
It is very common to have hidden phenomena:
• Encrypted symbols are outputs of hidden messages
• Genes are outputs of functional relationships
• Weather is the output of hidden climate conditions
• Stock prices are the output of market conditions
• …
Markov Process w/ Hidden Variables
Model transitions between POS tags, and outputs
(“emits”) a word which is observed at each timestep.
[diagram: HMM states VB, NN, DT, JJ, with transition probabilities on the
arcs (the visible values are 0.7, 0.27, 0.04) and an emission distribution
attached to each state:]
VB: be 0.15, have 0.07, do 0.04, …
NN: thing 0.03, stuff 0.015, market 0.006, …
DT: the 0.55, a 0.35, an 0.05, …
JJ: good 0.06, bad 0.35, …
Unrolling the Timesteps
Now, the sample looks something like this:
DT NN IN NNS VBD
The car of ants ran
Probability of a Sequence
Suppose we know both the sequence of POS tags and
words generated by them:
P(The/DT car/NN of/IN ants/NNS ran/VBD)
= P(DT)            initial
× P(The | DT)      emit
× P(NN | DT)       trans
× P(car | NN)      emit
× P(IN | NN)       trans
× P(of | IN)       emit
× P(NNS | IN)      trans
× P(ants | NNS)    emit
× P(VBD | NNS)     trans
× P(ran | VBD)     emit
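As a sketch, this product can be computed directly from probability tables
in Python. All numbers below are hypothetical except DT → the (0.55), which
echoes the earlier diagram.

```python
# Hypothetical HMM parameters (only DT -> "the" = 0.55 comes from the
# earlier slide; everything else is made up for illustration).
initial = {"DT": 0.5}                                  # P(Q_1)
trans = {("DT", "NN"): 0.7, ("NN", "IN"): 0.2,         # P(Q_t | Q_{t-1})
         ("IN", "NNS"): 0.3, ("NNS", "VBD"): 0.4}
emit = {("DT", "the"): 0.55, ("NN", "car"): 0.01,      # P(O_t | Q_t)
        ("IN", "of"): 0.3, ("NNS", "ants"): 0.001,
        ("VBD", "ran"): 0.02}

tags = ["DT", "NN", "IN", "NNS", "VBD"]
words = ["the", "car", "of", "ants", "ran"]

# Alternate transition and emission factors, as in the derivation above.
p = initial[tags[0]] * emit[(tags[0], words[0])]
for t in range(1, len(tags)):
    p *= trans[(tags[t - 1], tags[t])] * emit[(tags[t], words[t])]
print(p)
```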
Graphical Models
Since we now have many random variables, it helps to
visualize them graphically. Graphical models precisely
tell us:
• which random variables are latent or hidden (drawn clear)
• which random variables are observed (drawn shaded)
• the dependencies among the variables (the edges)
Hidden Markov Models
Graphical representation
Q_1 → Q_2 → Q_3 → Q_4 → Q_5
 ↓     ↓     ↓     ↓     ↓
O_1   O_2   O_3   O_4   O_5
Decomposing the Joint Probability
Graph specifies how the joint probability decomposes:

Q_1 → Q_2 → Q_3 → Q_4 → Q_5
 ↓     ↓     ↓     ↓     ↓
O_1   O_2   O_3   O_4   O_5

P(Q, O) = P(Q_1) P(O_1 | Q_1) × ∏_{t=2}^{T} P(Q_t | Q_{t-1}) P(O_t | Q_t)
Training a HMM POS Tagger
Suppose that we have a labelled corpus of words with
their POS tags.
Supervised training is possible using the techniques we
learned for N-gram language models!
• Initial probability distribution: look at the POS tag of the
first word of each sentence
• Transition probability distributions: look at transitions of
POS tags that are seen in the training corpus
• Emission probability distributions: look at emissions of
words from each POS tag in the training corpus
Supervised Estimation of Parameters
Recall the MLE for categorical distributions:

P(outcome i) = #(outcome i) / #(all events)

For our parameters:

π_i = P(Q_1 = i) = #(sentences whose first tag is i) / #(sentences)
DT NN VBD JJ
the cat was sad
RB VBD DT NN
so was the mat
DT JJ NN VBD IN DT JJ NN
the sad cat was on the sad mat
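Here is a minimal Python sketch of these counts on the three-sentence
corpus above (the corpus is from the slide; the printed values are just
spot checks of the MLE formulas).

```python
from collections import Counter

# The tagged corpus from the slide, as (word, tag) pairs.
corpus = [
    [("the", "DT"), ("cat", "NN"), ("was", "VBD"), ("sad", "JJ")],
    [("so", "RB"), ("was", "VBD"), ("the", "DT"), ("mat", "NN")],
    [("the", "DT"), ("sad", "JJ"), ("cat", "NN"), ("was", "VBD"),
     ("on", "IN"), ("the", "DT"), ("sad", "JJ"), ("mat", "NN")],
]

init_counts, trans_counts = Counter(), Counter()
emit_counts, tag_counts = Counter(), Counter()
for sent in corpus:
    tags = [tag for _, tag in sent]
    init_counts[tags[0]] += 1                  # tag of the first word
    trans_counts.update(zip(tags, tags[1:]))   # adjacent tag pairs
    for word, tag in sent:
        emit_counts[(tag, word)] += 1
        tag_counts[tag] += 1

# Initial distribution: pi_DT = 2/3, pi_RB = 1/3.
pi = {t: c / len(corpus) for t, c in init_counts.items()}
print(pi)

# Transition MLE, e.g. P(NN | DT) = #(DT -> NN) / #(DT -> anything) = 2/4.
dt_out = sum(c for (prev, _), c in trans_counts.items() if prev == "DT")
print(trans_counts[("DT", "NN")] / dt_out)

# Emission MLE, e.g. P(cat | NN) = #(NN emits "cat") / #(NN) = 2/4.
print(emit_counts[("NN", "cat")] / tag_counts["NN"])
```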
Inference with HMMs
Now that we have a model, how do we actually tag a
new sentence?
• Suppose that for each word, we just found the most likely
POS tag that emitted it. What is the problem with this?
• Need a way to find the best POS tag sequence (and we
need to define what best means).
Questions for an HMM
1. Compute the likelihood of a sequence of observations,
P(O | θ)
→ forward algorithm, backward algorithm
2. What state sequence best explains a sequence of
observations?
argmax_Q P(Q, O | θ)
→ Viterbi algorithm
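As a preview, here is a minimal Python sketch of the Viterbi algorithm
over the dict-based tables used in the earlier sketches. It is illustrative
only; the names and table format are assumptions, not the lecture's
reference implementation.

```python
import math

def viterbi(words, tags, initial, trans, emit):
    """Find argmax_Q P(Q, O | theta) by dynamic programming.
    Probabilities live in dicts; missing entries count as zero."""
    def lp(p):  # log-probability, with log(0) = -inf
        return math.log(p) if p > 0 else float("-inf")

    # best[t]: log-prob of the best path for the first word, ending in tag t.
    best = {t: lp(initial.get(t, 0.0)) + lp(emit.get((t, words[0]), 0.0))
            for t in tags}
    backptrs = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            # Best previous tag from which to transition into t.
            prev = max(tags, key=lambda q: best[q] + lp(trans.get((q, t), 0.0)))
            pointers[t] = prev
            scores[t] = (best[prev] + lp(trans.get((prev, t), 0.0))
                         + lp(emit.get((t, w), 0.0)))
        best = scores
        backptrs.append(pointers)

    # Trace back-pointers from the best final tag.
    tag = max(best, key=best.get)
    path = [tag]
    for pointers in reversed(backptrs):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# e.g., with the hypothetical tables from the earlier sketch:
# viterbi(["the", "car", "of", "ants", "ran"],
#         ["DT", "NN", "IN", "NNS", "VBD"], initial, trans, emit)
# -> ['DT', 'NN', 'IN', 'NNS', 'VBD']
```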