Week 9
Introduction
Part-of-speech tagging
Named entity recognition
Introduction
NLP crosses the areas of linguistics, computer science,
and artificial intelligence.
In linguistics, there are 8 parts of speech (POS) attributed to
Dionysius Thrax of Alexandria (c. 1st C. BCE):
noun, verb, pronoun, preposition, adverb, conjunction,
participle, article
Part-of-speech (POS) tagging is the procedure of
marking up a word in a text (corpus) as corresponding to a
particular POS, based on both its definition and its context.
POS tagging is useful for
Parsing: POS tagging can improve syntactic parsing
MT: reordering of adjectives and nouns (say from Spanish
to English)
Sentiment or affective tasks: may want to distinguish
adjectives or other POS
Text-to-speech (how do we pronounce “lead” or “object”?)
Or linguistic or language-analytic computational tasks
Need to control for POS when studying linguistic change
like creation of new words, or meaning shift
Or control for POS in measuring meaning similarity or
differences
Two classes of words: Open vs. Closed
Open class words
Usually content words: Nouns, Verbs, Adjectives,
Adverbs
Plus interjections: oh, ouch, uh-huh, yes, hello
New nouns and verbs like iPhone or to fax
Closed class words
Usually function words: short, frequent words with
grammatical function
determiners: a, an, the
pronouns: she, he, I
prepositions: on, under, over, near, by, …
Open class ("content") words
Nouns Verbs Adjectives old green tasty
Part-of-Speech Tagging
Assigning a part-of-speech to each word in a text.
Map from a sequence x1, …, xn of words to a sequence y1, …, yn of POS tags
The Penn Treebank part-of-speech tags
Sample "Tagged" English sentences
There/PRO were/VERB 70/NUM children/NOUN
there/ADV ./PUNC
Preliminary/ADJ findings/NOUN were/AUX
reported/VERB in/ADP today/NOUN ’s/PART
New/PROPN England/PROPN Journal/PROPN
of/ADP Medicine/PROPN
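As a concrete sketch (illustrative Python, not from the slides), the first sample sentence written as the mapping from the word sequence x1, …, xn to the tag sequence y1, …, yn:

```python
# Illustrative only: the mapping x1, ..., xn (words) -> y1, ..., yn (tags)
# for the first sample sentence above.
words = ["There", "were", "70", "children", "there", "."]
tags = ["PRO", "VERB", "NUM", "NOUN", "ADV", "PUNC"]

tagged = list(zip(words, tags))  # [('There', 'PRO'), ('were', 'VERB'), ...]
print(" ".join(f"{w}/{t}" for w, t in tagged))
```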
Markov chain
HMM
Viterbi POS tagging algorithm
Markov Chain
Consider a sequence of state variables q1, q2, …, qi.
A Markov model embodies the Markov assumption on the probabilities of this sequence:
P(qi | q1, …, qi−1) = P(qi | qi−1)
(the transition probability between states depends only on the previous state)
[Figure: Markov chain over the weather states hot, warm and cold, with transition probabilities such as 0.8, 0.6, 0.3 and 0.1 on the arcs]
A Markov chain is specified by the following components:
a set of N states q1, q2, …, qN
a transition probability matrix A = [aij], each aij representing the probability of moving from state i to state j
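As a small illustration, the probability of a state sequence is the product of transition probabilities under the Markov assumption. The Python sketch below uses placeholder transition values chosen to resemble the hot/warm/cold diagram; they are assumptions, not the exact figures from the slide.

```python
# A minimal sketch: scoring a state sequence under a Markov chain.
# The transition values are illustrative placeholders, not the slide's exact figures.
states = ["hot", "warm", "cold"]

# A[i][j] = P(next state = j | current state = i); each row sums to 1.
A = {
    "hot":  {"hot": 0.6, "warm": 0.3, "cold": 0.1},
    "warm": {"hot": 0.3, "warm": 0.6, "cold": 0.1},
    "cold": {"hot": 0.1, "warm": 0.1, "cold": 0.8},
}

def sequence_probability(seq, start_probs):
    """P(q1, ..., qn) = P(q1) * product of P(qi | qi-1), by the Markov assumption."""
    p = start_probs[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

# e.g. with a uniform start distribution over the three states:
uniform = {s: 1 / len(states) for s in states}
print(sequence_probability(["hot", "hot", "warm"], uniform))  # (1/3) * 0.6 * 0.3
```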
Hidden Markov Model
A Markov chain is useful when we need to compute a
probability for a sequence of observable events.
In many cases, however, the events we are interested in are
hidden.
For example, we don’t normally observe part-of-speech tags
in a text. Rather, we see words, and must infer the tags from
the word sequence.
A hidden Markov model (HMM) allows us to talk about both
observed events (like words that we see in the input) and
hidden events (like part-of-speech tags) that we think of as
causal factors in our probabilistic model.
Hidden Markov Model
An HMM is specified by the following components:
a set of N states q1, q2, …, qN
a transition probability matrix A = [aij], each aij representing the probability of moving from state i to state j
a sequence of T observations o1, o2, …, oT, each one drawn from a vocabulary V = v1, v2, …, vV
a matrix B of emission probabilities, each expressing the probability of an observation being generated from a state
We still have the Markov assumption on the probabilities of the tag sequence:
P(ti | t1, …, ti−1) = P(ti | ti−1)
(the transition probability between states (tags) depends only on the previous state)
The second assumption is:
P(wi | t1, …, ti, w1, …, wi−1) = P(wi | ti)
(the emission probability of a word depends only on its tag)
We are given two matrices:
A: transition probabilities
B: emission probabilities
Both can be estimated by counting over a tagged corpus:
A: transition probabilities  P(ti | ti−1) = C(ti−1, ti) / C(ti−1)
B: emission probabilities  P(wi | ti) = C(ti, wi) / C(ti)
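The counting can be shown as a short sketch (illustrative Python; the tiny tagged corpus below is invented for the example, not from the slides):

```python
# A minimal sketch (illustrative only): estimating the transition matrix A
# and emission matrix B by counting over a tiny hand-made tagged corpus.
from collections import Counter

# Toy corpus of (word, tag) sentences; invented for illustration.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

prev_tag_count = Counter()    # C(t_{i-1}), including the start symbol <s>
transition_count = Counter()  # C(t_{i-1}, t_i)
tag_count = Counter()         # C(t_i)
emission_count = Counter()    # C(t_i, w_i)

for sentence in corpus:
    tags = ["<s>"] + [t for _, t in sentence]
    for prev, cur in zip(tags, tags[1:]):
        transition_count[(prev, cur)] += 1
        prev_tag_count[prev] += 1
    for word, tag in sentence:
        emission_count[(tag, word)] += 1
        tag_count[tag] += 1

def transition_prob(prev, cur):
    """P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})"""
    return transition_count[(prev, cur)] / prev_tag_count[prev]

def emission_prob(word, tag):
    """P(w_i | t_i) = C(t_i, w_i) / C(t_i)"""
    return emission_count[(tag, word)] / tag_count[tag]

print(transition_prob("DET", "NOUN"))  # 1.0 in this toy corpus
print(emission_prob("dog", "NOUN"))    # 0.5 in this toy corpus
```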
How?
[Figure: tagging "Janet will back the bill" — hidden tag states (Noun, Aux, Verb, Det, Noun) connected by transition probabilities, with emission arcs down to the observed words Janet, will, back, the, bill]
[Worked example: filling the Viterbi trellis for "Janet will back the bill". Each cell takes the maximum over candidate previous tags of (previous Viterbi value × transition probability × emission probability), e.g. VB: 0.000028 × 0.0009, NN: 0.000200 × 0.0584, …, RB: 0.010446 × 0.1698 = 0.0017737308 = max]
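The same procedure can be written compactly. Below is a minimal Viterbi sketch in Python (illustrative, not the module's code); trans and emit stand for the A and B matrices estimated above, and "<s>" is an assumed start-of-sentence state.

```python
# A minimal max-product Viterbi sketch. trans[prev][tag] and emit[tag][word]
# hold the A and B probabilities; "<s>" is an assumed start-of-sentence state.
def viterbi(words, tags, trans, emit):
    # best[i][tag] = (probability of the best path ending in tag at word i, backpointer)
    best = [{} for _ in words]
    for tag in tags:
        best[0][tag] = (trans["<s>"][tag] * emit[tag].get(words[0], 0.0), None)
    for i in range(1, len(words)):
        for tag in tags:
            prob, prev = max(
                (best[i - 1][p][0] * trans[p][tag] * emit[tag].get(words[i], 0.0), p)
                for p in tags
            )
            best[i][tag] = (prob, prev)
    # Trace the backpointers from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

With A and B estimated from a tagged corpus, calling viterbi(["Janet", "will", "back", "the", "bill"], tagset, A, B) would return the most probable tag sequence for the sentence.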
In-class exercise: complete finding the tags for “the bill”.
Notes on exam-style question:
You will be given a short sentence and two probability matrices; return the tagging.
To simplify, negative log probabilities are used.
Instead of the Viterbi value v = max (vprev × a × b),
we track the cost c = min (cprev + a′ + b′),
where c = −log v, a′ = −log a and b′ = −log b.
This makes it convenient to have matrices with all positive values and to use additions.
Exam-style question:
Consider a sentence “word1 word2 word3”. The following matrices of the Hidden Markov Model are given as negative log probabilities of (i) transition and (ii) emission respectively. Show your working steps of constructing the Viterbi path and the tags.

(i) transition (−log P)
       NNP  MD  VB  NN
<s>     12   3   4   5
NNP     18   4   3   6
MD       6   5   4   2
VB      13   5   7   8
NN       4   3   4   4

(ii) emission (−log P)
       word1  word2  word3
NNP       16     16      3
MD         2     18     18
VB        18     18      2
NN        18      7     18

Abbreviations: NNP: Proper noun, MD: Modal, VB: Verb, NN: Noun
Answer:
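The matrices are the same as in the question. One way to show the working is a minimal min-sum Viterbi sketch (Python, written here for illustration): costs are added along a path and the smallest running total is kept per tag. With the values above the sketch selects the path MD, NN, VB with a total cost of 20.

```python
# A minimal min-sum Viterbi sketch over the negative-log matrices from the
# question. Costs are added; the smallest total per tag is kept.
TAGS = ["NNP", "MD", "VB", "NN"]
WORDS = ["word1", "word2", "word3"]

# (i) transition costs -log P(tag | prev); rows are the previous state.
TRANS = {
    "<s>": {"NNP": 12, "MD": 3, "VB": 4, "NN": 5},
    "NNP": {"NNP": 18, "MD": 4, "VB": 3, "NN": 6},
    "MD":  {"NNP": 6,  "MD": 5, "VB": 4, "NN": 2},
    "VB":  {"NNP": 13, "MD": 5, "VB": 7, "NN": 8},
    "NN":  {"NNP": 4,  "MD": 3, "VB": 4, "NN": 4},
}
# (ii) emission costs -log P(word | tag)
EMIT = {
    "NNP": {"word1": 16, "word2": 16, "word3": 3},
    "MD":  {"word1": 2,  "word2": 18, "word3": 18},
    "VB":  {"word1": 18, "word2": 18, "word3": 2},
    "NN":  {"word1": 18, "word2": 7,  "word3": 18},
}

# cost[i][tag] = (best total cost of a path ending in tag at word i, backpointer)
cost = [{} for _ in WORDS]
for t in TAGS:
    cost[0][t] = (TRANS["<s>"][t] + EMIT[t][WORDS[0]], None)
for i in range(1, len(WORDS)):
    for t in TAGS:
        c, prev = min((cost[i - 1][p][0] + TRANS[p][t], p) for p in TAGS)
        cost[i][t] = (c + EMIT[t][WORDS[i]], prev)

last = min(TAGS, key=lambda t: cost[-1][t][0])
path = [last]
for i in range(len(WORDS) - 1, 0, -1):
    path.append(cost[i][path[-1]][1])
path.reverse()
print(path, cost[-1][last][0])  # ['MD', 'NN', 'VB'] 20 with these matrices
```

Tracking the tag that achieved each minimum (the backpointer) is what allows the Viterbi path to be read off at the end.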