
Hidden Markov Models and POS Tagging
Natalie Parde
UIC CS 421
Sequence Modeling and Sequence Labeling
• In general: assigning labels to individual tokens or spans of tokens given a longer string of input
• Example: The students were excited about the lecture.
  (article noun verb adjective preposition article noun)

Sequence Labeling
• Objective: Find the label for the next item, based on the labels of other items in the sequence.
  Give me a break!  (verb pronoun determiner noun)
  Did the window break?  (verb determiner noun verb)


Why perform sequence labeling?
• In document-level text classification, models assume that the individual datapoints being classified are disconnected and independent
• Many NLP problems do not satisfy this assumption! Instead, they involve interconnected decisions:
  • each of which is mutually dependent
  • each of which resolves different ambiguities
Example Sequence Labeling Applications
• Named entity recognition
  Natalie Parde [person] works at the University of Illinois at Chicago [organization] and lives in Chicago, Illinois [location].
• Semantic role labeling
  Natalie [agent] drove for 15 hours from Dallas [source] to Chicago [destination] in her hail-damaged Honda Accord [instrument].


This Week's Topics
• Tuesday: Hidden Markov Models, Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging


Probabilistic Sequence Models
• We can perform multiple, interdependent classifications to address a greater problem
using probabilistic sequence models
• These models can be neural networks, but they can also be lighter-weight alternatives
closer to finite state automata known as hidden Markov models
• Hidden Markov models are probabilistic generative models for sequences that make
predictions based on an underlying set of hidden states

Natalie Parde - UIC CS 421 8


What are Markov Models?
• Finite state automata with probabilistic state transitions
• Markov Property: The future is independent of the past, given the present.
  • In other words, the next state only depends on the current state …it is independent of previous history.
• Also referred to as Markov Chains
Sample Markov Model
[Figure: a Markov chain over states q0-q4, with transition probabilities labeling the arcs; the same values appear in the transition matrix shown later (e.g., q0→q1 = .7, q0→q2 = .1, q0→q3 = .2).]


Sample Markov Model
Using the transition probabilities in the diagram, the probability of the state sequence q3 q2 q1 q4 (starting from q0) is:
P(q3 q2 q1 q4) = P(q3|q0) * P(q2|q3) * P(q1|q2) * P(q4|q1) = .2 * .1 * .2 * .3 = .0012


Hidden Markov Models
• Markov models that assume an underlying set of hidden (unobserved) states in which the model can be at any given time
• Assume probabilistic transitions between states over time
• Assume probabilistic generation of items (e.g., tokens) from states
Formal Definition

• A Hidden Markov Model can be specified by enumerating the following properties:


• The set of states, Q
• A sequence of observation likelihoods, B, also called emission probabilities,
each expressing the probability of an observation being generated from a state i
• A start state, q0, and final state, qF, that are not associated with observations
Natalie Parde - UIC CS 421

13
Sample Hidden Markov Model
[Figure: the Markov chain over q0-q4 from before, now with emission distributions attached to the hidden states:]
  q1: P(x|q1) = .2, P(y|q1) = .4, P(z|q1) = .4
  q2: P(x|q2) = .1, P(y|q2) = .4, P(z|q2) = .5
  q3: P(x|q3) = .7, P(y|q3) = .1, P(z|q3) = .2
Formal Definition
• A Hidden Markov Model can be specified by enumerating the following properties:
  • The set of states, Q
  • A transition probability matrix, A, where each a_ij represents the probability of moving from state i to state j, such that Σ_{j=1}^{N} a_ij = 1 for all i
  • A sequence of T observations, O, each drawn from a vocabulary V = v_1, v_2, …, v_V
  • A sequence of observation likelihoods, B, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i
  • A start state, q0, and final state, qF, that are not associated with observations, together with transition probabilities out of q0 and into qF
Sample Hidden Markov Model
O = x, y, z
Transition probabilities:
  a01 = .7, a02 = .1, a03 = .2
  a11 = .1, a12 = .4, a13 = .2, a14 = .3
  a21 = .2, a23 = .7, a24 = .1
  a31 = .2, a32 = .1, a33 = .3, a34 = .4
Emission probabilities:
  B1: P(x|q1) = .2, P(y|q1) = .4, P(z|q1) = .4
  B2: P(x|q2) = .1, P(y|q2) = .4, P(z|q2) = .5
  B3: P(x|q3) = .7, P(y|q3) = .1, P(z|q3) = .2
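To make this concrete, here is a minimal sketch (not from the slides) of how the sample HMM above could be stored as plain Python dictionaries; the state names and probability values simply restate the figure.

# A minimal sketch of the sample HMM as Python data structures.
# q0 (start) and q4 (final) emit nothing; q1-q3 have emission distributions.
states = ["q1", "q2", "q3"]

# Transition probabilities: A[i][j] = P(next state j | current state i)
A = {
    "q0": {"q1": 0.7, "q2": 0.1, "q3": 0.2},
    "q1": {"q1": 0.1, "q2": 0.4, "q3": 0.2, "q4": 0.3},
    "q2": {"q1": 0.2, "q3": 0.7, "q4": 0.1},
    "q3": {"q1": 0.2, "q2": 0.1, "q3": 0.3, "q4": 0.4},
}

# Emission probabilities: B[state][symbol] = P(symbol | state)
B = {
    "q1": {"x": 0.2, "y": 0.4, "z": 0.4},
    "q2": {"x": 0.1, "y": 0.4, "z": 0.5},
    "q3": {"x": 0.7, "y": 0.1, "z": 0.2},
}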
Corresponding Transition Matrix

        q0     q1     q2     q3     q4
q0     N/A    .7     .1     .2     N/A
q1     N/A    .1     .4     .2     .3
q2     N/A    .2     N/A    .7     .1
q3     N/A    .2     .1     .3     .4
q4     N/A    N/A    N/A    N/A    N/A


HMMs can also be used for probabilistic text generation!
• More generally, you can use an HMM to generate a sequence of T observations: O = o1, o2, …, oT

Begin in the start state
For t in [1, …, T]:
    Randomly select a new state based on the transition distribution for the current state
    Randomly select an observation from the new state based on the observation distribution for that state
Natalie Parde - UIC CS 421
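A minimal Python sketch of this generation procedure, assuming an HMM stored as transition and emission dictionaries like the ones shown earlier (the names A, B, and the start/final state labels are illustrative, not from the slides):

import random

def generate(A, B, start="q0", final="q4"):
    """Randomly walk the HMM, emitting one observation per visited state."""
    observations = []
    state = start
    while True:
        # Randomly select a new state based on the transition distribution
        next_states = list(A[state].keys())
        weights = list(A[state].values())
        state = random.choices(next_states, weights=weights)[0]
        if state == final:
            break
        # Randomly select an observation from the new state's distribution
        symbols = list(B[state].keys())
        probs = list(B[state].values())
        observations.append(random.choices(symbols, weights=probs)[0])
    return observations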
Sample Text Generation
Using the sample HMM above, with per-state vocabularies:
  q1: the = .3, her = .1, my = .3, Devika's = .3
  q2: dog = .2, cat = .3, lizard = .1, unicorn = .4
  q3: laughed = .5, ate = .2, slept = .3
Starting in q0, repeatedly sampling a next state and then sampling a word from that state's distribution can yield, for example: "my unicorn laughed"
Three Fundamental HMM Problems

• Observation Likelihood: How likely is a particular observation


sequence to occur?
• Decoding: What is the best sequence of hidden states for an
observed sequence?
• What is the best sequence of labels for our test data?
• Learning: What are the transition probabilities and observation
likelihoods that best fit the observation sequence and HMM states?
• How do we empirically fit our training data?

Natalie Parde - UIC CS 421


31
This Week's Topics
• Tuesday: Hidden Markov Models, Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging


Observation Likelihood
• Given a sequence of observations and an HMM, what is the probability that this sequence was generated by the model?
• Useful for two tasks:
  • Sequence classification
  • Selecting the most likely sequence
Sequence Classification

• Assuming an HMM is available for every possible class,


what is the most likely class for a given observation
sequence?
• Which HMM is most likely to have generated the
sequence?

Natalie Parde - UIC CS 421


34
Most Likely Sequence
• Of two or more possible sequences, which one was most likely generated by a given HMM?
• Example (a "Sarcasm" HMM): "I love long and confusing homework assignments." vs. "Oh, yay, I just looooove long and confusing homework assignments."
How can we compute the observation
likelihood?
• Naïve Solution:
• Consider all possible state sequences, Q, of length T that the model, 𝜆, could
have traversed in generating the given observation sequence, O
• Compute the probability of a given state sequence from A, and multiply it by
the probability of generating the given observation sequence for that state
sequence
• P(O,Q | 𝜆) = P(O | Q, 𝜆) * P(Q | 𝜆)
• Repeat for all possible state sequences, and sum over all to get P(O | 𝜆)
• But, this is computationally complex!
• O(TN^T)

Natalie Parde - UIC CS 421


36
How can we compute the
observation likelihood?

• Efficient Solution:
• Forward Algorithm: Dynamic programming
algorithm that computes the observation
probability by summing over the probabilities
of all possible hidden state paths that could
generate the observation sequence.
• Implicitly folds each of these paths into a
single forward trellis
• Why does this work?
• Markov assumption (the probability of being in
any state at a given time t only relies on the
probability of being in each possible state at
time t-1)
• Works in O(TN^2) time!

Natalie Parde - UIC CS 421 37


How does the forward algorithm work?
• Let α_t(j) be the probability of being in state j after seeing the first t observations, given your HMM λ
• α_t(j) is computed by summing over the probabilities of every path that could lead you to this cell
• α_t(j) = P(o_1, o_2, …, o_t, q_t = j | λ) = Σ_{i=1}^{N} α_{t-1}(i) a_ij b_j(o_t)
  • α_{t-1}(i): the forward path probability from the previous time step
  • a_ij: the transition probability from previous state q_i to current state q_j
  • b_j(o_t): the state observation likelihood of the observed item o_t given the current state j

Natalie Parde - UIC CS 421


38
Formal Algorithm
create a probability matrix forward[N+2, T]
for each state q in [1, …, N] do:
    forward[q, 1] ← a_{0,q} * b_q(o_1)
for each time step t from 2 to T do:
    for each state q in [1, …, N] do:
        forward[q, t] ← Σ_{q'=1}^{N} forward[q', t-1] * a_{q',q} * b_q(o_t)
forwardprob ← Σ_{q=1}^{N} forward[q, T]

Natalie Parde - UIC CS 421 39
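A minimal Python sketch of the forward algorithm above, assuming the HMM is given as dictionaries A (transitions, including a start state "q0") and B (emissions); this is an illustration of the pseudocode, not code from the course:

def forward(observations, states, A, B, start="q0"):
    """Return P(O | lambda) by summing over all hidden state paths."""
    T = len(observations)
    # alpha[t][q] = probability of seeing o_1..o_t and being in state q at time t
    alpha = [{} for _ in range(T)]
    for q in states:                           # initialization (t = 1)
        alpha[0][q] = A[start].get(q, 0.0) * B[q].get(observations[0], 0.0)
    for t in range(1, T):                      # recursion
        for q in states:
            alpha[t][q] = sum(alpha[t - 1][qp] * A[qp].get(q, 0.0) for qp in states) \
                          * B[q].get(observations[t], 0.0)
    return sum(alpha[T - 1][q] for q in states)  # termination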


Sample Problem
• You’re trying to solve a problem that relies on you knowing which
days it was hot and cold in Chicago during the summer of 1923
• Unfortunately, you have no official records of the weather in Chicago
for that summer, although you’re trying to model some key weather
patterns from that year using an HMM
• You do have one promising lead: You find a detailed diary tracking
how many ice cream cones the author of that diary ate on each day
• You decide to focus on a three-day sequence:
• Day 1: 3 ice cream cones
• Day 2: 1 ice cream cone
• Day 3: 3 ice cream cones
• Your first task is to determine whether this HMM does a good job at
modeling your sequence

Natalie Parde - UIC CS 421 40


Your HMM
States: hot, cold (plus a start state q0)
Transition probabilities:
  P(hot|start) = .8, P(cold|start) = .2
  P(hot|hot) = .7, P(cold|hot) = .3
  P(hot|cold) = .4, P(cold|cold) = .6
Emission probabilities (number of ice cream cones eaten per day):
  B1 (hot):  P(1|hot) = .2, P(2|hot) = .4, P(3|hot) = .4
  B2 (cold): P(1|cold) = .5, P(2|cold) = .4, P(3|cold) = .1
Natalie Parde - UIC CS 421 41


Forward Trellis

• Incorporates all the information you’ll need


to implement the forward algorithm
• Observations
• Transition probabilities
• State observation likelihoods
• Forward probabilities from earlier
observations

Natalie Parde - UIC CS 421 42


Forward Step
α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t)
[Figure: one column of the forward trellis. Each state q_j at time t collects the forward probabilities α_{t-1}(i) of every state at time t-1, weighted by the transition probabilities a_ij, and multiplies the sum by the observation likelihood b_j(o_t).]

Natalie Parde - UIC CS 421 43


Forward Trellis (worked example)
Observation sequence: 3, 1, 3

Time 1 (o1 = 3):
  α1(h) = P(h|start) * P(3|h) = .8 * .4 = .32
  α1(c) = P(c|start) * P(3|c) = .2 * .1 = .02

Time 2 (o2 = 1):
  α2(h) = α1(h) * P(h|h) * P(1|h) + α1(c) * P(h|c) * P(1|h) = .32 * .14 + .02 * .08 = .0464
  α2(c) = α1(h) * P(c|h) * P(1|c) + α1(c) * P(c|c) * P(1|c) = .32 * .15 + .02 * .30 = .054

Time 3 (o3 = 3):
  α3(h) = α2(h) * P(h|h) * P(3|h) + α2(c) * P(h|c) * P(3|h) = .0464 * .28 + .054 * .16 = .021632
  α3(c) = α2(h) * P(c|h) * P(3|c) + α2(c) * P(c|c) * P(3|c) = .0464 * .03 + .054 * .06 = .004632

Total observation likelihood:
  P(O|λ) = α3(h) + α3(c) = .021632 + .004632 = .026264
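Using the forward sketch from earlier, the ice cream example can be checked numerically (the state names "hot"/"cold" and the dictionaries below simply restate the HMM from the slides):

states = ["hot", "cold"]
A = {"q0":  {"hot": 0.8, "cold": 0.2},
     "hot": {"hot": 0.7, "cold": 0.3},
     "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot":  {1: 0.2, 2: 0.4, 3: 0.4},
     "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

print(forward([3, 1, 3], states, A, B))  # 0.026264, matching the trellis above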
We’ve so far • What is the probability that a sequence
tackled one of observations fits a given HMM?
of the • Calculate using forward probabilities!
fundamental • However, there are still two remaining
HMM tasks. tasks to explore….

Natalie Parde - UIC CS 421 53


This Week's Topics
• Tuesday: Hidden Markov Models, Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging


Decoding
• Given an observation sequence
and an HMM, what is the best
hidden state sequence?
• How do we choose a state
sequence that is optimal in some
sense (e.g., best explains the
observations)?
• Very useful for sequence
labeling!

Natalie Parde - UIC CS 421


55
Decoding

• Naïve Approach:
• For each hidden state sequence Q, compute P(O|Q)
• Pick the sequence with the highest probability
• However, this is computationally inefficient!
• O(N^T)

Natalie Parde - UIC CS 421


56
How can we decode sequences more efficiently?
• Viterbi Algorithm
  • Another dynamic programming algorithm
  • Uses a similar trellis to the Forward algorithm
  • Viterbi time complexity: O(N^2 T)

Natalie Parde - UIC CS 421


57
Viterbi Intuition
• Goal: Compute the joint probability of the observation sequence together with the best state sequence
  • So, recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state q_j:
    v_t(j) = max_{q_0, q_1, …, q_{t-1}} P(q_0, q_1, …, q_{t-1}, o_1, …, o_t, q_t = q_j | λ)
• Also record backpointers that subsequently allow you to backtrace the most probable state sequence
  • bt_t(j) stores the state at time t-1 that maximizes the probability that the system was in state q_j at time t, given the observed sequence

Natalie Parde - UIC CS 421


58
Formal Algorithm
create a path probability matrix viterbi[N+2, T]
for each state q in [1, …, N] do:
    viterbi[q, 1] ← a_{0,q} * b_q(o_1)
    backpointer[q, 1] ← 0
for each time step t in [2, …, T] do:
    for each state q in [1, …, N] do:
        viterbi[q, t] ← max_{q' ∈ [1, …, N]} viterbi[q', t-1] * a_{q',q} * b_q(o_t)
        backpointer[q, t] ← argmax_{q' ∈ [1, …, N]} viterbi[q', t-1] * a_{q',q} * b_q(o_t)
bestpathprob ← max_{q' ∈ [1, …, N]} viterbi[q', T]
bestpathpointer ← argmax_{q' ∈ [1, …, N]} viterbi[q', T]

Natalie Parde - UIC CS 421 59
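A minimal Python sketch of the Viterbi recursion above, using the same dictionary-based HMM representation as the forward sketch (again an illustration, not official course code):

def viterbi(observations, states, A, B, start="q0"):
    """Return (best path probability, best hidden state sequence)."""
    T = len(observations)
    v = [{} for _ in range(T)]    # v[t][q]: best path probability ending in q at time t
    bp = [{} for _ in range(T)]   # bp[t][q]: backpointer to the best previous state
    for q in states:              # initialization (t = 1)
        v[0][q] = A[start].get(q, 0.0) * B[q].get(observations[0], 0.0)
        bp[0][q] = None
    for t in range(1, T):         # recursion: max instead of sum
        for q in states:
            best_prev = max(states, key=lambda qp: v[t - 1][qp] * A[qp].get(q, 0.0))
            v[t][q] = v[t - 1][best_prev] * A[best_prev].get(q, 0.0) * B[q].get(observations[t], 0.0)
            bp[t][q] = best_prev
    last = max(states, key=lambda q: v[T - 1][q])
    path = [last]
    for t in range(T - 1, 0, -1):  # follow backpointers to recover the state sequence
        path.append(bp[t][path[-1]])
    return v[T - 1][last], list(reversed(path))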


Seem familiar?
• Viterbi is basically the forward algorithm + backpointers!
• Instead of summing across prior forward probabilities, we use a max function
Viterbi Trellis (worked example)
Observation sequence: 3, 1, 3

Time 1 (o1 = 3):
  v1(h) = P(h|start) * P(3|h) = .8 * .4 = .32
  v1(c) = P(c|start) * P(3|c) = .2 * .1 = .02

Time 2 (o2 = 1):
  v2(h) = max(v1(h) * P(h|h) * P(1|h), v1(c) * P(h|c) * P(1|h)) = max(.32 * .14, .02 * .08) = .0448
  v2(c) = max(v1(h) * P(c|h) * P(1|c), v1(c) * P(c|c) * P(1|c)) = max(.32 * .15, .02 * .30) = .048

Time 3 (o3 = 3):
  v3(h) = max(v2(h) * P(h|h) * P(3|h), v2(c) * P(h|c) * P(3|h)) = max(.0448 * .28, .048 * .16) ≈ .01254
  v3(c) = max(v2(h) * P(c|h) * P(3|c), v2(c) * P(c|c) * P(3|c)) = max(.0448 * .03, .048 * .06) = .00288

bestpathprob = max(.01254, .00288) = .01254
Viterbi Backtrace
bestpathprob = max(.01254, .00288) = .01254
Starting from the highest-probability cell at the final time step (v3(h)) and following the backpointers back through v2(h) and v1(h), the best hidden state sequence for observations 3, 1, 3 is: hot, hot, hot.
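Plugging the ice cream HMM into the Viterbi sketch from earlier reproduces this result (states, A, and B as defined in the forward usage example):

prob, path = viterbi([3, 1, 3], states, A, B)
print(prob)   # 0.012544 (≈ .01254 in the trellis above)
print(path)   # ['hot', 'hot', 'hot']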
The Viterbi algorithm is used in many
domains, even beyond text processing!
• Speech recognition
• Given an input acoustic signal, find the most likely sequence of words or
phonemes
• Digital error correction
• Given a received, potentially noisy signal, determine the most likely
transmitted message
• Computer vision
• Given noisy measurements in video sequences, estimate the most likely
trajectory of an object over time
• Economics
• Given historical data, predict financial market states at certain timepoints

Natalie Parde - UIC CS 421 74


This Week's Topics
• Tuesday: Hidden Markov Models, Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging


Finally …how do we train HMMs?
• If we have a set of observations, can we learn the parameters (transition probabilities and observation likelihoods) directly?
[Figure: several observed ice cream sequences (3 1 3, 2 1 3, 3 3 3, 3 2 2, 1 1 2) feeding into the hot/cold HMM whose transition and emission probabilities we want to estimate.]

Natalie Parde - UIC CS 421 76


Forward-Backward Algorithm

• Special case of expectation-maximization (EM) algorithm


• Input:
• Unlabeled sequence of observations, O
• Vocabulary of hidden states, Q
• Output: Transition probabilities and observation likelihoods

Natalie Parde - UIC CS 421 77


How does the algorithm compute these outputs?
• Iteratively estimate the counts for transitions from one state to another
  • Start with base estimates for a_ij and b_j, and iteratively improve those estimates
• Get estimated probabilities by:
  • Computing the forward probability for an observation
  • Dividing that probability mass among all the different paths that contributed to this forward probability (backward probability)
Backward Algorithm
• We define the backward probability as follows:
  • β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ)
  • Probability of generating partial observations from time t+1 until the end of the sequence, given that the HMM λ is in state i at time t
• Also computed using a trellis, but one that moves backwards instead
Natalie Parde - UIC CS 421
Backward Step
β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),   with β_T(i) = 1
[Figure: one column of the backward trellis. Each state q_i at time t sums over every state q_j at time t+1, weighting β_{t+1}(j) by the transition probability a_ij and the observation likelihood b_j(o_{t+1}).]

Natalie Parde - UIC CS 421 80
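A minimal Python sketch of the backward pass, mirroring the forward sketch (same assumed dictionary representation; the transition into the dedicated final state is omitted for brevity, so this is an illustrative variant rather than the exact slide formulation):

def backward(observations, states, A, B):
    """beta[t][q] = P(o_{t+1}, ..., o_T | q_t = q)."""
    T = len(observations)
    beta = [{} for _ in range(T)]
    for q in states:                 # initialization: beta_T(i) = 1
        beta[T - 1][q] = 1.0
    for t in range(T - 2, -1, -1):   # move backwards through the trellis
        for q in states:
            beta[t][q] = sum(A[q].get(r, 0.0) * B[r].get(observations[t + 1], 0.0) * beta[t + 1][r]
                             for r in states)
    return beta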


For the expectation step of the forward-backward algorithm, we re-estimate transition probabilities and observation likelihoods.
• We re-estimate transition probabilities, a_ij, as follows:
  • Let ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / α_T(q_F)
  • Then,
    â_ij = (expected # transitions from state i to state j) / (expected # transitions from state i)
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{k=1}^{N} ξ_t(i, k)
• Check out the course textbook (Appendix A) for an in-depth discussion of how the numerator and denominator above are derived!

Natalie Parde - UIC CS 421 81


Re-Estimating Observation Likelihood
• We re-estimate b_j as follows:
  • Let γ_t(j) = α_t(j) β_t(j) / α_T(q_F)
  • Then,
    b̂_j(v_k) = (expected # of times in state j and observing symbol v_k) / (expected # of times in state j)
             = Σ_{t=1, s.t. o_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

Natalie Parde - UIC CS 421 82


Putting it all together, we have the forward-backward algorithm!
initialize A and B
iterate until convergence:

    # Expectation Step
    compute γ_t(j) for all t and j
    compute ξ_t(i, j) for all t, i, and j

    # Maximization Step
    a_ij = â_ij for all i and j
    b_j(v_k) = b̂_j(v_k) for all j, and all v_k in the output vocab V

Natalie Parde - UIC CS 421 83
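A minimal Python sketch of a single expectation-maximization iteration over one observation sequence, using the same dictionary-based HMM representation as before. It is only an illustration: re-estimation of the start-state transitions and handling of the final state are skipped, only observed symbols get emission estimates, and no smoothing is applied.

def forward_backward_step(observations, states, A, B, start="q0"):
    """One EM (Baum-Welch) iteration: return re-estimated copies of A and B."""
    T = len(observations)
    # E-step, part 1: forward and backward lattices
    alpha = [{q: 0.0 for q in states} for _ in range(T)]
    beta = [{q: 1.0 for q in states} for _ in range(T)]
    for q in states:
        alpha[0][q] = A[start].get(q, 0.0) * B[q].get(observations[0], 0.0)
    for t in range(1, T):
        for q in states:
            alpha[t][q] = sum(alpha[t - 1][p] * A[p].get(q, 0.0) for p in states) \
                          * B[q].get(observations[t], 0.0)
    for t in range(T - 2, -1, -1):
        for q in states:
            beta[t][q] = sum(A[q].get(r, 0.0) * B[r].get(observations[t + 1], 0.0) * beta[t + 1][r]
                             for r in states)
    likelihood = sum(alpha[T - 1][q] for q in states)
    # E-step, part 2: expected state occupancies (gamma) and transitions (xi)
    gamma = [{q: alpha[t][q] * beta[t][q] / likelihood for q in states} for t in range(T)]
    xi = [{(i, j): alpha[t][i] * A[i].get(j, 0.0) * B[j].get(observations[t + 1], 0.0)
                   * beta[t + 1][j] / likelihood
           for i in states for j in states}
          for t in range(T - 1)]
    # M-step: normalize expected counts into new probabilities
    new_A = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                 for j in states}
             for i in states}
    new_B = {j: {v: sum(gamma[t][j] for t in range(T) if observations[t] == v)
                    / sum(g[j] for g in gamma)
                 for v in set(observations)}
             for j in states}
    return new_A, new_B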


Summary: Hidden Markov Models
• HMMs are probabilistic generative models for sequences
  • They make predictions based on underlying hidden states
• Three fundamental HMM problems include:
  • Computing the likelihood of a sequence of observations
  • Determining the best sequence of hidden states for an observed sequence
  • Learning HMM parameters given an observation sequence and a set of hidden states
• Observation likelihood can be computed using the forward algorithm
• Sequences of hidden states can be decoded using the Viterbi algorithm
• HMM parameters can be learned using the forward-backward algorithm

Natalie Parde - UIC CS 421 84


This Week's Topics
• Tuesday: Hidden Markov Models, Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging


What are parts
of speech?
• Traditional (broad) categories:
• noun
• verb
• adjective
• adverb
• preposition
• article
• interjection
• pronoun
• conjunction
• Sometimes also referred to as lexical
categories, word classes, or morphological
classes

Natalie Parde - UIC CS 421 86


Parts of Speech
• Noun: people, places, or things (doctor, mountain, cellphone….)
• Verb: actions or states (eat, sleep, be….)
• Adjective: descriptive attributes (purple, triangular, windy….)
• Adverb: modifies other words by answering how, in what way, when, where, and to what extent questions (gently, quite, quickly….)
Parts of Speech
• Pronoun: refers to nouns mentioned elsewhere (he, she, you….)
• Preposition: describes a relationship between a noun/pronoun and another word in the clause (on, above, to….)
• Article: indicates specificity (a, an, the….)
• Interjection: exclamations (oh, yikes, ah….)
• Conjunction: coordinates words in the same clause or connects multiple clauses/sentences (and, but, if….)
What is part-of-speech (POS) tagging?
• The process of automatically assigning grammatical word classes to individual tokens in text.
Why is POS tagging useful?
• Even when using end-to-end approaches or pretrained LLMs, POS tagging is useful.
• Offers an avenue for interpretable linguistic analysis!


POS Tag Categories
Each POS type falls into one of two larger classes:

• Open
• Closed

Open class:

• New members can be created at any time


• In English:
• Nouns, verbs, adjectives, and adverbs
• Many (but not all!) languages have these four classes

Closed class:

• A small, fixed membership …new members cannot be created spontaneously


• Usually function words
• In English:
• Prepositions and auxiliaries (may, can, been, etc.)

Natalie Parde - UIC CS 421 92


Finer-Grained POS Classes
• Broader POS classes often have smaller subclasses
  • Noun:
    • Proper (Illinois)
    • Common (state)
  • Verb:
    • Main (tweet)
    • Modal (had)
• Some subclasses of a part of speech might be open, while others are closed
Open Class
• Nouns: Proper (IBM, Italy), Common (cat / cats, snow)
• Verbs: Main (see, registered)
• Adjectives: old, older, oldest
• Adverbs: slowly

Closed Class
• Determiners: the, some
• Modal verbs: can, had
• Prepositions: to, with
• Conjunctions: and, or


POS Tagging
• Can be very challenging!
• Words often have more than one valid part of speech tag
  • Today's faculty meeting went really well! = adverb
  • Do you think the undergrads are well? = adjective
  • Well, did you see the latest response to your email? = interjection
  • Jurafsky and Martin's book is a well of information. = noun
  • Laughter began to well up inside her at, as always, a highly inconvenient time. = verb


POS Tagging
• Goal: Determine the best POS tag for a particular instance of a word.
  Give me a break!  (verb pronoun determiner noun)
  Did the window break?  (verb determiner noun verb)


This Week's Topics
• Tuesday: Hidden Markov Models, Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging


POS Tagsets
In order to determine which POS tag to assign to a
word, we first need to decide which tagset we will use

Tagset: A finite set of POS tags, where each tag


defines a distinct grammatical role

Can range from very coarse to very fine

Natalie Parde - UIC CS 421 98


Penn Treebank Tagset
• Most common POS tagset
• 36 POS tags + 12 other tags (punctuation and currency)
• Used when developing the Penn Treebank, a corpus created at the University of
Pennsylvania containing more than 4.5 million words of American English
• Link to documentation: https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html

Natalie Parde - UIC CS 421 99


Penn Treebank Tagset

CC    Coordinating conjunction                    NNS   Noun, plural               TO    to
CD    Cardinal number                             NNP   Proper noun, singular      UH    Interjection
DT    Determiner                                  NNPS  Proper noun, plural        VB    Verb, base form
EX    Existential there                           PDT   Predeterminer              VBD   Verb, past tense
FW    Foreign word                                POS   Possessive ending          VBG   Verb, gerund or present participle
IN    Preposition or subordinating conjunction    PRP   Personal pronoun           VBN   Verb, past participle
JJ    Adjective                                   PRP$  Possessive pronoun         VBP   Verb, non-3rd person singular present
JJR   Adjective, comparative                      RB    Adverb                     VBZ   Verb, 3rd person singular present
JJS   Adjective, superlative                      RBR   Adverb, comparative        WDT   Wh-determiner
LS    List item marker                            RBS   Adverb, superlative        WP    Wh-pronoun
MD    Modal                                       RP    Particle                   WP$   Possessive wh-pronoun
NN    Noun, singular or mass                      SYM   Symbol                     WRB   Wh-adverb


What do some of these distinctions mean?
• city = NN (noun, singular or mass)
• cities = NNS (noun, plural)
• Chicago = NNP (proper noun, singular)
• Chicagos = NNPS (proper noun, plural)


What do some of these distinctions mean?
• eat = VB (verb, base form) or VBP (verb, non-3rd person singular present)
• eats = VBZ (verb, 3rd person singular present)
• ate = VBD (verb, past tense)
• eating = VBG (verb, gerund or present participle)
• eaten = VBN (verb, past participle)
• should = MD (modal)


What do some of these distinctions mean?
• weird = JJ (adjective)
• weirder = JJR (adjective, comparative)
• weirdest = JJS (adjective, superlative)


What do some of these distinctions mean?
• calmly = RB (adverb)
• calmer = RBR (adverb, comparative)
• calmest = RBS (adverb, superlative)


As a general (but not perfect!) rule….
• The function-word tags in the Penn Treebank tagset (determiners, prepositions, pronouns, modals, conjunctions, particles, and similar) correspond to the closed class.
• The content-word tags (the noun, verb, adjective, and adverb tags) correspond to the open class.


Other Popular POS Tagsets
• Brown Corpus tagset: ~1 million words of American English text; 82 (!) POS tags
• C5 tagset: text from the British National Corpus; 61 POS tags
• C7 tagset: text from the British National Corpus; 146 (!!) POS tags


This Week's Topics
• Tuesday: Hidden Markov Models, Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging


So …how can we assign POS tags?

Natalie Parde - UIC CS 421 109


So …how can we assign POS tags?
Time flies like an arrow; fruit flies like a banana
[The slides step through this sentence word by word, consulting the Penn Treebank tagset (shown above) for each token's candidate tags.]
Ambiguity is a big issue for
POS taggers!

• Many words have multiple senses


• time = noun, verb
• flies = noun, verb
• like = verb, preposition

Natalie Parde - UIC CS 421 120


Just how ambiguous is natural language?
• Brown Corpus: Approximately 11% of word types have multiple valid part of speech labels
• These tend to be very common words!
  • We think that the meeting will only last two more hours. = IN
  • Was that the 32nd Piazza post today? = DT
  • You can't eat that many donuts every time the clock strikes midnight! = RB
• Overall, ~40% of word tokens are instances of ambiguous word types
Despite this, modern POS taggers still work quite well.
• Accuracy > 97%
• Even a simple baseline can achieve ~90% accuracy
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
Natalie Parde - UIC CS 421 122
How do POS taggers work?
Rule-Based POS Tagging

Start with a dictionary, and assign all relevant tags to the


words in that dictionary

Manually design rules to selectively remove invalid tags for


test instances in context

Keep the remaining correct tag for each word

Natalie Parde - UIC CS 421 124


Example Rule-Based Approach
• Start with a dictionary that specifies permissible tags for our small vocabulary:
  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB


Example Rule-Based Approach
Assign every possible tag to each word in the sequence:

  she    promised    to    back    the    bill
  PRP    VBN         TO    VB      DT     NN
         VBD               JJ             VB
                           RB
                           NN


Example Rule-Based Approach
Apply rules to eliminate invalid tags:
Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

  she    promised    to    back    the    bill
  PRP    VBD         TO    VB      DT     NN
                           JJ             VB
                           RB
                           NN


Example Rule-Based Approach
Keep the remaining correct tag for each word: once all rules have been applied, each word is left with a single tag (e.g., she/PRP promised/VBD to/TO back/VB the/DT bill/NN).


Rule-based POS taggers are an adequate baseline, but….
• Like all rule-based methods, they carry important disadvantages:
  • Time-consuming to build
  • Difficult to update or generalize to new domains
  • Might miss important patterns latent in the specified text domain

Natalie Parde - UIC CS 421 129


Nice alternative to rule-
based POS tagging?
• Statistical POS Tagging: POS taggers that make
decisions based on learned knowledge of POS tag
distribution in a training corpus
• the is usually tagged as DT
• Words with uppercase letters are more likely to be
tagged NNP or NNPS
• Words starting with the prefix un- may be tagged JJ
• Words ending with the suffix –ly may be tagged RB

Natalie Parde - UIC CS 421 130


Simple Statistical POS Tagger

• Using a training corpus, determine the most frequent tag for each
word
• Assign POS tags to new words based on those frequencies
• Assign NN to new words for which there is no information from the
training corpus

I saw a wampimuk at the zoo yesterday!

Natalie Parde - UIC CS 421 131


Simple Statistical POS Tagger
• Using a training corpus, determine the most frequent tag for each word
• Assign POS tags to new words based on those frequencies
• Assign NN to new words for which there is no information from the training corpus

  I (95% PRP)  saw (75% VBD)  a (95% DT)  wampimuk (???)  at (90% IN)  the (95% DT)  zoo (90% NN)  yesterday (85% NN)

  → I/PRP saw/VBD a/DT wampimuk/NN at/IN the/DT zoo/NN yesterday/NN

Natalie Parde - UIC CS 421 133
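A minimal sketch of the most-frequent-tag baseline, assuming the training corpus is available as (word, tag) pairs; the tiny corpus in the usage example is illustrative only.

from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from a hand-tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # Keep only the single most frequent tag for each word type
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag(sentence, word_to_tag, unknown_tag="NN"):
    # Unknown words (e.g., "wampimuk") fall back to NN
    return [(w, word_to_tag.get(w, unknown_tag)) for w in sentence]

word_to_tag = train_most_frequent_tag([("saw", "VBD"), ("saw", "NN"), ("saw", "VBD"),
                                       ("the", "DT"), ("zoo", "NN")])
print(tag("I saw a wampimuk at the zoo yesterday !".split(), word_to_tag))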


Simple Statistical POS Tagger

• This approach works reasonably well
  • Approximately 90% accuracy
• However, we can do much better!
  • One way to improve upon our results is to use HMMs
Bigram HMM POS Tagger

• To determine the tag ti for a single word wi:
  • ti = argmax over all tags tj in the tagset of P(tj | ti-1) × P(wi | tj)
• This means we need to be able to compute two probabilities:
  • The probability that the tag is tj given that the previous tag is ti-1: P(tj | ti-1)
  • The probability that the word is wi given that the tag is tj: P(wi | tj)
• We can compute both of these from corpora like the Penn Treebank or the Brown Corpus
• Then, we can find the optimal sequence of tags using the Viterbi algorithm!

(A small code sketch of this per-word decision rule appears below.)
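The per-word decision rule can be written directly as a small function. This sketch assumes the transition and emission probabilities have already been estimated and stored in dictionaries keyed by tag pairs and (word, tag) pairs; the probability values in the usage example are the Brown Corpus estimates used later in these slides.

def best_tag(word, prev_tag, tagset, trans, emit):
    """ti = argmax over tj of P(tj | ti-1) * P(wi | tj).

    trans[(t_prev, t)] approximates P(t | t_prev);
    emit[(word, t)] approximates P(word | t).
    """
    return max(tagset, key=lambda t: trans.get((prev_tag, t), 0.0) * emit.get((word, t), 0.0))

trans = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047}
emit = {("fly", "VB"): 0.00012, ("fly", "NN"): 0.00057}
print(best_tag("fly", "TO", ["VB", "NN"], trans, emit))  # -> "VB"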


Example: Bigram HMM Tagger

• Given two possible sequences of tags from the Brown Corpus tagset for the following sentence, what is the best way to tag the word “fly”?

Superman/NNP is/VBZ expected/VBN to/TO fly/VB tomorrow/NR
Superman/NNP is/VBZ expected/VBN to/TO fly/NN tomorrow/NR

(NNP = Proper noun, singular; VBZ = Verb, 3rd person singular present; VBN = Verb, past participle; TO = Infinitive to; NR = Adverbial noun)
Example: Bigram HMM Tagger

• Since we’re creating a bigram HMM tagger and focusing on the word “fly,” we only need to be concerned with the subsequence “to fly tomorrow”
• For simplicity when decoding, we’ll assume that:
  • The first word in the subsequence for sure has label TO (v0(TO) = 1.0)
  • The word “tomorrow” for sure has label NR (P(“tomorrow”|NR) = 1.0)
Example: Bigram HMM Tagger

We have the following HMM sample:

[HMM state diagram: states start0, VB1, TO2, NN3, and NR4, connected by transition probabilities aij (a01, a02, a03, a11, a12, a13, a14, a21, a22, a23, a24, a31, a32, a33, a34)]

The specific transition probabilities we are interested in are a21 (TO → VB), a23 (TO → NN), a14 (VB → NR), and a34 (NN → NR).
Example: Bigram HMM Tagger

• We can estimate the transition probabilities for a21, a23, a34, and a14 using frequency counts from the Brown Corpus:
  • P(ti | ti-1) = C(ti-1 ti) / C(ti-1)
• So, P(NN|TO) = C(TO NN) / C(TO) = 0.00047
• Likewise, P(VB|TO) = C(TO VB) / C(TO) = 0.83
• P(NR|VB) = C(VB NR) / C(VB) = 0.0027
• Finally, P(NR|NN) = C(NN NR) / C(NN) = 0.0012

(A minimal sketch of this count-based estimation appears below.)
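A minimal sketch of estimating transition probabilities by relative frequency, assuming the training data is available as one tag sequence per sentence; the two-sentence corpus in the usage example is illustrative only.

from collections import Counter

def estimate_transitions(tag_sequences):
    """P(ti | ti-1) = C(ti-1 ti) / C(ti-1), estimated from tagged sentences."""
    bigram_counts, history_counts = Counter(), Counter()
    for tags in tag_sequences:
        for prev_tag, tag in zip(tags, tags[1:]):
            bigram_counts[(prev_tag, tag)] += 1
            history_counts[prev_tag] += 1
    return {bigram: count / history_counts[bigram[0]]
            for bigram, count in bigram_counts.items()}

trans = estimate_transitions([["PRP", "TO", "VB", "NR"], ["NNP", "TO", "VB"]])
print(trans[("TO", "VB")])  # 1.0 in this tiny corpus; ~0.83 in the Brown Corpus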
Example: Bigram HMM Tagger

• We have our transition probabilities …what now?
  • Observation likelihoods!
• We can also estimate these using frequency counts from the Brown Corpus:
  • P(wi | ti) = C(wi, ti) / C(ti)
• Since we’re trying to decide the best tag for “fly,” we need to compute both P(fly|VB) and P(fly|NN)
  • P(fly|VB) = C(fly, VB) / C(VB) = 0.00012
  • P(fly|NN) = C(fly, NN) / C(NN) = 0.00057

(A matching sketch for these emission estimates appears below.)
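The observation (emission) likelihoods can be estimated the same way; a minimal sketch assuming the training data is available as (word, tag) pairs, with an illustrative four-pair corpus.

from collections import Counter

def estimate_emissions(tagged_corpus):
    """P(wi | ti) = C(wi, ti) / C(ti), estimated from (word, tag) pairs."""
    pair_counts, tag_counts = Counter(), Counter()
    for word, tag in tagged_corpus:
        pair_counts[(word, tag)] += 1
        tag_counts[tag] += 1
    return {pair: count / tag_counts[pair[1]] for pair, count in pair_counts.items()}

emit = estimate_emissions([("fly", "VB"), ("run", "VB"), ("fly", "NN"), ("time", "NN")])
print(emit[("fly", "VB")], emit[("fly", "NN")])  # 0.5 0.5 here; 0.00012 and 0.00057 in Brown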
Example: Bigram HMM Tagger

• Now, to decide how to tag “fly,” we can consider our two possible sequences:
  • to (TO) fly (VB) tomorrow (NR)
  • to (TO) fly (NN) tomorrow (NR)
• We will select the tag that maximizes the probability P(ti|TO) × P(NR|ti) × P(fly|ti)
• We determine that:
  • P(VB|TO)P(NR|VB)P(fly|VB) = 0.83 * 0.0027 * 0.00012 = 0.00000027 ← optimal sequence!
  • P(NN|TO)P(NR|NN)P(fly|NN) = 0.00047 * 0.0012 * 0.00057 = 0.00000000032
Example: Bigram HMM Tagger

• Visualized in a Viterbi trellis (one column per word in “to fly tomorrow,” one row per state TO, VB, NN, NR), this would look like:
  • v0(TO) = 1.0
  • v1(VB) = 1.0 * 0.83 * 0.00012 = 9.96×10^-5
  • v1(NN) = 1.0 * 0.00047 * 0.00057 = 2.68×10^-7
  • v2(NR) = max(2.68×10^-7 * 0.0012 * 1.0, 9.96×10^-5 * 0.0027 * 1.0) = 0.00000027

(The sketch below reproduces these trellis values.)
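These trellis values can be reproduced in a few lines. The sketch below hard-codes the Brown Corpus estimates from the example and the two simplifying assumptions (v0(TO) = 1.0 and P(“tomorrow”|NR) = 1.0); it is a worked illustration of the trellis, not a general Viterbi implementation.

# Probabilities from the worked example (Brown Corpus estimates).
trans = {("TO", "VB"): 0.83, ("TO", "NN"): 0.00047,
         ("VB", "NR"): 0.0027, ("NN", "NR"): 0.0012}
emit = {("fly", "VB"): 0.00012, ("fly", "NN"): 0.00057, ("tomorrow", "NR"): 1.0}

v0 = {"TO": 1.0}  # "to" is assumed to be TO with certainty
# Score each candidate tag for "fly"
v1 = {t: v0["TO"] * trans[("TO", t)] * emit[("fly", t)] for t in ("VB", "NN")}
# Best path into NR for "tomorrow"
v2_NR = max(v1[t] * trans[(t, "NR")] * emit[("tomorrow", "NR")] for t in ("VB", "NN"))
best_fly_tag = max(("VB", "NN"), key=lambda t: v1[t] * trans[(t, "NR")])

print(v1)           # {'VB': 9.96e-05, 'NN': 2.679e-07}
print(v2_NR)        # ~2.7e-07
print(best_fly_tag) # VB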
Neural Sequence Modeling

• Use a sequential or pretrained neural network architecture
  • Recurrent neural networks
  • Transformers
• Predict a label for each item in the input sequence
  • If using a subword vocabulary, you will need to merge the labels predicted for all subwords in a word

[Diagram: a recurrent network unrolled over the input “a delicious latte,” with hidden states h0 through h3 and one predicted tag per token (determiner, adjective, noun)]

(A rough architecture sketch follows below.)
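As a rough illustration only (assuming PyTorch; a sketch of the architecture, not a complete training script), a recurrent tagger embeds each token, runs an LSTM over the sequence, and predicts one tag per position. All names, dimensions, and token ids here are illustrative.

import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """Embed each token, run a bidirectional LSTM, and score tags per position."""
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, token_ids):       # token_ids: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden)         # (batch, seq_len, tagset_size): one score vector per token

# Hypothetical usage: "a delicious latte" -> ids [4, 17, 9]; predict a tag per token.
model = RNNTagger(vocab_size=5000, tagset_size=45)
logits = model(torch.tensor([[4, 17, 9]]))
predicted_tags = logits.argmax(dim=-1)  # e.g., indices for determiner, adjective, noun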


How can POS taggers handle unknown words?

• New words are continually added to language, so it is likely that a POS tagger will encounter words not found in its training corpus
• Easy baseline approach: Assume that unknown words are nouns
• More sophisticated approach: Assume that unknown words have a probability distribution similar to other words occurring only once in the training corpus, and make an (informed) random choice
• Even more sophisticated approach: Use morphological information to choose the POS tag (for example, words ending with “ed” tend to be tagged VBN), as in the sketch below
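A minimal sketch of the morphological fallback described above; the specific surface cues and tags are illustrative, not an exhaustive or authoritative rule set.

def guess_unknown_tag(word):
    """Guess a tag for an out-of-vocabulary word from simple surface cues."""
    if word[0].isupper():
        return "NNP"   # capitalized words are often proper nouns
    if word.endswith("ed"):
        return "VBN"   # past participle guess
    if word.endswith("ly"):
        return "RB"    # adverb guess
    if word.endswith("s"):
        return "NNS"   # plural noun guess
    return "NN"        # default: noun

print(guess_unknown_tag("wampimuk"))  # NN
print(guess_unknown_tag("glorped"))   # VBN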


Evaluation Metrics for POS Taggers

• Common metrics for POS taggers are:
  • Accuracy
  • Precision
  • Recall
  • F1

(A small computation sketch follows below.)
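Token-level accuracy can be computed directly from aligned gold and predicted tag sequences; per-tag precision, recall, and F1 follow the standard definitions. The sketch below is illustrative, with scikit-learn shown only as an optional convenience.

def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

gold = ["PRP", "VBD", "DT", "NN", "IN", "DT", "NN", "NN"]
pred = ["PRP", "VBD", "DT", "NN", "IN", "DT", "NN", "RB"]
print(accuracy(gold, pred))  # 0.875

# Per-tag precision, recall, and F1 (optional; requires scikit-learn):
# from sklearn.metrics import classification_report
# print(classification_report(gold, pred))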
Comparison

• The scores computed for these metrics should be compared to alternative POS tagging methods, to place the values in context
  • Is this a good accuracy, or just okay?
• It’s good to compare to both a lower-bound baseline and an upper-bound ceiling
  • Baseline: What should your POS tagger definitely perform better than?
    • Most Frequent Class
  • Ceiling: What is the highest possible value for this task?
    • Human Agreement


What factors can impact performance?

• Many factors can lead to your results being higher or lower than expected!
• Some common factors:
  • The size of the training dataset
  • The specific characteristics of your tag set
  • The difference between your training and test corpora
  • The number of unknown words in your test corpus
Summary: Part-of-Speech Tagging

• POS tagging is the process of automatically assigning grammatical word classes (parts of speech) to individual tokens
• The most common POS tagset is the Penn Treebank tagset
• Ambiguity is common in natural language, and is a major issue that POS taggers must address
• Although POS taggers can be designed using many approaches, statistical (and neural) models are most common
