Hidden Markov Models and POS Tagging
Natalie Parde
UIC CS 421
Sequence Modeling and Sequence Labeling
In general: assigning labels to individual tokens or spans of tokens given a longer string of input.
Example: "The students were excited about the lecture." (the → article, were → verb, about → preposition, lecture → noun)
Example sequence labeling tasks:
• Named entity recognition (e.g., labeling spans as person or organization)
• Semantic role labeling
This Week’s Topics
• Tuesday: Forward Algorithm, Viterbi Algorithm, Forward-Backward Algorithm
• Thursday: Parts of Speech, POS Tagsets, POS Tagging
Formal Definition
An HMM is specified by the following components:
• Q = q_1, q_2, …, q_N: a set of N states
• A = a_01, a_02, …, a_ij, …: a transition probability matrix, where each a_ij is the probability of moving from state i to state j
• O = o_1, o_2, …, o_T: a sequence of T observations, each drawn from a vocabulary V
• B = b_i(o_t): a set of observation likelihoods (emission probabilities), each expressing the probability of observation o_t being generated from state i
• q_0, q_F: special start and end states that are not associated with observations

Sample Hidden Markov Model
Observation vocabulary: O = {x, y, z}
States: q_0 (start), q_1, q_2, q_3, q_4 (end)
Transition probabilities: a_01 = .7, a_02 = .1, a_03 = .2; a_11 = .1, a_12 = .4, a_13 = .2, a_14 = .3; a_21 = .2, a_23 = .7, a_24 = .1; a_31 = .2, a_32 = .1, a_33 = .3, a_34 = .4
Emission probabilities:
• B_1: P(x|q_1) = .2, P(y|q_1) = .4, P(z|q_1) = .4
• B_2: P(x|q_2) = .1, P(y|q_2) = .4, P(z|q_2) = .5
• B_3: P(x|q_3) = .7, P(y|q_3) = .1, P(z|q_3) = .2
Corresponding Transition Matrix

      q0    q1    q2    q3    q4
q0    N/A   .7    .1    .2    N/A
q1    N/A   .1    .4    .2    .3
q2    N/A   .2    N/A   .7    .1
q3    N/A   .2    .1    .3    .4
q4    N/A   N/A   N/A   N/A   N/A

(q_0 is the start state, so nothing transitions into it; q_4 is the end state, so it has no outgoing transitions.)
Sample Text Generation
Keeping the same transition structure, suppose each state now emits words:
• q_1: the = .3, her = .1, my = .3, Devika’s = .3
• q_2: dog = .2, cat = .3, lizard = .1, unicorn = .4
• q_3: laughed = .5, ate = .2, slept = .3
Starting at q_0, we repeatedly sample a transition and then sample a word from the new state’s emission distribution, stopping when we reach q_4. One such walk (q_0 → q_1 → q_2 → q_3 → q_4) generates: my unicorn laughed
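Here is a minimal generation sketch (mine, not from the slides); the probabilities are copied from the model above, while the function names are illustrative:

import random

transitions = {
    "q0": {"q1": .7, "q2": .1, "q3": .2},
    "q1": {"q1": .1, "q2": .4, "q3": .2, "q4": .3},
    "q2": {"q1": .2, "q3": .7, "q4": .1},
    "q3": {"q1": .2, "q2": .1, "q3": .3, "q4": .4},
}
emissions = {
    "q1": {"the": .3, "her": .1, "my": .3, "Devika's": .3},
    "q2": {"dog": .2, "cat": .3, "lizard": .1, "unicorn": .4},
    "q3": {"laughed": .5, "ate": .2, "slept": .3},
}

def sample(dist):
    """Draw one key from a {key: probability} dict."""
    return random.choices(list(dist), weights=dist.values())[0]

def generate():
    """Walk from q0 until q4, emitting one word per visited state."""
    state, words = "q0", []
    while True:
        state = sample(transitions[state])
        if state == "q4":
            return " ".join(words)   # e.g., "my unicorn laughed"
        words.append(sample(emissions[state]))

print(generate())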
Three Fundamental HMM Problems
• Likelihood: given an HMM λ = (A, B) and an observation sequence O, determine P(O|λ) (→ Forward Algorithm)
• Decoding: given an HMM λ and an observation sequence O, find the best hidden state sequence Q (→ Viterbi Algorithm)
• Learning: given an observation sequence O and the set of states, learn the HMM parameters A and B (→ Forward-Backward Algorithm)
How can we compute the observation likelihood?
• Naïve Solution:
  • Consider all possible state sequences, Q, of length T that the model, λ, could have traversed in generating the given observation sequence, O
  • Compute the probability of a given state sequence from A, and multiply it by the probability of generating the given observation sequence for that state sequence: P(O, Q|λ) = P(O|Q, λ) * P(Q|λ)
  • Repeat for all possible state sequences, and sum over all of them to get P(O|λ)
• But this is computationally complex: O(T·N^T)
• Efficient Solution:
  • Forward Algorithm: a dynamic programming algorithm that computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence
  • Implicitly folds each of these paths into a single forward trellis
• Why does this work?
  • The Markov assumption: the probability of being in any state at a given time t relies only on the probability of being in each possible state at time t-1
  • Works in O(T·N^2) time!
Forward Step
• Recurrence: α_t(j) = Σ_{i=1}^{N} α_{t-1}(i) · a_ij · b_j(o_t)
• Each cell combines the forward probabilities α_{t-1}(i) of every state at the previous time step, the transition probabilities a_ij into state j, and the emission probability b_j(o_t)
• Termination: forwardprob ← Σ_{q=1}^{N} forward[q, T]
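To make the recurrence concrete, here is a minimal sketch (mine, not course code) of the forward algorithm, using the ice cream HMM parameters from the worked example that follows; the variable names are my own:

import numpy as np

states = ["hot", "cold"]
pi = np.array([0.8, 0.2])                          # P(state | start)
A = np.array([[0.7, 0.3],                          # P(next | hot)
              [0.4, 0.6]])                         # P(next | cold)
B = np.array([[0.2, 0.4, 0.4],                     # P(1|hot), P(2|hot), P(3|hot)
              [0.5, 0.4, 0.1]])                    # P(1|cold), P(2|cold), P(3|cold)

def forward(obs):
    """Return the trellis of alpha values and P(O | lambda)."""
    alpha = np.zeros((len(obs), len(states)))
    alpha[0] = pi * B[:, obs[0] - 1]               # initialization
    for t in range(1, len(obs)):
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha[t] = alpha[t - 1] @ A * B[:, obs[t] - 1]
    return alpha, alpha[-1].sum()                  # termination

alpha, prob = forward([3, 1, 3])
print(alpha)   # rows reproduce the alpha values in the trellis below
print(prob)    # P(O | lambda) ≈ 0.026264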
Forward Trellis
Example HMM (the ice cream task): hidden states hot_1 (h) and cold_2 (c); observations are the number of ice creams eaten each day (1, 2, or 3).
• Start transitions: P(h|start) = .8, P(c|start) = .2
• Transitions: P(h|h) = .7, P(c|h) = .3, P(h|c) = .4, P(c|c) = .6
• Emissions: P(1|h) = .2, P(2|h) = .4, P(3|h) = .4; P(1|c) = .5, P(2|c) = .4, P(3|c) = .1
Observation sequence: 3 1 3

Filling in the trellis column by column:
• Time 1 (o_1 = 3):
  α_1(h) = P(h|start) · P(3|h) = .8 · .4 = .32
  α_1(c) = P(c|start) · P(3|c) = .2 · .1 = .02
• Time 2 (o_2 = 1):
  α_2(h) = α_1(h) · P(h|h) · P(1|h) + α_1(c) · P(h|c) · P(1|h) = .32 · .7 · .2 + .02 · .4 · .2 = .0464
  α_2(c) = α_1(h) · P(c|h) · P(1|c) + α_1(c) · P(c|c) · P(1|c) = .32 · .3 · .5 + .02 · .6 · .5 = .054
• Time 3 (o_3 = 3):
  α_3(h) = α_2(h) · .28 + α_2(c) · .16 = .0464 · .28 + .054 · .16 = .021632
  α_3(c) = α_2(h) · .03 + α_2(c) · .06 = .0464 · .03 + .054 · .06 = .004632
We’ve so far tackled one of the fundamental HMM tasks.
• What is the probability that a sequence of observations fits a given HMM?
• Calculate using forward probabilities!
• However, there are still two remaining tasks to explore….
How can we decode sequences more efficiently?
• Naïve Approach:
  • For each hidden state sequence Q, compute P(O|Q)
  • Pick the sequence with the highest probability
  • However, this is computationally inefficient: O(N^T)
• Goal: Compute the joint probability of the observation sequence together with the best state sequence
• So, recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state q_j:
  • v_t(j) = max_{q_0, q_1, …, q_{t-1}} P(q_0, q_1, …, q_{t-1}, o_1, …, o_t, q_t = q_j | λ)
  • This yields the Viterbi recurrence: v_t(j) = max_{i=1..N} v_{t-1}(i) · a_ij · b_j(o_t)
• Also record backpointers that subsequently allow you to backtrace the most probable state sequence
  • bt_t(j) stores the state at time t-1 that maximizes the probability that the system was in state q_j at time t, given the observed sequence
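Here is a minimal sketch (assumed, not from the slides) of Viterbi decoding with backpointers for the same ice cream HMM; it reproduces the trellis values in the example that follows:

import numpy as np

states = ["hot", "cold"]
pi = np.array([0.8, 0.2])                          # P(state | start)
A = np.array([[0.7, 0.3], [0.4, 0.6]])             # transition probabilities
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])   # P(1..3 | state)

def viterbi(obs):
    """Return the most probable state path and its joint probability."""
    T, N = len(obs), len(states)
    v = np.zeros((T, N))
    bt = np.zeros((T, N), dtype=int)               # backpointers
    v[0] = pi * B[:, obs[0] - 1]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A             # scores[i, j] = v_{t-1}(i) * a_ij
        bt[t] = scores.argmax(axis=0)              # best predecessor for each j
        v[t] = scores.max(axis=0) * B[:, obs[t] - 1]
    path = [int(v[-1].argmax())]                   # best final state
    for t in range(T - 1, 0, -1):                  # follow backpointers
        path.append(int(bt[t, path[-1]]))
    return [states[i] for i in reversed(path)], float(v[-1].max())

print(viterbi([3, 1, 3]))   # (['hot', 'hot', 'hot'], 0.012544)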
Viterbi Trellis
Same ice cream HMM and observation sequence (3 1 3), but now each cell takes the max over incoming paths rather than the sum:
• Time 1 (o_1 = 3):
  v_1(h) = P(h|start) · P(3|h) = .8 · .4 = .32
  v_1(c) = P(c|start) · P(3|c) = .2 · .1 = .02
• Time 2 (o_2 = 1):
  v_2(h) = max(v_1(h) · P(h|h) · P(1|h), v_1(c) · P(h|c) · P(1|h)) = max(.32 · .7 · .2, .02 · .4 · .2) = .0448
  v_2(c) = max(v_1(h) · P(c|h) · P(1|c), v_1(c) · P(c|c) · P(1|c)) = max(.32 · .3 · .5, .02 · .6 · .5) = .048
• Time 3 (o_3 = 3):
  v_3(h) = max(v_2(h) · .28, v_2(c) · .16) = max(.0448 · .28, .048 · .16) = .012544
  v_3(c) = max(v_2(h) · .03, v_2(c) · .06) = max(.0448 · .03, .048 · .06) = .00288

Viterbi Backtrace
• The largest final Viterbi value is v_3(h), so the most likely path ends in hot.
• Following the backpointers from v_3(h) through v_2(h) to v_1(h) recovers the most probable state sequence: hot, hot, hot.
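As a usage check (assuming the viterbi function and model arrays from the earlier sketch), running the sketch on this observation sequence reproduces the trellis values and the backtrace:

path, prob = viterbi([3, 1, 3])
print(path, prob)   # ['hot', 'hot', 'hot'] 0.012544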
The Viterbi algorithm is used in many domains, even beyond text processing!
• Speech recognition: given an input acoustic signal, find the most likely sequence of words or phonemes
• Digital error correction: given a received, potentially noisy signal, determine the most likely transmitted message
• Computer vision: given noisy measurements in video sequences, estimate the most likely trajectory of an object over time
• Economics: given historical data, predict financial market states at certain timepoints
Learning HMM Parameters
Given the set of hidden states and a collection of observation sequences (e.g., 313, 213, 333, 322, 112 for the ice cream HMM), how do we estimate the transition probabilities A and emission probabilities B?
How do we compute these outputs?
• Get estimated probabilities by:
  • Computing the forward probability for an observation
  • Dividing that probability mass among all the different paths that contributed to this forward probability (backward probability)
Backward Algorithm
• We define the backward probability as follows:
  • β_t(i) = P(o_{t+1}, o_{t+2}, …, o_T | q_t = i, λ)
• Probability of generating the partial observations from time t+1 until the end of the sequence, given that the HMM λ is in state i at time t
• Also computed using a trellis, but one that moves backward instead
Backward Step
• Recurrence: β_t(i) = Σ_{j=1}^{N} β_{t+1}(j) · a_ij · b_j(o_{t+1})
• Initialization: β_T(i) = 1
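A minimal sketch (assumed, not from the slides) of the backward pass for the same ice cream HMM, mirroring the forward sketch above:

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])             # transitions (hot, cold)
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])   # emissions for 1, 2, 3

def backward(obs):
    """Return the trellis of beta values for the observation sequence."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                 # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1] - 1] * beta[t + 1])
    return beta

print(backward([3, 1, 3])[0])   # beta_1 values for each state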
# Expectation Step
compute γ_t(j) for all t and j
compute ξ_t(i, j) for all t, i, and j
# Maximization Step
a_ij ← â_ij for all i and j
b_j(v_k) ← b̂_j(v_k) for all j, and all v_k in the output vocabulary V
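Building on the forward and backward sketches above (and reusing their A and B arrays), here is one way the E-step quantities might be computed; the function name and layout are illustrative, not course code:

import numpy as np

def e_step(obs, alpha, beta):
    """Return gamma[t, j] and xi[t, i, j] for one observation sequence."""
    prob = alpha[-1].sum()                         # P(O | lambda)
    gamma = alpha * beta / prob                    # gamma_t(j)
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi_t(i, j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O)
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1] - 1] * beta[t + 1] / prob
    return gamma, xi

The M-step then re-estimates each a_ij from the expected transition counts in ξ and each b_j(v_k) from the expected occupation counts in γ.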
Hidden Markov models let us reason about unobserved states underlying observed sequences, including determining the best sequence of hidden states for an observed sequence.
Parts of Speech
Examples of word classes: pronouns (he, she, you…), prepositions (on, above, to…), articles (a, an, the…), interjections (oh, yikes, ah…), conjunctions (and, but, if…)

What is part-of-speech (POS) tagging?
The process of automatically assigning grammatical word classes to individual tokens in text.
Why is POS tagging useful?
• Parts of speech disambiguate words that look identical (e.g., "lead" the verb vs. "lead" the noun)
• Even when using end-to-end approaches or pretrained LLMs, POS tagging is useful
• It offers an avenue for interpretable linguistic analysis!
Word classes may be open or closed.
• Open class: readily accepts new words (e.g., nouns, verbs)
• Closed class: membership is relatively fixed

Closed class examples:
• Determiners: the, some
• Modal verbs: can, had
• Prepositions: to, with
• Conjunctions: and, or
Common POS Tagsets
• Brown Corpus tagset: ~1 million words of American English text
• C5 tagset: text from the British National Corpus
• C7 tagset: text from the British National Corpus
Example Penn Treebank tags:

FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
Simple POS taggers still achieve reasonably high accuracy:
• Tag every word with its most frequent tag
• Tag unknown words as nouns
How do POS taggers work?

Rule-Based POS Tagging
Example rule-based approach: first assign each word every tag it can take, then apply handwritten rules to narrow the options. For a sentence like "She promised to back the bill":
• She → PRP
• promised → VBN, VBD
• to → TO
• back → VB, JJ, RB, NN
• the → DT
• bill → NN, VB
…but handcrafted rules are brittle and labor-intensive, so most modern taggers are statistical.

Simple Statistical POS Tagger
• Using a training corpus, determine the most frequent tag for each word
• Assign POS tags to new words based on those frequencies
• Assign NN to new words for which there is no information from the training corpus

Example: I saw a wampimuk at the zoo yesterday!
• I → PRP (95%), saw → VBD (75%), a → DT (95%), wampimuk → ??? (unseen, so tagged NN), at → IN (90%), the → DT (95%), zoo → NN (90%), yesterday → NN (85%)
A sketch of this baseline appears below.
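A minimal sketch (mine, not from the slides) of the most-frequent-tag baseline described above; the toy training pairs are illustrative:

from collections import Counter, defaultdict

def train(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # Keep only each word's single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, most_frequent_tag):
    # Unknown words default to NN, as in the slide.
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

model = train([("I", "PRP"), ("saw", "VBD"), ("saw", "NN"), ("saw", "VBD"),
               ("a", "DT"), ("the", "DT"), ("zoo", "NN"), ("at", "IN")])
print(tag("I saw a wampimuk at the zoo".split(), model))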
Bigram HMM POS Tagger
• To determine the tag t_i for a single word w_i:
  • t_i = argmax_{t_j ∈ {t_1, t_2, …, t_N}} P(t_j | t_{i-1}) · P(w_i | t_j)
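A minimal sketch (assumed, not from the slides) of this greedy, left-to-right decision rule; trans[prev][tag] and emit[tag][word] are assumed to be probability tables estimated from a tagged corpus:

def greedy_bigram_tag(words, trans, emit, tags, start="<s>"):
    """Tag each word given only the previous tag and the current word."""
    prev, out = start, []
    for w in words:
        # t_i = argmax_t P(t | t_{i-1}) * P(w_i | t)
        best = max(tags, key=lambda t: trans[prev].get(t, 0.0) * emit[t].get(w, 0.0))
        out.append(best)
        prev = best
    return out

In practice, the Viterbi algorithm from earlier is used instead of this greedy choice, since it optimizes the whole tag sequence jointly.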
Example: tagging "Superman is expected to fly tomorrow," where the hidden states are tags (e.g., VB1, TO2, NN3, NR4) linked by transition probabilities a_ij.
Zooming in on the decision for "fly" after "to" (with v(TO) = 1.0 from the previous step):
• v(VB) = 1.0 · P(VB|TO) · P(fly|VB) = 1.0 · 0.83 · 0.00012 = 9.96 × 10^-5
• v(NN) = 1.0 · P(NN|TO) · P(fly|NN) = 1.0 · 0.00047 · 0.00057 ≈ 2.7 × 10^-7
Since v(VB) is far larger, "fly" is correctly tagged as a verb.
Modeling t2
adjective
t1
• Use a sequential or pretrained neural
network architecture determiner
h3
• Recurrent neural networks
• Transformers h2
input sequence h1
delicious
• If using a subword vocabulary,
you will need to merge the labels
predicted for all subwords in a h0 a
word
155
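A minimal sketch (mine, not from the slides) of a recurrent neural tagger in PyTorch: embed tokens, run an LSTM, and classify each hidden state into a tag. All sizes are illustrative assumptions:

import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):           # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))
        return self.out(h)                  # (batch, seq_len, num_tags) logits

tagger = RNNTagger(vocab_size=10000, num_tags=45)   # 45 ≈ Penn Treebank tag count
logits = tagger(torch.randint(0, 10000, (1, 6)))    # e.g., a 6-token sentence
print(logits.argmax(-1))                            # one predicted tag id per token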
Evaluation
• Many factors can lead to your results being higher or lower than expected!
• Some common factors:
  • The size of the training dataset
  • The specific characteristics of your tag set