13-Neuralcrf Pos Tagging
Neural CRFs
Mausam
Types of Prediction Tasks
Sequence problems
POS Tagging Ambiguity
Foreign ORG
Ministry ORG
spokesman O
Shen PER
Guofang PER
told O
Reuters ORG
: O
Standard evaluation is per entity, not per token.
Precision/Recall/F1 for IE/NER
• Recall and precision are straightforward for tasks like
IR and text categorization, where there is only one
grain size (documents)
• The measure behaves a bit funnily for IE/NER when
there are boundary errors (which are common):
– First Bank of Chicago announced earnings …
• A boundary error counts as both a fp and a fn
• Selecting nothing would have been better
• Some other metrics (e.g., MUC scorer) give partial
credit (according to complex rules)
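The boundary-error effect can be checked with a small exact-match scorer. This is a minimal sketch: the span format `(start, end, type)` with an exclusive end, the function name `prf1`, and the system output "Bank of Chicago" are all assumptions made for illustration.

```python
def prf1(gold, predicted):
    """Exact-match scoring over sets of (start, end, type) entity spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # spans matching in both boundaries and type
    fp = len(predicted - gold)   # predicted spans with no exact gold match
    fn = len(gold - predicted)   # gold spans the system failed to match exactly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return tp, fp, fn, precision, recall, f1

# Tokens: First(0) Bank(1) of(2) Chicago(3) announced(4) earnings(5)
gold = [(0, 4, "ORG")]            # "First Bank of Chicago"
boundary_error = [(1, 4, "ORG")]  # hypothetical system output: "Bank of Chicago"
nothing = []                      # system selects no entity at all

counts_boundary = prf1(gold, boundary_error)[:3]
counts_nothing = prf1(gold, nothing)[:3]
print(counts_boundary, counts_nothing)
```

With the boundary error the counts are one fp and one fn; selecting nothing gives only one fn, which is why it scores no worse on recall and better on false positives.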
Encoding classes for NER
IO encoding IOB encoding
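The difference between the two encodings shows up when two entities of the same type are adjacent. A minimal sketch, assuming "IOB" means the variant in which every entity starts with a `B-` tag, and using an illustrative sentence and span list that are not taken from the slides:

```python
def encode(n_tokens, spans, scheme):
    """Turn (start, end, type) spans (end exclusive) into one tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                tags[i] = label                 # inside an entity, no boundary marker
            elif i == start:
                tags[i] = "B-" + label          # first token of an entity
            else:
                tags[i] = "I-" + label          # continuation of the same entity
    return tags

# Illustrative sentence (an assumption): two PER entities are adjacent
tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang", "'s", "new", "painting"]
spans = [(0, 1, "PER"), (2, 3, "PER"), (3, 5, "PER")]

io_tags = encode(len(tokens), spans, "IO")
iob_tags = encode(len(tokens), spans, "IOB")
for tok, io, iob in zip(tokens, io_tags, iob_tags):
    print(f"{tok:10s} {io:5s} {iob}")
```

Under IO the run "Sue Mengqiu Huang" is indistinguishable from one three-token entity; IOB keeps the boundary at the cost of roughly doubling the number of entity tags.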
What is missing?
Still not modeling output structure!
Outputs are independent (of each other)
Why Model Interactions in Output?
• Consistency is important!
CRFs
Potential Functions
Linear Chain CRF (in practice)
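For reference, one common way to write the linear-chain CRF, using the emission scores e() and transition scores W() that appear in the Viterbi slides of this deck. This is a sketch under assumptions: the formula on the original slide did not survive, and the start and stop symbols and the additive (log-space) form are taken from the Viterbi initialization and termination steps.

$$\mathrm{score}(X, Y) = \sum_{i=1}^{T} \left[ e(X, y_i) + W(y_{i-1}, y_i) \right] + W(y_T, \texttt{</s>}), \qquad y_0 = \texttt{<s>}$$

$$P(Y \mid X) = \frac{\exp\left(\mathrm{score}(X, Y)\right)}{Z(X)}, \qquad Z(X) = \sum_{Y'} \exp\left(\mathrm{score}(X, Y')\right)$$

In a BiLSTM-CRF, e() would come from the BiLSTM output at each position and W() would be a learned |Y| × |Y| matrix.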
BiLSTM-CRF
Properties
Decoding Problem
Given $X = x_1 \dots x_T$, what is the “best” tagging $y_1 \dots y_T$?
$$Y^* = \arg\max_{Y} P(Y \mid X)$$
Most Likely Sequence
• Problem: find the most likely (Viterbi) sequence under the model
Given model parameters, we can score any sequence pair
NNP VBZ NN NNS CD NN .
Fed raises interest rates 0.5 percent .
[Figure: tagging trellis with one column per word and candidate tags N, V, J, D in each column; e(N) marks the emission score of a node]
$$\delta_i(y_i) = e(X, y_i) + \max_{y_{i-1}} \left[ W(y_{i-1}, y_i) + \delta_{i-1}(y_{i-1}) \right]$$
Viterbi Algorithm
• Input: x1,…,xT, W() and e()
• Initialize: δ0(<s>) = 0, and –infinity for other labels
• For i=1 to T do
– For each label y' in the tagset
$$\delta_i(y') = e(X, y') + \max_{y} \left[ W(y, y') + \delta_{i-1}(y) \right]$$
• Return
$$\max_{y'} \left[ W(y', \texttt{</s>}) + \delta_T(y') \right]$$
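The recursion can be sketched in plain Python. This is a minimal sketch, not the course's implementation: the dictionary layout for e() and W(), the default score of 0 for unlisted transitions, the back-pointers used to recover the tag sequence, and all toy scores are assumptions.

```python
def viterbi(emissions, W, tags, start="<s>", stop="</s>"):
    """emissions[i][y] plays the role of e(X, y) at position i;
    W[(y, y2)] is the transition score (unlisted pairs are assumed to score 0)."""
    # delta_0(<s>) = 0 and -infinity elsewhere, so step 1 reduces to W(<s>, y) + e
    delta = {y: W.get((start, y), 0.0) + emissions[0][y] for y in tags}
    backptrs = []
    for e_i in emissions[1:]:
        new_delta, ptr = {}, {}
        for y2 in tags:
            # best previous label for ending in y2 at this position
            best_prev = max(tags, key=lambda y: delta[y] + W.get((y, y2), 0.0))
            new_delta[y2] = e_i[y2] + delta[best_prev] + W.get((best_prev, y2), 0.0)
            ptr[y2] = best_prev
        delta = new_delta
        backptrs.append(ptr)
    # termination: add the transition into </s> and pick the best final label
    last = max(tags, key=lambda y: delta[y] + W.get((y, stop), 0.0))
    best_score = delta[last] + W.get((last, stop), 0.0)
    path = [last]
    for ptr in reversed(backptrs):  # follow back-pointers from right to left
        path.append(ptr[path[-1]])
    path.reverse()
    return best_score, path

# Toy example with made-up scores for a three-word sentence
tags = ["N", "V"]
emissions = [{"N": 2.0, "V": 0.5}, {"N": 1.0, "V": 1.5}, {"N": 1.8, "V": 0.2}]
W = {("<s>", "N"): 0.5, ("<s>", "V"): -0.5,
     ("N", "N"): 0.2, ("N", "V"): 1.0, ("V", "N"): 0.8, ("V", "V"): -1.0,
     ("N", "</s>"): 0.3, ("V", "</s>"): -0.3}

best_score, best_path = viterbi(emissions, W, tags)
print(best_path)
```

On these toy scores the best path is N, V, N. The two nested loops over the tagset inside the loop over positions are where the O(|Y|²T) running time comes from.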
Terminating Viterbi
[Figure: trellis with positions x1 … xT as columns and tags 1 … K as rows, each node holding a δ value; after the last column, choose max_y W(y, </s>) + δT(y)]
Time: $O(|Y|^2 T)$ (linear in length of sequence)
Space: $O(|Y| T)$
Training
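• Find weights such that the negative log-likelihood of the gold taggings, $-\sum \log P(Y \mid X)$, is minimized
• How to compute the partition function? Run the same recursion as Viterbi with max replaced by log_sum_exp (scores are additive terms in log space)
• The backward step is handled by autograd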
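The forward pass for the log partition function can be sketched in the same style as the Viterbi recursion. This is a minimal sketch under assumptions: the dictionary layout, the function names, the default score of 0 for unlisted transitions, and the toy scores are illustrative. Plain Python has no autograd, so only the loss value is computed here; in a framework with automatic differentiation the gradient would follow from the same computation.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(emissions, W, path, start="<s>", stop="</s>"):
    """Sum of emission and transition scores along one tag sequence."""
    score, prev = 0.0, start
    for e_i, y in zip(emissions, path):
        score += W.get((prev, y), 0.0) + e_i[y]
        prev = y
    return score + W.get((prev, stop), 0.0)

def log_partition(emissions, W, tags, start="<s>", stop="</s>"):
    """log Z(X): the Viterbi recursion with max replaced by log_sum_exp."""
    alpha = {y: W.get((start, y), 0.0) + emissions[0][y] for y in tags}
    for e_i in emissions[1:]:
        alpha = {y2: e_i[y2] + log_sum_exp([alpha[y] + W.get((y, y2), 0.0) for y in tags])
                 for y2 in tags}
    return log_sum_exp([alpha[y] + W.get((y, stop), 0.0) for y in tags])

def neg_log_likelihood(emissions, W, tags, gold_path):
    """-log P(Y|X) = log Z(X) - score(X, Y) for one training sequence."""
    return log_partition(emissions, W, tags) - sequence_score(emissions, W, gold_path)

# Toy example with made-up scores
tags = ["N", "V"]
emissions = [{"N": 2.0, "V": 0.5}, {"N": 1.0, "V": 1.5}, {"N": 1.8, "V": 0.2}]
W = {("<s>", "N"): 0.5, ("<s>", "V"): -0.5,
     ("N", "N"): 0.2, ("N", "V"): 1.0, ("V", "N"): 0.8, ("V", "V"): -1.0,
     ("N", "</s>"): 0.3, ("V", "</s>"): -0.3}

loss = neg_log_likelihood(emissions, W, tags, ["N", "V", "N"])
print(loss)
```

The forward pass costs O(|Y|²T), the same as Viterbi, whereas summing over all |Y|^T sequences directly would be exponential in T.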
BiLSTM-CRF w/ Features
MSQU: Multi-Sentence Qn Understanding
• “I am taking 15 Scouts to New Zealand over Christmas and
New Year. We are spending NYE in Auckland and are
looking for suggestions of restaurants (maybe buffet style)
which will be suitable for a large group? Ideally close to
somewhere where we can watch the fireworks from. Any
ideas would be welcome”
~Open Question Understanding
Model                   F1 (type)  F1 (attribute)  F1 (location)  F1 (macro-avg)
CRF (with Features)     51.4       45.3            55.7           50.8
BiLSTM CRF              53.3       47.6            52.1           51.0
BiLSTM CRF + Features   58.4       48.1            62.0           56.2
Model                   F1 (type)  F1 (attribute)  F1 (location)  F1 (macro-avg)
CRF                     51.4       45.3            55.7           50.8
BiLSTM CRF              53.3       47.6            52.1           51.0
BERT                    59.6       50.6            59.5           56.6
BERT + BiLSTM + CRF     63.4       56.5            72.4           64.4