
Sequence Labeling

Neural CRFs

Mausam

Types of Prediction Tasks

Sequence problems

• Many problems in NLP have data which is a sequence of characters, words, phrases, lines, or sentences …
• We can think of our task as one of labeling each item

POS tagging:
  VBG     NN          IN DT NN  IN NN
  Chasing opportunity in an age of upheaval

Word segmentation:
  B B I I B I B I B B
  而 相对 于这 些 品牌 的价

Named entity recognition:
  PERS    O         O      O  ORG  ORG
  Murdoch discusses future of News Corp.

Text segmentation: each line of a post labeled Q (question) or A (answer).
POS Tagging

DT  NNP     NN     VBD VBN   RP NN   NNS
The Georgia branch had taken on loan commitments …

DT  NN      IN NN        VBD     NNS   VBD
The average of interbank offered rates plummeted …
POS Tagging Ambiguity

• Words often have more than one POS: back
  – The back door = JJ
  – On my back = NN
  – Win the voters back = RB
  – Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
Named Entity Recognition (NER)
• A very important sub-task: find and classify names in text, for example:
  – The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Named Entity Recognition (NER)
• Entity classes: Person, Date, Location, Organization
• In the passage above: Andrew Wilkie, Rob Oakeshott and Tony Windsor are Person; 2010 is a Date; Labor and the Greens are Organization.
The Named Entity Recognition Task
Task: Predict entities in a text

  Foreign    ORG
  Ministry   ORG
  spokesman  O
  Shen       PER
  Guofang    PER
  told       O
  Reuters    ORG
  :          O

Standard evaluation is per entity, not per token.
Precision/Recall/F1 for IE/NER
• Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents)
• The measure behaves a bit funnily for IE/NER when there are boundary errors (which are common):
  – First Bank of Chicago announced earnings …
    (e.g., a system that extracts only "Bank of Chicago" makes a boundary error)
• This counts as both a false positive (fp) and a false negative (fn)
• Selecting nothing would have been better
• Some other metrics (e.g., the MUC scorer) give partial credit (according to complex rules)
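
With entities represented as (label, start, end) spans, entity-level precision, recall, and F1 take only a few lines, and the boundary-error example above then shows up as exactly one fp plus one fn. A sketch (the span format and names are illustrative):

def entity_prf1(gold_spans, pred_spans):
    """Entity-level P/R/F1 over (label, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # exact-match spans only
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Boundary error: gold "First Bank of Chicago", predicted "Bank of Chicago":
# entity_prf1([('ORG', 0, 4)], [('ORG', 1, 4)])  ->  (0.0, 0.0, 0.0)
# i.e., one fp ('ORG', 1, 4) and one fn ('ORG', 0, 4); predicting nothing
# would have incurred the fn only.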
Encoding classes for NER

  Token     IO encoding   IOB encoding
  Fred      PER           B-PER
  showed    O             O
  Sue       PER           B-PER
  Mengqiu   PER           B-PER
  Huang     PER           I-PER
  's        O             O
  new       O             O
  painting  O             O

(Note that under IO encoding, the adjacent entities "Sue" and "Mengqiu Huang" are indistinguishable from a single three-token entity; IOB's B- tag marks the boundary.)

Practically negligible differences in performance; BIO is more standard.
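
Since evaluation is per entity, predicted BIO tags have to be grouped back into spans before scoring. A small helper, as a sketch (the function name and span format are illustrative):

def bio_to_spans(tags):
    """Group BIO tags into (label, start, end) spans, end exclusive,
    e.g. ['B-PER', 'I-PER', 'O', 'B-PER'] -> [('PER', 0, 2), ('PER', 3, 4)]."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B-') or (tag.startswith('I-') and label != tag[2:]):
            if label is not None:               # close the open span
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == 'O':
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:                       # span running to the end
        spans.append((label, start, len(tags)))
    return spans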


Sequence Labeling as Independent Classification

A structured prediction task
But not a structured prediction model
Instead: independent multi-class classification per token
Sequence Labeling with BiLSTM / Transformer

What is missing? Still not modeling output structure!
Outputs are independent (of each other); see the sketch below.
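
A minimal sketch of this independent-classification tagger, assuming PyTorch (the class name and dimensions are illustrative):

import torch
import torch.nn as nn

class IndependentTagger(nn.Module):
    """BiLSTM encoder with a per-token softmax head: every tag is
    predicted independently, with no interaction between output labels."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))      # (batch, seq_len, 2*hidden)
        return self.out(h)                      # per-token tag scores

# Decoding is a per-position argmax; nothing stops inconsistent outputs
# such as I-PER immediately after O:
#   tags = model(tokens).argmax(dim=-1)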
Why Model Interactions in Output?
• Consistency is important!
• Example: Paris Hilton: "Paris" alone suggests a location and "Hilton" an organization, but the two tags should be decided jointly (here, as one Person name).
Conditional Random Fields
• Models with local dependencies
• Some independence assumptions on the output space, but not entirely independent (local dependencies)
• Exact and optimal decoding/training via dynamic programs
Local vs Global Normalization

Locally normalized models (e.g., MEMMs) normalize the distribution over tags separately at each position; a CRF normalizes once, globally, over complete label sequences.
CRFs

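Written out with the transition scores W and emission scores e that the later slides use, the linear-chain CRF defines a globally normalized distribution (a standard formulation, consistent with the decoding slides below):

$$P(Y \mid X) \;=\; \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T+1} W(y_{t-1}, y_t) + \sum_{t=1}^{T} e(X, y_t) \Big), \qquad Z(X) = \sum_{Y'} \exp\big(\mathrm{score}(X, Y')\big)$$

where y_0 = <s>, y_{T+1} = </s>, and the sum in Z(X) ranges over all possible tag sequences Y'.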
Potential Functions

In a linear-chain CRF, the potentials factor into transition scores W(y_{t-1}, y_t) between adjacent tags and emission scores e(X, y_t) tying each tag to the input; these are exactly the two terms in the score above.
Linear Chain CRF (in practice)
BiLSTM-CRF

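A sketch of the combination in the same style (names illustrative): the BiLSTM supplies the emission scores e(X, y_t), and a learned |Y| x |Y| parameter matrix supplies the transition scores W(y, y'); decoding and the partition function use the dynamic programs on the following slides.

import torch
import torch.nn as nn

class BiLSTM_CRF(nn.Module):
    """BiLSTM emissions plus a CRF transition matrix (scoring only;
    Viterbi decoding and the partition function are sketched later).
    Start/stop transitions are omitted for brevity."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_tags)
        # trans[y, y'] = W(y, y'), the score of moving from tag y to tag y'
        self.trans = nn.Parameter(0.01 * torch.randn(num_tags, num_tags))

    def emissions(self, tokens):                # e(X, y_t) for all t, y_t
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                      # (batch, seq_len, num_tags)

    def sequence_score(self, emissions, tags):
        """score(X, Y) = sum_t W(y_{t-1}, y_t) + sum_t e(X, y_t) for one
        sequence; emissions: (seq_len, num_tags), tags: list of tag ids."""
        score = emissions[0, tags[0]]
        for t in range(1, len(tags)):
            score = score + self.trans[tags[t - 1], tags[t]] + emissions[t, tags[t]]
        return score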
Properties

Decoding Problem
Given X = x1 … xT, what is the "best" tagging y1 … yT?

Several possible meanings of 'solution':
1. States which are individually most likely
2. Single best state sequence

We want the sequence y1 … yT such that P(Y|X) is maximized:

$$Y^* = \arg\max_Y P(Y \mid X)$$

[Trellis figure: K candidate tags at each of the positions x1, x2, x3, …, xT]
Most Likely Sequence
• Problem: find the most likely (Viterbi) sequence under the model
• Given model parameters, we can score any sequence pair:

  NNP VBZ    NN       NNS   CD  NN      .
  Fed raises interest rates 0.5 percent .

• In principle, we're done: list all possible tag sequences, score each one (2T+1 operations per sequence), and pick the best one (the Viterbi state sequence):

  NNP VBZ NN NNS CD NN    logP = -23
  NNP NNS NN NNS CD NN    logP = -29
  NNP VBZ VB NNS CD NN    logP = -27

• But there are |Y|^T tag sequences!
Finding the Best Trajectory
• Brute force: too many trajectories (state sequences) to list
• Option 1: Beam Search
  – A beam is a set of partial hypotheses
  – Start with just the single empty trajectory
  – At each derivation step:
    • Consider all continuations of previous hypotheses
    • Discard most, keep top k

  [Figure: beam expansion for "Fed raises …": <s>,<s> expands to <s>,Fed:N / <s>,Fed:V / <s>,Fed:J; these expand to Fed:N,raises:N / Fed:N,raises:V / Fed:V,raises:N / Fed:V,raises:V]

• Beam search works OK in practice
• … but sometimes you want the optimal answer
• … and there's often a better option than naïve beams (see the sketch below)
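
A sketch of this beam procedure over tag sequences (k and the flat score accumulation are illustrative; emissions and trans are as in the model sketch above):

import heapq

def beam_search(emissions, trans, k=4):
    """Keep the k best partial tag sequences at each position.
    emissions: (T, K) array of e scores; trans: (K, K) array of W scores.
    Approximate, unlike Viterbi."""
    T, K = emissions.shape
    beam = [(0.0, [])]                          # start: the single empty trajectory
    for t in range(T):
        candidates = []
        for score, tags in beam:
            for y in range(K):                  # consider all continuations
                s = score + emissions[t, y]
                if tags:
                    s += trans[tags[-1], y]
                candidates.append((s, tags + [y]))
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])  # keep top k
    return beam[0]                              # best (score, tags) found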
State Lattice / Trellis

[Trellis figure: one column per token of "<s> Fed raises interest rates"; each column holds nodes for the tags N, V, J, D plus <s> and </s>, with edges between adjacent columns. A second copy of the figure annotates a node with its emission score e(N).]
Dynamic Programming
• Decoding:

$$Y^* = \arg\max_Y P(Y \mid X) = \arg\max_Y \mathrm{score}(X, Y) = \arg\max_Y \; \sum_{t=1}^{T+1} W(y_{t-1}, y_t) + \sum_{t=1}^{T} e(X, y_t)$$

• First consider how to compute the max.
• Define

$$\delta_i(y_i) = \max_{y_{[1:i-1]}} \mathrm{score}(X, y_{[1..i]})$$

  – the score of the most likely label sequence ending with tag y_i at position i, given words x1, …, xT. Then:

$$\begin{aligned} \delta_i(y_i) &= \max_{y_{[1:i-1]}} \Big[ e(X, y_i) + W(y_{i-1}, y_i) + \mathrm{score}(X, y_{[1..i-1]}) \Big] \\ &= e(X, y_i) + \max_{y_{i-1}} \Big[ W(y_{i-1}, y_i) + \max_{y_{[1:i-2]}} \mathrm{score}(X, y_{[1..i-1]}) \Big] \\ &= e(X, y_i) + \max_{y_{i-1}} \big[ W(y_{i-1}, y_i) + \delta_{i-1}(y_{i-1}) \big] \end{aligned}$$
Viterbi Algorithm
• Input: x1, …, xT, W(·,·) and e(·,·)
• Initialize: δ0(<s>) = 0, and −∞ for other labels
• For i = 1 to T:
  – For each y' in the tagset:

$$\delta_i(y') = e(X, y') + \max_{y} \big[ W(y, y') + \delta_{i-1}(y) \big]$$

• Return

$$\max_{y'} \big[ W(y', \texttt{</s>}) + \delta_T(y') \big]$$

• This returns only the optimal value; keep backpointers to recover the sequence.
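
The same dynamic program as runnable NumPy, as a sketch (start/stop transitions are folded out for brevity; names are illustrative):

import numpy as np

def viterbi(emissions, trans):
    """emissions: (T, K) array of e(X, y); trans: (K, K) array of W(y, y').
    Returns the best tag sequence and its score; O(K^2 T) time, O(KT) space."""
    T, K = emissions.shape
    delta = np.full((T, K), -np.inf)
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = emissions[0]                     # delta_1(y') = e(X, y')
    for i in range(1, T):
        # scores[y, y'] = delta_{i-1}(y) + W(y, y')
        scores = delta[i - 1][:, None] + trans
        backptr[i] = scores.argmax(axis=0)      # keep backpointers
        delta[i] = emissions[i] + scores.max(axis=0)
    best = [int(delta[T - 1].argmax())]         # best final tag
    for i in range(T - 1, 0, -1):               # backchain
        best.append(int(backptr[i][best[-1]]))
    return best[::-1], float(delta[T - 1].max())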
Viterbi Algorithm

[Trellis figure: positions x1 … xT by tags 1 … K; each cell δi(y) is filled in as max_{y'} δ_{i-1}(y') + W_trans + e_obs.]

Remember: δi(y) = score of the most likely tag sequence ending with y at time i.
Terminating Viterbi

[Trellis figure: at the final position, choose max_y W(y, </s>) + δT(y); the best final score δ* is obtained as max_{y'} δ_{T-1}(y') + (transition score) + (observation score).]

Now backchain to find the final sequence.

Time: O(|Y|² T), linear in the length of the sequence
Space: O(|Y| T)
Training
• Find weights θ such that

$$\mathrm{Loss}(\theta) = -\sum \log P_{\mathrm{CRF}}(Y \mid X; \theta)$$

is minimized.
• How to compute the partition function? It is a log-sum-exp over additive terms, computed with the same dynamic program as Viterbi (the backward step is handled by autograd).
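
Concretely, the partition function uses the identical recursion with max replaced by log-sum-exp, which keeps everything differentiable. A sketch matching the Viterbi code above (scipy's logsumexp is assumed; in PyTorch, torch.logsumexp plays the same role and autograd supplies the backward step):

import numpy as np
from scipy.special import logsumexp

def log_partition(emissions, trans):
    """log Z(X) via the forward algorithm; same shapes as viterbi()."""
    T, K = emissions.shape
    alpha = emissions[0]
    for i in range(1, T):
        alpha = emissions[i] + logsumexp(alpha[:, None] + trans, axis=0)
    return float(logsumexp(alpha))

# Per-example training loss:
#   loss = log_partition(emissions, trans) - sequence_score(emissions, tags)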
BiLSTM-CRF w/ Features

MSQU: Multi-Sentence Question Understanding
• "I am taking 15 Scouts to New Zealand over Christmas and New Year. We are spending NYE in Auckland and are looking for suggestions of restaurants (maybe buffet style) which will be suitable for a large group? Ideally close to somewhere where we can watch the fireworks from. Any ideas would be welcome"

~Open Question Understanding

select x where x.type = “restaurant” and
  x.location IN “Auckland” and x.attribute = “buffet style” and
  x.attribute = “suitable for large group” and
  x.attribute PREF “somewhere we can watch fireworks from”

Key Issue: Only 150 labeled questions!
Human Insight: Features!
• Token-level features
  – Raw token, lexicalized features, POS tags, NER tags
• Hand-designed features (see the sketch below)
  – Indicator features for candidates that are likely to be types, based on targets of WH- POS words such as Which, Where, etc.
  – Indicator features for candidates that are likely to be attributes, by checking if there is an edge in the dependency graph leading up to a candidate type
  – Indicator features for adjective-noun phrases
• Cluster ids of word2vec-clustered words
• Global word counts in the post
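
A sketch of how such token-level and hand-designed features are typically assembled per token (the feature names and the WH-word list are illustrative, not the paper's exact feature set):

def token_features(tokens, pos_tags, ner_tags, i):
    """Feature dict for token i, combining raw-token, POS, and NER cues."""
    feats = {
        'token': tokens[i].lower(),
        'pos': pos_tags[i],
        'ner': ner_tags[i],
        'is_wh_word': tokens[i].lower() in {'which', 'where', 'what', 'when'},
    }
    if i > 0:                                   # a little left context
        feats['prev_token'] = tokens[i - 1].lower()
        feats['prev_pos'] = pos_tags[i - 1]
    return feats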


Question Parsing Accuracy
[Contractor, Patra, Mausam, Singla JNLE'21]

  Model                   F1 (type)   F1 (attribute)   F1 (location)   F1 (macro-avg)
  CRF (with Features)     51.4        45.3             55.7            50.8
  BiLSTM CRF              53.3        47.6             52.1            51.0
  BiLSTM CRF + Features   58.4        48.1             62.0            56.2

Neural + Features > Neural > Symbolic + Features

Question Parsing Accuracy
[Contractor, Patra, Mausam, Singla JNLE'21]

  Model                 F1 (type)   F1 (attribute)   F1 (location)   F1 (macro-avg)
  CRF                   51.4        45.3             55.7            50.8
  BiLSTM CRF            53.3        47.6             52.1            51.0
  BERT                  59.6        50.6             59.5            56.6
  BERT + BiLSTM + CRF   63.4        56.5             72.4            64.4

BERT + CRF > BERT

Summary
• BiLSTM+CRF (or more generally, Neural CRFs)
  – combines the automatic feature learning of neural models
  – with the global reasoning of CRFs
• When are CRFs helpful?
  – Joint inference
  – Low-data settings