
Natural Language Processing

Neural architectures for dependency parsing


Marco Kuhlmann
Department of Computer and Information Science

This work is licensed under a Creative Commons Attribution 4.0 International License.
Learning problems in dependency parsing

• Learning a greedy transition-based dependency parser amounts to learning the transition classifier.
Chen and Manning (2014), Kiperwasser and Goldberg (2016)

• Learning an arc-factored graph-based dependency parser amounts to learning the arc scores.
Kiperwasser and Goldberg (2016), Glavaš and Vulić (2021)
Chen and Manning (2014)

• Pre-neural transition classifiers relied on linear models with hand-crafted combination features.

• C & M propose to replace the linear model with a two-layer feedforward network (FNN).

• The standard choice for the transfer function is the rectified linear unit (ReLU).
C & M use the cube function, 𝑓(𝑥) = 𝑥³.

[Diagram: the feedforward neural network as a stack of a linear layer, the non-linearity (cube instead of ReLU), and a second linear layer.]
[Figure: the C & M architecture on the sentence "I wanted to try someplace new". The items at selected positions of the current configuration (stack 2, stack 1, buffer 1; here "to", "try", "someplace") are embedded, the embeddings are concatenated and passed through the FNN, and a softmax yields scores for the transitions. A second panel shows the configuration after the next transition, with "wanted", "try", "someplace" at the selected positions.]
Chen and Manning (2014) – Features

• C & M embed the top 3 words on the stack and buffer, as well as certain descendants of the top words on the stack (see the sketch below).

• In addition to word embeddings, they also use embeddings for part-of-speech tags and dependency labels.
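To make the architecture concrete, here is a minimal PyTorch sketch of a classifier of this kind. The feature counts, embedding sizes and hidden size are illustrative assumptions, not necessarily the exact values from the paper.

```python
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    """Two-layer feedforward transition classifier in the spirit of
    Chen and Manning (2014). Feature counts and dimensions are
    illustrative assumptions."""

    def __init__(self, n_words, n_tags, n_labels, n_transitions,
                 emb_dim=50, hidden_dim=200,
                 n_word_feats=18, n_tag_feats=18, n_label_feats=12):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)
        self.tag_emb = nn.Embedding(n_tags, emb_dim)
        self.label_emb = nn.Embedding(n_labels, emb_dim)
        n_feats = n_word_feats + n_tag_feats + n_label_feats
        self.hidden = nn.Linear(n_feats * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)

    def forward(self, word_feats, tag_feats, label_feats):
        # Each *_feats tensor holds the indices of the extracted features
        # for a batch of configurations: (batch, n_*_feats).
        x = torch.cat([self.word_emb(word_feats).flatten(1),
                       self.tag_emb(tag_feats).flatten(1),
                       self.label_emb(label_feats).flatten(1)], dim=-1)
        h = self.hidden(x) ** 3      # cube activation instead of ReLU
        return self.out(h)           # transition logits; softmax is in the loss
```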
Chen and Manning (2014) – Training

• To train their parser, C & M minimise cross-entropy loss relative to the gold-standard action, plus an L2 regularisation term (see the sketch below).

• To generate training examples for the transition classifier, they use the static oracle for the arc-standard algorithm.
Training examples can therefore be generated off-line.
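Continuing the sketch above, a single training step might look as follows. The optimiser choice, vocabulary sizes and hyperparameters are assumptions; the weight_decay argument stands in for the L2 regularisation term.

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary sizes; with L dependency labels, arc-standard has
# 1 shift + L left-arc + L right-arc transitions.
model = TransitionClassifier(n_words=20_000, n_tags=50, n_labels=40,
                             n_transitions=1 + 2 * 40)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01,
                                weight_decay=1e-8)

def training_step(word_feats, tag_feats, label_feats, gold_actions):
    """One step on a batch of (configuration, gold action) pairs
    produced off-line by the static arc-standard oracle."""
    logits = model(word_feats, tag_feats, label_feats)
    # Cross-entropy against the gold-standard action; weight_decay above
    # plays the role of the L2 regularisation term.
    loss = F.cross_entropy(logits, gold_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```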
Parsing accuracy

                                UAS     LAS
Baseline, transition-based      89.4    87.3
Baseline, graph-based           90.7    87.6
Chen and Manning (2014)         91.8    89.6
Weiss et al. (2015)             93.2    91.2

Parsing accuracy on the test set of the Penn Treebank + conversion to Stanford dependencies
Kiperwasser and Goldberg (2016)

• The neural parser of C & M learns useful feature combinations, but the need to carefully design the core features remains.

• K & G propose to use a minimal set of core features based on contextualised embeddings obtained from a Bi-LSTM.
The Bi-LSTM is trained with the rest of the parser.

• They show that this approach gives state-of-the-art accuracy both for transition-based and for graph-based parsing.
[Figure: the K & G transition-based architecture on "I wanted to try someplace new". Every word is embedded and passed through a Bi-LSTM, yielding contextualised vectors 𝒗1 ... 𝒗6. The vectors at the selected positions (stack 2, stack 1, buffer 1) are concatenated and scored by an FNN, which outputs scores for the transitions. A second panel shows the configuration after the next transition.]
Features and training (transition-based parser)

• For their transition-based parser, K & G embed the top 3 words on the stack, as well as the first word in the buffer (see the sketch below).
Each position is represented by both its word and its part-of-speech tag.

• In contrast to C & M, they use a dynamic oracle, so they cannot generate training examples in an off-line fashion.
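A minimal PyTorch sketch of this setup: the whole sentence is encoded once with a Bi-LSTM, and for each configuration the vectors at the selected positions are concatenated and scored by an MLP. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMTransitionScorer(nn.Module):
    """Transition scorer in the spirit of Kiperwasser and Goldberg (2016).
    Dimensions are illustrative assumptions."""

    def __init__(self, n_words, n_tags, n_transitions,
                 word_dim=100, tag_dim=25, lstm_dim=125, mlp_dim=100,
                 n_positions=4):   # top 3 words on the stack + 1st in the buffer
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.tag_emb = nn.Embedding(n_tags, tag_dim)
        self.bilstm = nn.LSTM(word_dim + tag_dim, lstm_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(n_positions * 2 * lstm_dim, mlp_dim),
            nn.Tanh(),
            nn.Linear(mlp_dim, n_transitions))

    def encode(self, words, tags):
        # words, tags: (batch, sent_len) index tensors for full sentences.
        x = torch.cat([self.word_emb(words), self.tag_emb(tags)], dim=-1)
        vectors, _ = self.bilstm(x)          # (batch, sent_len, 2 * lstm_dim)
        return vectors

    def forward(self, vectors, positions):
        # positions: (batch, n_positions) indices of the stack/buffer items
        # in the current configuration.
        batch_idx = torch.arange(vectors.size(0)).unsqueeze(1)
        selected = vectors[batch_idx, positions]   # (batch, n_positions, 2 * lstm_dim)
        return self.mlp(selected.flatten(1))       # transition scores
```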
[Figure: the K & G graph-based arc scorer on "I wanted to try someplace new". The Bi-LSTM vectors of a candidate head and dependent are concatenated and passed through an FNN, which outputs a score for the arc. The two panels show the two directions of the candidate arc (head on the right, head on the left).]
Features and training (graph-based parser)

• For their graph-based parser, K & G embed the head and dependent of each arc.
Again, both the word and its part-of-speech tag are embedded.

• The training objective is to maximise the margin between the score of the gold tree 𝑦* and the highest-scoring incorrect tree 𝑦, using the hinge loss (see the sketch below):

loss = max(0, 1 − score(𝑥, 𝑦*) + score(𝑥, 𝑦))
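The following sketch scores a single arc from the Bi-LSTM vectors of its head and dependent and implements the margin objective as a hinge loss. It is a sketch under the dimension assumptions used above, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ArcScorer(nn.Module):
    """Arc-factored scorer in the spirit of K & G: concatenate the
    Bi-LSTM vectors of head and dependent and score them with an MLP.
    Dimensions are illustrative assumptions."""

    def __init__(self, lstm_dim=125, mlp_dim=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * lstm_dim, mlp_dim),  # two Bi-LSTM vectors of size 2 * lstm_dim
            nn.Tanh(),
            nn.Linear(mlp_dim, 1))

    def forward(self, head_vec, dep_vec):
        # head_vec, dep_vec: (..., 2 * lstm_dim) contextualised vectors.
        return self.mlp(torch.cat([head_vec, dep_vec], dim=-1)).squeeze(-1)

def margin_loss(score_gold, score_best_incorrect, margin=1.0):
    # Hinge loss: push the gold tree's score above the score of the
    # highest-scoring incorrect tree by at least the margin.
    return torch.clamp(margin - score_gold + score_best_incorrect, min=0.0)
```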
Parsing accuracy

                                   UAS     LAS
Chen and Manning (2014)            91.8    89.6
Weiss et al. (2015)                93.2    91.2
K & G (2016), graph-based          93.0    90.9
K & G (2016), transition-based     93.6    91.5

Parsing accuracy on the test set of the Penn Treebank + conversion to Stanford dependencies
Glavaš and Vulić (2021)

• G & V adopt the basic architecture of K & G but use a BERT encoder instead of a Bi-LSTM.
This requires word-level average pooling of the subword token representations.

• The arc scores are computed using a bi-affine layer (sketched below):

score(𝑥, 𝑖 → 𝑗) = 𝒘ᵢ 𝑾 𝒘ⱼ⊤ + 𝒃
[Figure: the G & V architecture on "I liked the place". BERT (stacked multi-head attention and feedforward blocks) encodes the wordpiece tokens [CLS], I, lik, ##ed, the, place; mean pooling merges the vectors of "lik" and "##ed" into a word-level vector for "liked"; the bi-affine layer then scores the arc liked → I, with "liked" as head and "I" as dependent.]
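A minimal sketch of the two ingredients, assuming word-level vectors of BERT base's hidden size: mean pooling of wordpiece vectors into word vectors, and a bi-affine layer that scores every head and dependent pair at once.

```python
import torch
import torch.nn as nn

def mean_pool(token_vecs, word_ids, n_words):
    """Average the wordpiece vectors that belong to the same word
    (e.g. 'lik' + '##ed' -> 'liked'). word_ids maps each token to its
    word index, with -1 for special tokens such as [CLS]."""
    sums = token_vecs.new_zeros(n_words, token_vecs.size(-1))
    counts = token_vecs.new_zeros(n_words, 1)
    for vec, idx in zip(token_vecs, word_ids):
        if idx >= 0:
            sums[idx] += vec
            counts[idx] += 1
    return sums / counts.clamp(min=1)

class BiaffineArcScorer(nn.Module):
    """Bi-affine arc scorer over word-level encoder states, in the spirit
    of Glavaš and Vulić (2021): score(x, i -> j) = w_i W w_j^T + b."""

    def __init__(self, hidden_dim=768):   # 768 = hidden size of BERT base
        super().__init__()
        self.W = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.b = nn.Parameter(torch.zeros(1))
        nn.init.xavier_uniform_(self.W)

    def forward(self, words):
        # words: (batch, n_words, hidden_dim) pooled word vectors.
        # Returns (batch, n_words, n_words) with entry [i, j] = w_i W w_j^T + b.
        return torch.einsum("bih,hk,bjk->bij", words, self.W, words) + self.b
```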
