
Natural Language Understanding

Introduction to Dependency Parsing

Dependency parsing is different from constituent parsing

In ANLP and FNLP, we've already seen various parsing algorithms for context-free languages (shift-reduce, CKY, active chart). Why consider dependency parsing as a distinct topic?

• context-free parsing algorithms base their decisions on adjacency;
• in a dependency structure, a dependent need not be adjacent to its head (even if the structure is projective);
• we need new parsing algorithms to deal with non-adjacency (and with non-projectivity if present).

There are many ways to parse dependencies

We will consider two types of dependency parsers:

1. graph-based dependency parsing, based on maximum spanning trees (the MST parser of McDonald et al., 2005);
2. transition-based dependency parsing, an extension of shift-reduce parsing (the MALT parser of Nivre et al., 2006).

Alternative 3: map dependency trees to phrase structure trees and do standard CFG parsing (for projective trees) or LCFRS variants (for non-projective trees). We will not cover this here.

Note that each of these approaches arises from a different view of syntactic structure: as a set of constraints (MST), as the actions of an automaton (transition-based), or as the derivations of a grammar (CFG parsing). It is often possible to translate between these views, with some effort.
Graph-based dependency parsing as tagging

Goal: find the highest scoring dependency tree in the space of all
possible trees for a sentence.

Let x = x1 · · · xn be the input sentence, and y a dependency tree for x. Here, y is a set of dependency edges, with (i, j) ∈ y if there is an edge from xi to xj.

Intuition: since each word has exactly one parent, this is like a tagging problem, where the possible tags are the other words in the sentence (or a dummy node called root). If we edge-factorize the score of a tree so that it is simply the sum of its edge scores, then we can simply select the best incoming edge for each word... subject to the constraint that the result must be a tree.

Formalizing graph-based dependency parsing

The score of a dependency edge (i, j) is a function s(i, j). We'll discuss the form of this function a little bit later.

Then the score of dependency tree y for sentence x is:

s(x, y) = Σ_{(i,j) ∈ y} s(i, j)

Dependency parsing is the task of finding the tree y with the highest score for a given sentence x.

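To make the edge-factored view concrete, here is a minimal Python sketch (the function names and the random score matrix are illustrative assumptions, not from the lecture): it computes a tree's score as a sum of edge scores, and performs the greedy "best incoming edge" step, which is not yet guaranteed to produce a tree.

```python
import numpy as np

def tree_score(scores, heads):
    """Edge-factored score: sum of s(heads[j], j) over all words j.

    scores[i][j] = s(i, j), the score of an edge from head i to
    dependent j; node 0 is the dummy root, so heads[0] is unused.
    """
    return sum(scores[heads[j]][j] for j in range(1, len(heads)))

def greedy_heads(scores):
    """The 'tagging' step: each word picks its best incoming edge.

    The result may contain cycles; repairing them is exactly the job
    of the Chu-Liu-Edmonds algorithm on the following slides.
    """
    n = len(scores)
    heads = [-1] * n  # node 0 is the root and gets no head
    for j in range(1, n):
        heads[j] = max((i for i in range(n) if i != j),
                       key=lambda i: scores[i][j])
    return heads

# Toy usage with random scores over 3 words plus the root node.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 4))
h = greedy_heads(S)
print(h, tree_score(S, h))
```

With the John saw Mary scores from the next slides, greedy_heads would put John and saw in a cycle, motivating the repair step below.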
The best dependency parse is the maximum spanning tree

This task can be achieved using the following approach (McDonald et al., 2005):

• start with a fully connected graph G, i.e., assume a directed edge between every pair of words;
• assume you have a scoring function that assigns a score s(i, j) to every edge (i, j);
• find the maximum spanning tree (MST) of G, i.e., the directed tree with the highest overall score that includes all nodes of G;
• this is possible in O(n²) time using the Chu-Liu-Edmonds algorithm; it finds an MST which is not guaranteed to be projective;
• the highest-scoring parse is the MST of G.

Chu-Liu-Edmonds (CLE) Algorithm

Example: x = John saw Mary, with graph Gx. Start with the fully
connected graph, with scores:

[Figure: fully connected graph Gx over root, John, saw, Mary; recoverable edge scores: root→John 9, root→saw 10, root→Mary 9, John→saw 20, saw→John 30, saw→Mary 30, Mary→saw 0, Mary→John 11]
Chu-Liu-Edmonds (CLE) Algorithm

Each node j in the graph greedily selects the incoming edge with
the highest score s(i, j):
[Figure: best incoming edges: saw→John 30, John→saw 20, saw→Mary 30; John and saw form a cycle]

If a tree results, it is the maximum spanning tree. If not, there must be a cycle.
Intuition: we can break the cycle by replacing a single incoming edge to one of the nodes in the cycle. Which one? Decide recursively.
CLE Algorithm: Recursion

Identify the cycle, contract it into a single node, and recalculate the scores of incoming and outgoing edges.
Intuition: the score of an edge into the cycle is the weight of the cycle with only the incoming edge of the target word replaced. For example, the edge root→wjs scores 40: the cycle weight (20 + 30) minus John→saw (20) plus root→saw (10).

[Figure: contracted graph with node wjs = {John, saw}: root→wjs 40, root→Mary 9, wjs→Mary 30, Mary→wjs 31]

Now call CLE recursively on this contracted graph. The MST of the contracted graph is equivalent to the MST of the original graph.
CLE Algorithm: Recursion

Again, greedily collect incoming edges to all nodes:

[Figure: best incoming edges on the contracted graph: root→wjs 40, wjs→Mary 30]

This is a tree, hence it must be the MST of the graph.

CLE Algorithm: Reconstruction

Now reconstruct the uncontracted graph: the edge from wjs to Mary was originally from saw. The edge from ROOT to wjs corresponded to ROOT→saw, which breaks the cycle at saw; the remaining cycle edge saw→John is kept, so we include these edges too:

[Figure: final MST: root→saw 10, saw→John 30, saw→Mary 30]

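Putting the worked example together, here is a compact recursive sketch of Chu-Liu-Edmonds in Python (an illustrative implementation, not the course's reference code). Note it drops the constant total cycle weight when rescoring edges into the contracted node; this does not change the argmax, so its intermediate scores differ from the 40/31 on the slides, but the resulting tree is the same.

```python
def _find_cycle(heads):
    """Return the nodes of a cycle in the head assignment, or None."""
    n = len(heads)
    color = [0] * n          # 0 = unvisited, 1 = on current path, 2 = done
    color[0] = 2             # the root cannot be part of a cycle
    for start in range(1, n):
        if color[start]:
            continue
        path, j = [], start
        while color[j] == 0: # follow head pointers until we hit a seen node
            color[j] = 1
            path.append(j)
            j = heads[j]
        if color[j] == 1:    # walked back into the current path: a cycle
            return path[path.index(j):]
        for v in path:
            color[v] = 2
    return None

def cle(scores):
    """Chu-Liu-Edmonds: maximum spanning tree rooted at node 0.

    scores[i][j] = s(i, j); returns heads with heads[j] the head of j.
    """
    n = len(scores)
    heads = [-1] * n
    for j in range(1, n):    # greedy step: best incoming edge per node
        heads[j] = max((i for i in range(n) if i != j),
                       key=lambda i: scores[i][j])
    cycle = _find_cycle(heads)
    if cycle is None:
        return heads         # already a tree: this is the MST

    in_cycle = set(cycle)
    rest = [v for v in range(n) if v not in in_cycle]  # keeps root at 0
    old2new = {v: k for k, v in enumerate(rest)}
    c = len(rest)            # index of the contracted node
    NEG = float("-inf")
    sub = [[NEG] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}    # which original edges the new edges stand for
    for i in rest:
        for j in rest:
            if i != j:
                sub[old2new[i]][old2new[j]] = scores[i][j]
        # Edge into the cycle: replace the cycle edge of the target word
        # (the constant cycle weight is omitted).
        tgt = max(in_cycle, key=lambda j: scores[i][j] - scores[heads[j]][j])
        sub[old2new[i]][c] = scores[i][tgt] - scores[heads[tgt]][tgt]
        enter[old2new[i]] = tgt
        # Edge out of the cycle: best cycle-internal source.
        src = max(in_cycle, key=lambda j: scores[j][i])
        sub[c][old2new[i]] = scores[src][i]
        leave[old2new[i]] = src

    sub_heads = cle(sub)     # recurse on the contracted graph
    out = [-1] * n
    for new_j in range(1, c + 1):
        new_i = sub_heads[new_j]
        if new_j == c:       # the chosen edge into the cycle breaks it here
            out[enter[new_i]] = rest[new_i]
        elif new_i == c:     # an edge leaving the cycle
            out[rest[new_j]] = leave[new_j]
        else:
            out[rest[new_j]] = rest[new_i]
    for j in in_cycle:       # keep the remaining cycle edges
        if out[j] == -1:
            out[j] = heads[j]
    return out

# Nodes: 0 = root, 1 = John, 2 = saw, 3 = Mary. Scores from the example;
# the John→Mary score was not recoverable from the slide, so 3 is assumed
# here (any value below 30 yields the same tree).
NEG = float("-inf")
S = [[NEG,   9,  10,   9],
     [NEG, NEG,  20,   3],
     [NEG,  30, NEG,  30],
     [NEG,  11,   0, NEG]]
print(cle(S))  # [-1, 2, 0, 2]: root→saw, saw→John, saw→Mary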
Where do we get edge scores s(i, j) from?

s(x, y) = Σ_{(i,j) ∈ y} s(i, j)

For the decade after 2005: linear models trained with clever variants of SVMs, MIRA, etc.

More recently: neural networks, of course.
Scoring edges with a neural network

There are a few different formulations of this. An effective one from Zhang and Lapata (2016):

s(i, j) = P_head(wj | wi, x) = exp(g(aj, ai)) / Σ_{k=0}^{|x|} exp(g(ak, ai))

We get ai by concatenating the hidden states of a forward and backward RNN at position i.

The function g(aj, ai) computes an association score telling us how much word wi prefers word wj as its head. A simple option from among many:

g(aj, ai) = va^T · tanh(Ua · aj + Wa · ai)

Association scores are a useful way to select from a dynamic group of candidates, and underlie the idea of attention used in MT.
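A minimal numpy sketch of this head-selection scorer (the dimensions, random parameters, and random stand-ins for the BiRNN states are all assumptions for illustration; in a real parser everything below would be trained):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 8, 16, 3   # encoder size, scorer hidden size, sentence length

# a[i]: stand-in for the concatenated forward/backward RNN states at
# position i; a[0] encodes the dummy root node.
a = rng.normal(size=(n + 1, d))

# Parameters of g(a_j, a_i) = v_a^T tanh(U_a a_j + W_a a_i)
U_a = rng.normal(size=(h, d))
W_a = rng.normal(size=(h, d))
v_a = rng.normal(size=(h,))

def g(aj, ai):
    """Association score: how much word i prefers word j as its head."""
    return v_a @ np.tanh(U_a @ aj + W_a @ ai)

def head_distribution(i):
    """P_head(w_j | w_i, x): softmax over all candidate heads j."""
    raw = np.array([g(a[j], a[i]) for j in range(n + 1)])
    e = np.exp(raw - raw.max())   # numerically stabilized softmax
    return e / e.sum()

# Edge scores for the MST parser: log-probabilities, so that summing
# them over a tree corresponds to multiplying the head probabilities.
S = np.full((n + 1, n + 1), -np.inf)
for i in range(1, n + 1):        # column i = dependent, row j = head
    S[:, i] = np.log(head_distribution(i))
```

The resulting matrix uses the scores[head][dependent] layout of the cle sketch above, so the two pieces compose directly.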
Transition-based Dependency Parsing

An MST parser builds a dependency tree through graph surgery. An alternative is transition-based parsing:

• for a given parse state, the transition system defines a set of actions T which the parser can take;
• if more than one action is applicable, a classifier (e.g., an SVM) is used to decide which action to take;
• just like in the MST model, this requires a mechanism to compute scores over a set of (possibly dynamic) candidates.

Transition-based Dependency Parsing

The arc-standard transition system:

• configuration c = (s, b, A) with stack s, buffer b, set of dependency arcs A;
• initial configuration for sentence w1, . . . , wn is s = [ROOT], b = [w1, . . . , wn], A = ∅;
• c is terminal if the buffer is empty and the stack contains only ROOT; the parse tree is then given by Ac;
• if si is the ith topmost element on the stack, and bi the ith element in the buffer, then we have the following transitions:
  • LEFT-ARC(l): adds arc s1 → s2 with label l and removes s2 from the stack; precondition: |s| ≥ 2;
  • RIGHT-ARC(l): adds arc s2 → s1 with label l and removes s1 from the stack; precondition: |s| ≥ 2;
  • SHIFT: moves b1 from the buffer to the stack; precondition: |b| ≥ 1.
Transition-based Dependency Parsing

[Figure: dependency tree for "He has good control ." (PRP VBZ JJ NN .): root(ROOT, has), nsubj(has, He), dobj(has, control), amod(control, good), punct(has, .)]

Transition          Stack                    Buffer                   A
                    [ROOT]                   [He has good control .]  ∅
SHIFT               [ROOT He]                [has good control .]
SHIFT               [ROOT He has]            [good control .]
LEFT-ARC(nsubj)     [ROOT has]               [good control .]         A ∪ nsubj(has, He)
SHIFT               [ROOT has good]          [control .]
SHIFT               [ROOT has good control]  [.]
LEFT-ARC(amod)      [ROOT has control]       [.]                      A ∪ amod(control, good)
RIGHT-ARC(dobj)     [ROOT has]               [.]                      A ∪ dobj(has, control)
...                 ...                      ...                      ...
RIGHT-ARC(root)     [ROOT]                   []                       A ∪ root(ROOT, has)

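A minimal Python sketch of the arc-standard system, replaying the derivation from the table above (the two elided "..." steps are filled in here as an assumption consistent with the final tree; in a real parser a classifier would pick each action):

```python
def parse(words, actions):
    """Run an arc-standard derivation and return the set of labeled arcs.

    Stack and buffer hold word indices; index 0 is ROOT.
    Each action is 'SHIFT', ('LEFT-ARC', label), or ('RIGHT-ARC', label).
    """
    stack, buffer = [0], list(range(1, len(words) + 1))
    arcs = set()  # (head, label, dependent) triples
    for action in actions:
        if action == "SHIFT":
            assert buffer, "SHIFT needs a non-empty buffer"
            stack.append(buffer.pop(0))
        else:
            kind, label = action
            assert len(stack) >= 2, "ARC actions need two stack items"
            s1, s2 = stack[-1], stack[-2]
            if kind == "LEFT-ARC":     # adds s1 -> s2, removes s2
                arcs.add((s1, label, s2))
                del stack[-2]
            elif kind == "RIGHT-ARC":  # adds s2 -> s1, removes s1
                arcs.add((s2, label, s1))
                stack.pop()
    assert stack == [0] and not buffer, "derivation did not terminate"
    return arcs

words = ["He", "has", "good", "control", "."]
actions = ["SHIFT", "SHIFT", ("LEFT-ARC", "nsubj"),
           "SHIFT", "SHIFT", ("LEFT-ARC", "amod"), ("RIGHT-ARC", "dobj"),
           "SHIFT", ("RIGHT-ARC", "punct"), ("RIGHT-ARC", "root")]
print(parse(words, actions))
# {(2, 'nsubj', 1), (4, 'amod', 3), (2, 'dobj', 4),
#  (2, 'punct', 5), (0, 'root', 2)}
```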
Summary

Comparing MST and transition-based parsers:

• the MST parser selects the globally optimal tree, given a set of edges with scores;
• it can naturally handle projective and non-projective trees;
• a transition-based parser makes a sequence of local decisions about the best parse action;
• it can be extended to non-projective dependency trees by changing the transition set;
• accuracies are similar, but transition-based parsing is faster;
• both require classifiers over dynamic candidate sets, and these can be implemented using neural networks, conditioned on bidirectional RNN encodings of the sentence.
