
Natural Language Processing with Deep Learning

CS224N/Ling284

Richard Socher
Lecture 7: Dependency Parsing
Organization
Reminders/comments:
• Final project discussion – come meet with us
• Extra credit for the most prolific Piazza student answerers
• Midterm in two weeks
• Practice exams are on the website

Lecture Plan
1. Syntactic Structure: Constituency and Dependency
2. Dependency Grammar
3. Transition-based dependency parsing
4. Neural dependency parsing

Two views of linguistic structure:
Constituency = phrase structure grammar = context-free grammars (CFGs)

Phrase structure organizes words into nested constituents.
We can represent the grammar with CFG rules.

Basic unit: words
  the, cat, cuddly, by, door
  Det, N, Adj, P, N

Words combine into phrases
  the cuddly cat, by the door
  NP -> Det Adj N    PP -> P NP

Phrases can combine into bigger phrases
  the cuddly cat by the door
  NP -> NP PP
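To make the rules concrete, here is a minimal sketch (assuming the NLTK library is installed; the tiny grammar fragment is invented for this example, not the lecture's) that encodes the rules above and parses the example phrase:

```python
# Parse "the cuddly cat by the door" with the CFG fragment from the slide.
import nltk

grammar = nltk.CFG.fromstring("""
NP -> Det Adj N | NP PP | Det N
PP -> P NP
Det -> 'the'
Adj -> 'cuddly'
N -> 'cat' | 'door'
P -> 'by'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cuddly cat by the door".split()):
    print(tree)
# (NP (NP (Det the) (Adj cuddly) (N cat)) (PP (P by) (NP (Det the) (N door))))
```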
Example Constituency Trees
• PP attachment ambiguities in constituency structure

Two views of linguistic structure:
Dependency structure

• Dependency structure shows which words depend on (modify or are arguments of) which other words.
• Determiners, adjectives, and (sometimes) verbs modify nouns
• We will also treat prepositions as modifying nouns
• The prepositional phrases are modifying the main noun phrase
• The main noun phrase is an argument of “look”

Look for the large barking dog by the door in a crate
Ambiguity: PP attachments

Scientists study whales from space

PP attachment ambiguities in dependency structure

Scientists study whales from space

[Two dependency analyses of the sentence: “from space” attaching to “whales” vs. attaching to “study”]
Attachment ambiguities

• A key parsing decision is how we ‘attach’ various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.

• Catalan numbers: Cn = (2n)! / [(n+1)! n!]
• An exponentially growing series, which arises in many tree-like contexts
• But normally, we assume nesting.
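As a quick check of the formula, a few lines of plain Python reproduce the series (the helper name is ours, not from the lecture):

```python
# C_n counts, e.g., the possible binary bracketings of n+1 words.
from math import factorial

def catalan(n: int) -> int:
    """C_n = (2n)! / ((n+1)! * n!)"""
    return factorial(2 * n) // (factorial(n + 1) * factorial(n))

print([catalan(n) for n in range(8)])
# [1, 1, 2, 5, 14, 42, 132, 429] -- grows roughly like 4^n / n^(3/2)
```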
The rise of annotated data:
Universal Dependencies treebanks
[Universal Dependencies: http://universaldependencies.org/ ;
cf. Marcus et al. 1993, The Penn Treebank, Computational Linguistics]
The rise of annotated data
Starting off, building a treebank seems a lot slower and less useful
than building a grammar

But a treebank gives us many things


• Reusability of the labor
• Many parsers, part-of-speech taggers, etc. can be built on it
• Valuable resource for linguistics
• Broad coverage, not just a few intuitions
• Frequencies and distributional information
• A way to evaluate systems

Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations (“arrows”) called dependencies.

[Dependency tree for “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas”, with arcs labeled nsubj:pass, aux, obl, nmod, case, cc, conj, appos, flat]

The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.).

The arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate).

Usually, dependencies form a tree (connected, acyclic, single-head).
Dependency Relations

Selected dependency relations from the Universal Dependency set. (de Marneffe et al., 2014)
https://web.stanford.edu/~jurafsky/slp3/14.pdf
Pāṇini’s grammar
(c. 5th century BCE)

Gallery: http://wellcomeimages.org/indexplus/image/L0032691.html
CC BY 4.0 File:Birch bark MS from Kashmir of the Rupavatra Wellcome L0032691.jpg
Dependency Grammar/Parsing History

• The idea of dependency structure goes back a long way


• To Pāṇini’s grammar (c. 5th century BCE)
• Basic approach of 1st millennium Arabic grammarians
• Constituency/context-free grammars are a more recent invention
  • 20th century (R.S. Wells, 1947)
• Modern dependency work is often linked to the work of L. Tesnière (1959)
  • It was the dominant approach in the “East” (Russia, China, …)
  • Good for languages with freer word order
• Among the earliest kinds of parsers in NLP, even in the US:
  • David Hays, one of the founders of U.S. computational linguistics, built an early (perhaps the first) dependency parser (Hays 1962)
Dependency Grammar and
Dependency Structure

ROOT Discussion of the outstanding issues was completed .

• Some people draw the arrows one way; some the other way!
• Tesnière had them point from head to dependent…
• Ours will point from head to dependent
• We usually add a fake ROOT so every word is a dependent of precisely one other node
Dependency Conditioning Preferences

What are the sources of information for dependency parsing?


1. Bilexical affinities: [discussion → issues] is plausible
2. Dependency distance: dependencies mostly hold between nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?

ROOT Discussion of the outstanding issues was completed .
Dependency Parsing

• A sentence is parsed by choosing, for each word, what other word (including ROOT) it is a dependent of
  • i.e., find the right outgoing arrow from each word

• Usually some constraints:


• Only one word is a dependent of ROOT
• Don’t want cycles A → B, B → A
• This makes the dependencies a tree
• Final issue is whether arrows can cross (non-projective) or not

ROOT I ’ll give a talk tomorrow on bootstrapping


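A minimal sketch of what those constraints amount to in code (helper name and the particular head indices are ours, not from the lecture). Heads are 1-indexed, with 0 standing for ROOT; the list representation already enforces exactly one head per word.

```python
# Check: exactly one word depends on ROOT, and there are no cycles, so the arcs form a tree.
def is_valid_tree(heads):
    n = len(heads)                              # heads[i] = head of word i+1; 0 = ROOT
    if sum(1 for h in heads if h == 0) != 1:    # only one word is a dependent of ROOT
        return False
    for i in range(1, n + 1):                   # every word must reach ROOT by
        seen, node = set(), i                   # following heads, without looping
        while node != 0:
            if node in seen:
                return False                    # found a cycle A -> B, ..., -> A
            seen.add(node)
            node = heads[node - 1]
    return True

# "I 'll give a talk tomorrow on bootstrapping", one plausible dependency analysis:
print(is_valid_tree([3, 3, 0, 5, 3, 3, 8, 5]))  # True
```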
Methods of Dependency Parsing

1. Dynamic programming
2. Graph algorithms
You create a Minimum Spanning Tree for a sentence
McDonald et al.’s (2005) MSTParser scores dependencies independently
using an ML classifier (he uses MIRA, for online learning, but it can be
something else)
3. Constraint Satisfaction
Edges are eliminated that don’t satisfy hard constraints. Karlsson (1990), etc.
4. “Transition-based parsing” or “deterministic dependency
parsing”
Greedy choice of attachments guided by good machine learning classifiers
MaltParser (Nivre et al. 2008). Has proven highly effective.

4. Greedy transition-based parsing
[Nivre 2003]

• A simple form of greedy discriminative dependency parser


• The parser does a sequence of bottom up actions
• Roughly like “shift” or “reduce” in a shift-reduce parser, but the “reduce”
actions are specialized to create dependencies with head on left or right
• The parser has:
• a stack σ, written with top to the right
• which starts with the ROOT symbol
• a buffer β, written with top to the left
• which starts with the input sentence
• a set of dependency arcs A
• which starts off empty
• a set of actions
Basic transition-based dependency parser

Start: σ = [ROOT], β = w1, …, wn, A = ∅
1. Shift:        σ, wi|β, A   ⇒   σ|wi, β, A
2. Left-Arc_r:   σ|wi|wj, β, A   ⇒   σ|wj, β, A ∪ {r(wj, wi)}
3. Right-Arc_r:  σ|wi|wj, β, A   ⇒   σ|wi, β, A ∪ {r(wi, wj)}
Finish: σ = [w], β = ∅
Arc-standard transition-based parser
(there are other transition schemes …)

Analysis of “I ate fish”

Start:      σ = [ROOT],           β = I ate fish,   A = ∅
Shift:      σ = [ROOT I],         β = ate fish
Shift:      σ = [ROOT I ate],     β = fish
Left-Arc:   σ = [ROOT ate],       β = fish,         A += nsubj(ate → I)
Shift:      σ = [ROOT ate fish],  β = ∅
Right-Arc:  σ = [ROOT ate],       β = ∅,            A += obj(ate → fish)
Right-Arc:  σ = [ROOT],           β = ∅,            A += root([ROOT] → ate)
Finish:     β = ∅
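The same derivation can be replayed in a few lines of Python. This is only a sketch of the arc-standard mechanics with the action sequence hard-coded; a real parser predicts each action with a classifier, as on the next slide.

```python
# Replay a fixed arc-standard action sequence for "I ate fish".
def parse(words, actions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act.startswith("LEFT-ARC"):          # head = top of stack, dependent = second
            dep = stack.pop(-2)
            arcs.append((act.split(":")[1], stack[-1], dep))
        elif act.startswith("RIGHT-ARC"):         # head = second, dependent = top of stack
            dep = stack.pop()
            arcs.append((act.split(":")[1], stack[-1], dep))
    return arcs

actions = ["SHIFT", "SHIFT", "LEFT-ARC:nsubj", "SHIFT",
           "RIGHT-ARC:obj", "RIGHT-ARC:root"]
print(parse(["I", "ate", "fish"], actions))
# [('nsubj', 'ate', 'I'), ('obj', 'ate', 'fish'), ('root', 'ROOT', 'ate')]
```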
MaltParser
[Nivre and Hall 2005]

• How could we choose the next action?


• Each action is predicted by a discriminative classifier (e.g., an SVM or logistic regression classifier) over each legal move
• Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
• But you can profitably do a beam search if you wish (slower but better)
• It provides VERY fast linear time parsing
• The model’s accuracy is only slightly below the best dependency
parsers

Feature Representation

binary, sparse:  0 0 0 1 0 0 1 0 … 0 0 1 0
dim = 10^6 ~ 10^7

Indicator features
Feature templates: usually a combination of 1 ~ 3 elements from the configuration.
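For illustration, a sketch of how such indicator features are conjoined from a configuration (the feature names and templates here are invented, simplified versions of the real ones); each resulting string would be mapped to one dimension of the huge sparse binary vector.

```python
# Build a handful of indicator features from the top of the stack and the buffer.
def indicator_features(stack, buffer, pos):
    s1 = stack[-1] if stack else "<null>"
    b1 = buffer[0] if buffer else "<null>"
    return [
        f"s1.w={s1}",                              # word on top of the stack
        f"s1.t={pos.get(s1, '<null>')}",           # its POS tag
        f"b1.w={b1}",                              # first word in the buffer
        f"b1.t={pos.get(b1, '<null>')}",
        f"s1.w+b1.w={s1}+{b1}",                    # a 2-element combination template
        f"s1.t+b1.t={pos.get(s1, '<null>')}+{pos.get(b1, '<null>')}",
    ]

pos = {"He": "PRP", "has": "VBZ", "good": "JJ", "control": "NN"}
print(indicator_features(["ROOT", "has", "good"], ["control", "."], pos))
```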
Evaluation of Dependency Parsing:
(labeled) dependency accuracy

Acc = (# correct deps) / (# of deps)

ROOT She saw the video lecture
  0   1   2    3    4     5

UAS = 4 / 5 = 80%
LAS = 2 / 5 = 40%

Gold                       Parsed
1 2 She     nsubj          1 2 She     nsubj
2 0 saw     root           2 0 saw     root
3 5 the     det            3 4 the     det
4 5 video   nn             4 5 video   nsubj
5 2 lecture obj            5 2 lecture ccomp
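The metrics are easy to compute directly from (head, label) pairs; this small sketch reproduces the 80% / 40% numbers above.

```python
# UAS counts correct heads; LAS counts correct head-and-label pairs.
def uas_las(gold, pred):
    heads_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return heads_ok / n, both_ok / n

# (head, label) per word for "She saw the video lecture"
gold = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]
print(uas_las(gold, pred))   # (0.8, 0.4) -> UAS = 80%, LAS = 40%
```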
Dependency paths identify semantic relations – e.g., for protein interaction
[Erkan et al. EMNLP 07, Fundel et al. 2007, etc.]

[Dependency tree for “The results demonstrated that KaiC interacts rhythmically with SasA, KaiA, and KaiB”, with relations such as nsubj, ccomp, mark, advmod, nmod:with, conj:and]

KaiC ←nsubj interacts nmod:with→ SasA
KaiC ←nsubj interacts nmod:with→ SasA conj:and→ KaiA
KaiC ←nsubj interacts nmod:with→ SasA conj:and→ KaiB
Projectivity

• Dependencies parallel to a CFG tree must be projective


• There must not be any crossing dependency arcs when the words are laid
out in their linear order, with all arcs above the words.
• But dependency theory normally does allow non-projective
structures to account for displaced constituents
• You can’t easily get the semantics of certain constructions right without
these nonprojective dependencies

Who did Bill buy the coffee from yesterday ?


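A small sketch (our own, with heads chosen to match the example above) of a crossing-arcs check: heads are 1-indexed, 0 stands for ROOT, and ROOT arcs are ignored for simplicity.

```python
# A tree is projective iff no two arcs cross when drawn above the sentence.
def is_projective(heads):
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1) if h != 0]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            if l1 < l2 < r1 < r2:      # arcs (l1, r1) and (l2, r2) cross
                return False
    return True

# "Who did Bill buy the coffee from yesterday ?" -- "from" attaches to "Who" (word 1)
# while other arcs span over that gap, so the analysis is non-projective:
print(is_projective([4, 4, 4, 0, 6, 4, 1, 4, 4]))   # False
```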
Handling non-projectivity

• The arc-standard algorithm we presented only builds projective


dependency trees
• Possible directions:
1. Just declare defeat on nonprojective arcs
2. Use a dependency formalism which only admits projective
representations (a CFG doesn’t represent such structures…)
3. Use a postprocessor to a projective dependency parsing algorithm to
identify and resolve nonprojective links
4. Add extra transitions that can model at least most non-projective
structures (e.g., add an extra SWAP transition, cf. bubble sort)
5. Move to a parsing mechanism that does not use or require any
constraints on projectivity (e.g., the graph-based MSTParser)

Why train a neural dependency parser?
Indicator Features Revisited

• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive computation
  • More than 95% of parsing time is consumed by feature computation.

Our approach: learn a dense and compact feature representation
  dense:  0.1 0.9 -0.2 0.3 … -0.1 -0.5
  dim = ~1000
A neural dependency parser
[Chen and Manning 2014]

• English parsing to Stanford Dependencies:
  • Unlabeled attachment score (UAS) = head
  • Labeled attachment score (LAS) = head and label

Parser        UAS    LAS    sent. / s
MaltParser    89.8   87.2   469
MSTParser     91.4   88.1   10
TurboParser   92.3*  89.6*  8
C & M 2014    92.0   89.7   654
Distributed Representations

• We represent each word as a d-dimensional dense vector (i.e., word embedding)
  • Similar words are expected to have close vectors.

• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors.
  • The smaller discrete sets also exhibit many similarities:
  • NNS (plural noun) should be close to NN (singular noun).
  • num (numerical modifier) should be close to amod (adjective modifier).

[2-D visualization of word embeddings in which similar words (e.g., “is”, “was”, “were”; “come”, “go”) cluster together]
Extracting Tokens and then vector representations from configuration

• We extract a set of tokens based on the stack / buffer positions:

          word     POS    dep.
s1        good     JJ     ∅
s2        has      VBZ    ∅
b1        control  NN     ∅
lc(s1)    ∅        ∅      ∅
rc(s1)    ∅        ∅      ∅
lc(s2)    He       PRP    nsubj
rc(s2)    ∅        ∅      ∅

• We convert them to vector embeddings and concatenate them
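A simplified sketch of this extraction step (Chen and Manning use 18 word, 18 POS, and 12 label positions; here only the 7 positions from the table are shown, and the helper names are ours). Each (word, POS, dep) triple is then looked up in the corresponding embedding tables and concatenated.

```python
# Pull tokens from a configuration; missing positions become a special NULL token.
NULL = ("<NULL>", "<NULL>", "<NULL>")   # (word, POS, dep-label)

def extract_tokens(stack, buffer, left_children, right_children, pos):
    def tok(w, dep="<NULL>"):
        return (w, pos.get(w, "<NULL>"), dep) if w else NULL

    s1 = stack[-1] if len(stack) >= 1 else None
    s2 = stack[-2] if len(stack) >= 2 else None
    b1 = buffer[0] if buffer else None
    feats = [tok(s1), tok(s2), tok(b1)]
    for s in (s1, s2):                       # leftmost / rightmost attached children
        lc = left_children.get(s)            # (child-word, dep-label) or None
        rc = right_children.get(s)
        feats.append(tok(*lc) if lc else NULL)
        feats.append(tok(*rc) if rc else NULL)
    return feats

pos = {"He": "PRP", "has": "VBZ", "good": "JJ", "control": "NN", ".": "."}
print(extract_tokens(["has", "good"], ["control", "."],
                     left_children={"has": ("He", "nsubj")}, right_children={},
                     pos=pos))
```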
Model Architecture

Output layer:  y = softmax(Uh + b2)
  Softmax probabilities; the cross-entropy error will be back-propagated to the embeddings.

Hidden layer:  h = ReLU(Wx + b1)

Input layer:   x
  lookup + concat
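A minimal NumPy sketch of this forward pass, with illustrative sizes only (48 embeddings of dimension 50, hidden size 200, and just the three unlabeled actions); the real model scores all labeled transitions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(48 * 50)            # input layer: lookup + concat of embeddings
W, b1 = rng.standard_normal((200, 2400)) * 0.01, np.zeros(200)
U, b2 = rng.standard_normal((3, 200)) * 0.01, np.zeros(3)

h = np.maximum(W @ x + b1, 0)               # hidden layer: h = ReLU(Wx + b1)
logits = U @ h + b2
y = np.exp(logits - logits.max())
y /= y.sum()                                # output layer: y = softmax(Uh + b2)
print(y)                                    # probabilities over {Shift, Left-Arc, Right-Arc}
```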
Non-linearities between layers:
Why they’re needed

• For logistic regression: map to probabilities


• Here: function approximation,
e.g., for regression or classification
• Without non-linearities, deep neural networks
can’t do anything more than a linear transform
• Extra layers could just be compiled down into
a single linear transform
• People use various non-linearities

Non-linearities: sigmoid and tanh

logistic (“sigmoid”)    tanh

tanh is just a rescaled and shifted sigmoid: tanh(z) = 2·logistic(2z) − 1

tanh is often used and often performs better for deep nets
• Its output is symmetric around 0
Non-linearities: hard tanh

• Faster to compute than tanh (no exps or division)
• But suffers from “dead neurons”
  • If our model is initialized such that a neuron is always 1, it will never change!
• “Saturated neurons” can also be a problem for regular tanh – initializing NNs right is really important!

Non-linearities: ReLU

• Also fast to compute, but can also cause dead neurons
• rect(z) = max(z, 0)
• Mega common: the “go-to” activation function
• Transfers a linear activation when active
• Lots of variants: LReLU, SELU, ELU, PReLU…
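For reference, a tiny NumPy sketch of the non-linearities from the last few slides (our own code, with the tanh identity checked numerically):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_via_sigmoid(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0      # rescaled, shifted sigmoid

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)             # no exps or division

def relu(z):
    return np.maximum(z, 0.0)                # rect(z) = max(z, 0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), hard_tanh(z))
print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))   # True
```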
Dependency parsing for sentence structure

Neural networks can accurately determine the structure of sentences, supporting interpretation.

Chen and Manning (2014) was the first simple, successful neural dependency parser.

The dense representations let it outperform other greedy parsers in both accuracy and speed.
Further developments in transition-based neural dependency parsing

This work was further developed and improved by others, including in particular at Google:
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence

Leading to SyntaxNet and the Parsey McParseFace model:
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

Method                 UAS     LAS (PTB WSJ SD 3.3)
Chen & Manning 2014    92.0    89.7
Weiss et al. 2015      93.99   92.05
Andor et al. 2016      94.61   92.79
Graph-based dependency parsers

• Compute a score for every possible dependency
• Then add an edge from each word to its highest-scoring candidate head

ROOT The big cat sat
[e.g., picking the head for “big”: the candidate arcs carry scores such as 0.5, 0.3, 2.0, 0.8]
Neural graph-based dependency parsers

• Compute a score for every possible dependency
• Then add an edge from each word to its highest-scoring candidate head
• Really great results!
• But slower than transition-based parsers: there are n^2 possible dependencies in a sentence of length n.

Method                 UAS     LAS (PTB WSJ SD 3.3)
Chen & Manning 2014    92.0    89.7
Weiss et al. 2015      93.99   92.05
Andor et al. 2016      94.61   92.79
Dozat & Manning 2017   95.74   93.08
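A toy sketch of the greedy version of this idea, with invented scores for the example sentence (real graph-based parsers typically decode a well-formed tree, e.g., with a maximum spanning tree algorithm, rather than picking each head independently):

```python
# Score every (head, dependent) pair, then give each word its highest-scoring head.
import numpy as np

words = ["ROOT", "The", "big", "cat", "sat"]
# scores[h, d] = score of an arc head -> dependent (toy numbers, self-arcs forbidden)
scores = np.array([
    [-np.inf, 0.1, 0.5, 0.4, 3.0],
    [0.0, -np.inf, 0.3, 0.2, 0.1],
    [0.0, 0.1, -np.inf, 0.2, 0.1],
    [0.0, 2.5, 2.0, -np.inf, 0.4],
    [0.0, 0.2, 0.8, 2.2, -np.inf],
])
heads = scores[:, 1:].argmax(axis=0)          # best head for each real word
for d, h in enumerate(heads, start=1):
    print(f"{words[h]} -> {words[d]}")
# cat -> The, cat -> big, sat -> cat, ROOT -> sat
```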
