Natural Language Processing With Deep Learning CS224N/Ling284
Christopher Manning
Lecture 5: Dependency Parsing
Lecture Plan
Linguistic Structure: Dependency parsing
1. Syntactic Structure: Consistency and Dependency (25 mins)
2. Dependency Grammar and Treebanks (15 mins)
3. Transition-based dependency parsing (15 mins)
4. Neural dependency parsing (15 mins)
Reminders/comments:
Assignment 2 was due just before class 🙂
Assignment 3 (dependency parsing) is out today ☹
Start installing and learning PyTorch (Assignment 3 has scaffolding)
Final project discussions – come meet with us; focus of week 5
1. Two views of linguistic structure:
Constituency = phrase structure grammar
= context-free grammars (CFGs)
Phrase structure organizes words into nested constituents
the cat
a dog
large in a crate
barking on the table
cuddly by the door
large barking
talk to
walked behind
Two views of linguistic structure:
Dependency structure
• Dependency structure shows which words depend on (modify or
are arguments of) which other words.
Scientists count whales from space
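As a concrete illustration of what a dependency analysis looks like as data, here is one plausible set of head–dependent arcs for the sentence above, stored as simple Python tuples. The relation labels are Universal Dependencies–style choices made for illustration, not taken from the slide.

```python
# One plausible dependency analysis of "Scientists count whales from space",
# stored as (dependent index, head index, relation) triples.
# Index 0 is the fake ROOT node introduced later in the lecture.
sentence = ["ROOT", "Scientists", "count", "whales", "from", "space"]

arcs = [
    (1, 2, "nsubj"),  # Scientists <- count
    (2, 0, "root"),   # count      <- ROOT
    (3, 2, "obj"),    # whales     <- count
    (4, 5, "case"),   # from       <- space
    (5, 2, "obl"),    # space      <- count  (the ambiguity: it could attach to "whales" instead)
]

for dep, head, rel in arcs:
    print(f"{rel}({sentence[head]} -> {sentence[dep]})")
```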
Shuttle veteran and longtime NASA executive Fred Gregory appointed to board
Coordination scope ambiguity
Adjectival modifier ambiguity
Verb Phrase (VP) attachment ambiguity
Dependency paths identify semantic
relations – e.g., for protein interaction
[Erkan et al. EMNLP 07, Fundel et al. 2007, etc.]
[Figure: dependency parse with a path through "demonstrated" (nsubj, ccomp arcs) linking the interacting proteins]
[Figure: dependency tree fragments of a sentence about bills on ports and immigration, submitted by a Senator, Republican of Kansas]
Pāṇini’s grammar
(c. 5th century BCE)
Gallery: http://wellcomeimages.org/indexplus/image/L0032691.html
CC BY 4.0 File:Birch bark MS from Kashmir of the Rupavatra Wellcome L0032691.jpg
• Some people draw the arrows one way; some the other way!
• Tesnière had them point from head to dependent…
• Usually add a fake ROOT so every word is a dependent of
precisely 1 other node
Starting off, building a treebank seems a lot slower and less useful
than building a grammar
Dependency Parsing
Projectivity
1. Dynamic programming
Eisner (1996) gives a clever algorithm with complexity O(n³), by producing parse items with heads at the ends rather than in the middle
2. Graph algorithms
You create a Minimum Spanning Tree for a sentence
McDonald et al.’s (2005) MSTParser scores dependencies independently using an ML classifier (it uses MIRA, an online learning algorithm, but the classifier could be something else); see the toy sketch after this list
Neural graph-based parser: Dozat and Manning (2017)
3. Constraint Satisfaction
Edges that don’t satisfy hard constraints are eliminated. Karlsson (1990), etc.
4. “Transition-based parsing” or “deterministic dependency parsing”
Greedy choice of attachments guided by good machine learning classifiers
E.g., MaltParser (Nivre et al. 2008). Has proven highly effective.
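As referenced in item 2, here is a toy sketch of the graph-based idea, not MSTParser’s actual code: score every candidate (head, dependent) pair independently, then choose a tree. The scoring function below is a hard-coded stand-in for a learned classifier, and for brevity each word simply takes its single best-scoring head; a real graph-based parser instead runs a maximum-spanning-tree algorithm (e.g. Chu-Liu/Edmonds) so the output is guaranteed to be a well-formed tree.

```python
# Toy graph-based parsing sketch: score all (head, dependent) pairs, then
# greedily give each word its best head. A real parser uses a learned scorer
# and a maximum-spanning-tree algorithm instead of the greedy step.

def greedy_graph_parse(words, score):
    nodes = ["ROOT"] + words
    result = {}
    for d in range(1, len(nodes)):                    # every word needs exactly one head
        best_head = max((h for h in range(len(nodes)) if h != d),
                        key=lambda h: score(nodes[h], nodes[d]))
        result[nodes[d]] = nodes[best_head]
    return result

# hard-coded stand-in for an ML classifier's arc scores
GOOD_ARCS = {("ROOT", "count"), ("count", "Scientists"),
             ("count", "whales"), ("space", "from"), ("count", "space")}

def toy_score(head, dep):
    return 1.0 if (head, dep) in GOOD_ARCS else 0.0

print(greedy_graph_parse(["Scientists", "count", "whales", "from", "space"], toy_score))
# -> {'Scientists': 'count', 'count': 'ROOT', 'whales': 'count', 'from': 'space', 'space': 'count'}
```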
Example (fragments of the transition sequence):
Shift
Shift
Right Arc   [root] ate ⇒ [root],  A += root([root] → ate)
Finish
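Below is a minimal Python sketch of the arc-standard transition system behind this kind of walk-through (dependency labels omitted). It simply executes a given transition sequence; a real transition-based parser such as MaltParser instead predicts each transition with a classifier. The two-word sentence in the example is made up in the spirit of the fragments above.

```python
# Minimal arc-standard transition system (unlabeled, for illustration only).
# A configuration is (stack, buffer, arcs); SHIFT, LEFT-ARC and RIGHT-ARC
# move it toward a finished parse.

def run_transitions(words, transitions):
    stack = ["[ROOT]"]       # start with the fake ROOT on the stack
    buffer = list(words)     # words still to be read
    arcs = []                # the arc set A, as (head, dependent) pairs

    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))   # move the next word onto the stack
        elif t == "LEFT-ARC":
            dep = stack.pop(-2)           # second-from-top becomes a dependent of the top
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":
            dep = stack.pop()             # top becomes a dependent of the item below it
            arcs.append((stack[-1], dep))
    return arcs

# A made-up two-word sentence, "I ate":
print(run_transitions(["I", "ate"], ["SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]))
# -> [('ate', 'I'), ('[ROOT]', 'ate')]   (the last arc is root([ROOT] → ate))
```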
MaltParser
[Nivre and Hall 2005]
binary, sparse:  0 0 0 1 0 0 1 0 … 0 0 1 0
dim ≈ 10⁶–10⁷
Feature templates: usually a combination of 1–3 elements from the configuration.
Indicator features
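To make the “binary, sparse” point concrete, here is a rough sketch, not MaltParser’s actual feature set, of how indicator features built from 1–3 configuration elements might be generated; each distinct feature string corresponds to one of the ~10⁶–10⁷ binary dimensions that is set to 1.

```python
# Rough sketch of sparse indicator features over a parser configuration.
# Each template combines 1-3 elements (here: words on the stack and buffer);
# every feature string that fires is one dimension set to 1 in a huge,
# mostly-zero binary vector.

def indicator_features(stack, buffer):
    s1 = stack[-1] if stack else "<null>"              # top of stack
    s2 = stack[-2] if len(stack) > 1 else "<null>"     # second item on stack
    b1 = buffer[0] if buffer else "<null>"             # first word in buffer
    return {
        f"s1.w={s1}",                       # single-element template
        f"s1.w={s1}&b1.w={b1}",             # two-element template
        f"s2.w={s2}&s1.w={s1}&b1.w={b1}",   # three-element template
    }

print(indicator_features(["[ROOT]", "has", "good"], ["control", "."]))
```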
UAS = 4 / 5 = 80%
LAS = 2 / 5 = 40%

ROOT She saw the video lecture
 0    1   2   3    4      5

(columns: token index, head index, word, dependency label)
Gold                       Parsed
1  2  She      nsubj       1  2  She      nsubj
2  0  saw      root        2  0  saw      root
3  5  the      det         3  4  the      det
4  5  video    nn          4  5  video    nsubj
5  2  lecture  obj         5  2  lecture  ccomp
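A small sketch of how UAS and LAS are computed from the gold and parsed analyses above: UAS counts tokens with the correct head, LAS additionally requires the correct label.

```python
# (head index, label) for tokens 1-5 of "She saw the video lecture"
gold   = {1: (2, "nsubj"), 2: (0, "root"), 3: (5, "det"),
          4: (5, "nn"),    5: (2, "obj")}
parsed = {1: (2, "nsubj"), 2: (0, "root"), 3: (4, "det"),
          4: (5, "nsubj"), 5: (2, "ccomp")}

uas = sum(parsed[i][0] == gold[i][0] for i in gold) / len(gold)   # head only
las = sum(parsed[i] == gold[i] for i in gold) / len(gold)         # head + label
print(f"UAS = {uas:.0%}, LAS = {las:.0%}")   # UAS = 80%, LAS = 40%
```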
Handling non-projectivity
Our Approach:
learn a dense and compact feature representation
Distributed Representations
         word      POS    dep. label
s1       good      JJ     ∅
s2       has       VBZ    ∅
b1       control   NN     ∅
lc(s1)   ∅         ∅      ∅
rc(s1)   ∅         ∅      ∅
lc(s2)   He        PRP    nsubj
rc(s2)   ∅         ∅      ∅
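A minimal PyTorch sketch of this “lookup + concat” step (not Chen and Manning’s actual code): each word, POS tag, and arc label extracted from the configuration is looked up in its own embedding table, and the results are concatenated into one dense vector. All sizes and vocabulary indices below are made-up assumptions.

```python
import torch
import torch.nn as nn

d = 50                                  # embedding dimension (assumed)
word_emb  = nn.Embedding(10000, d)      # word vocabulary (size assumed); index 0 = ∅ / NULL
pos_emb   = nn.Embedding(50, d)         # POS tags
label_emb = nn.Embedding(40, d)         # dependency labels

# Made-up indices for the 7 configuration elements in the table above
# (s1, s2, b1, lc(s1), rc(s1), lc(s2), rc(s2)):
word_ids  = torch.tensor([[3, 7, 11, 0, 0, 42, 0]])   # good, has, control, ∅, ∅, He, ∅
pos_ids   = torch.tensor([[5, 9, 4, 0, 0, 12, 0]])    # JJ, VBZ, NN, ∅, ∅, PRP, ∅
label_ids = torch.tensor([[0, 0, 0, 0, 0, 6, 0]])     # ∅, ∅, ∅, ∅, ∅, nsubj, ∅

# Look up each embedding and concatenate into a single dense feature vector x
x = torch.cat([word_emb(word_ids).flatten(1),
               pos_emb(pos_ids).flatten(1),
               label_emb(label_ids).flatten(1)], dim=1)
print(x.shape)   # torch.Size([1, 1050]) = 3 feature types * 7 elements * d
```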
Model Architecture
Output layer:  y = softmax(Uh + b2)
    softmax probabilities; the cross-entropy error is back-propagated to the embeddings
Hidden layer:  h = ReLU(Wx + b1)
Input layer:   x   (lookup + concat)
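A minimal PyTorch sketch of this feed-forward architecture (hidden size and number of output classes are assumptions): the input x is the concatenated embedding vector from above, the hidden layer computes ReLU(Wx + b1), and a softmax over Uh + b2 gives transition probabilities; the cross-entropy loss back-propagates all the way into the embedding tables in the full model.

```python
import torch
import torch.nn as nn

class FeedForwardParser(nn.Module):
    def __init__(self, input_dim=1050, hidden_dim=200, n_transitions=3):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)   # W, b1
        self.out = nn.Linear(hidden_dim, n_transitions)  # U, b2

    def forward(self, x):
        h = torch.relu(self.hidden(x))   # h = ReLU(Wx + b1)
        return self.out(h)               # logits; softmax is folded into the loss below

model = FeedForwardParser()
x = torch.randn(1, 1050)                 # stand-in for the lookup+concat input vector
gold_transition = torch.tensor([0])      # e.g. SHIFT
loss = nn.CrossEntropyLoss()(model(x), gold_transition)
loss.backward()   # in the full model, these gradients also reach the embedding tables
```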
Dependency parsing for sentence structure