
Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 5: Dependency Parsing
Lecture Plan
Linguistic Structure: Dependency parsing
1. Syntactic Structure: Constituency and Dependency (25 mins)
2. Dependency Grammar and Treebanks (15 mins)
3. Transition-based dependency parsing (15 mins)
4. Neural dependency parsing (15 mins)

Reminders/comments:
Assignment 2 was due just before class :)
Assignment 3 (dependency parsing) is out today :(
Start installing and learning PyTorch (Assignment 3 has scaffolding)
Final project discussions – come meet with us; focus of week 5
1. Two views of linguistic structure:
Constituency = phrase structure grammar = context-free grammars (CFGs)
Phrase structure organizes words into nested constituents

• Starting unit: words
  the, cat, cuddly, by, door
• Words combine into phrases
  the cuddly cat, by the door
• Phrases can combine into bigger phrases
  the cuddly cat by the door
1. Two views of linguistic structure:
Constituency = phrase structure grammar = context-free grammars (CFGs)
Phrase structure organizes words into nested constituents
Can represent the grammar with CFG rules

• Starting unit: words are given a category (part of speech = POS)
  the, cat, cuddly, by, door
  Det, N, Adj, P, N
• Words combine into phrases with categories
  the cuddly cat, by the door
  NP → Det Adj N    PP → P NP
• Phrases can combine into bigger phrases recursively
  the cuddly cat by the door
  NP → NP PP
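As a small aside (not from the lecture), these rules can be written down directly and used to parse the running example. The sketch below uses NLTK and adds an assumed NP → Det N rule so that "the door", which has no adjective, also forms an NP:

```python
import nltk

# CFG rules from the slide, plus an assumed NP -> Det N rule so that
# "the door" (no adjective) can also form an NP.
grammar = nltk.CFG.fromstring("""
    NP -> Det Adj N | Det N | NP PP
    PP -> P NP
    Det -> 'the'
    Adj -> 'cuddly'
    N -> 'cat' | 'door'
    P -> 'by'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cuddly cat by the door".split()):
    print(tree)   # the single parse: [the cuddly cat] modified by the PP [by the door]
```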
Two views of linguistic structure:
Constituency = phrase structure grammar = context-free grammars (CFGs)
Phrase structure organizes words into nested constituents.

[Slide: further examples built with the grammar: noun phrases ("the cat", "a dog"), modifiers ("large", "barking", "cuddly", "large barking"), prepositional phrases ("in a crate", "on the table", "by the door"), and verbs that take these phrases as arguments ("talk to", "walked behind")]
Two views of linguistic structure:
Dependency structure
• Dependency structure shows which words depend on (modify or
are arguments of) which other words.

Look in the large crate in the kitchen by the door


Why do we need sentence structure?

• We need to understand sentence structure in order to be able to interpret language correctly
• Humans communicate complex ideas by composing words together into bigger units to convey complex meanings
• We need to know what is connected to what


Prepositional phrase attachment ambiguity

Scientists count whales from space

(Two readings: the PP "from space" can attach to the verb "count", so the counting is done from space, or to the noun "whales", which would mean whales that are from space.)


PP attachment ambiguities multiply
• A key parsing decision is how we 'attach' various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.
• The number of possible attachment structures is given by the Catalan numbers: Cn = (2n)! / ((n+1)! n!) (computed in the snippet below)
• An exponentially growing series, which arises in many tree-like contexts:
  • E.g., the number of possible triangulations of a polygon with n+2 sides
  • Turns up in triangulation of probabilistic graphical models (CS228)…
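As a quick sanity check on the formula (not part of the slides), a few lines of Python compute the first few Catalan numbers:

```python
from math import factorial

def catalan(n: int) -> int:
    """Closed-form Catalan number Cn = (2n)! / ((n+1)! n!)."""
    return factorial(2 * n) // (factorial(n + 1) * factorial(n))

# The number of tree structures grows rapidly with n:
print([catalan(n) for n in range(1, 7)])   # [1, 2, 5, 14, 42, 132]
```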
Coordination scope ambiguity

Shuttle veteran and longtime NASA executive Fred Gregory appointed to board

(Two readings: either one person, "[Shuttle veteran and longtime NASA executive] Fred Gregory", was appointed, or two people, "[Shuttle veteran] and [longtime NASA executive Fred Gregory]", were appointed.)

Coordination scope ambiguity (second example)
Adjectival Modifier Ambiguity
Verb Phrase (VP) attachment ambiguity
Dependency paths identify semantic relations – e.g., for protein interaction
[Erkan et al. EMNLP 07, Fundel et al. 2007, etc.]

[Slide: dependency parse of "The results demonstrated that KaiC interacts rhythmically with SasA, KaiA and KaiB", with arcs such as nsubj, ccomp, mark, advmod, nmod:with, conj:and, cc]

Paths extracted from the parse:
KaiC ←nsubj interacts nmod:with→ SasA
KaiC ←nsubj interacts nmod:with→ SasA conj:and→ KaiA
KaiC ←nsubj interacts nmod:with→ SasA conj:and→ KaiB

2. Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations ("arrows") called dependencies

[Slide: dependency tree for "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas", rooted at "submitted"]
Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations ("arrows") called dependencies

The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.)

[Slide: the same tree with typed arcs: nsubj:pass, aux, obl, nmod, case, cc, conj, appos, flat]
Dependency Grammar and Dependency Structure

Dependency syntax postulates that syntactic structure consists of relations between lexical items, normally binary asymmetric relations ("arrows") called dependencies

The arrow connects a head (governor, superior, regent) with a dependent (modifier, inferior, subordinate)

Usually, dependencies form a tree (connected, acyclic, single-head)

[Slide: the same typed dependency tree]
Pāṇini's grammar (c. 5th century BCE)

Gallery: http://wellcomeimages.org/indexplus/image/L0032691.html
CC BY 4.0 File: Birch bark MS from Kashmir of the Rupavatra, Wellcome L0032691.jpg
Dependency Grammar/Parsing History

• The idea of dependency structure goes back a long way
  • To Pāṇini's grammar (c. 5th century BCE)
  • Basic approach of 1st millennium Arabic grammarians
• Constituency/context-free grammar is a new-fangled invention
  • 20th century invention (R.S. Wells, 1947; then Chomsky)
• Modern dependency work is often sourced to L. Tesnière (1959)
  • Was the dominant approach in the "East" in the 20th century (Russia, China, …)
  • Good for freer word order languages
• Among the earliest kinds of parsers in NLP, even in the US:
  • David Hays, one of the founders of U.S. computational linguistics, built an early (first?) dependency parser (Hays 1962)
Dependency Grammar and Dependency Structure

ROOT Discussion of the outstanding issues was completed .

• Some people draw the arrows one way; some the other way!
  • Tesnière had them point from head to dependent…
• Usually add a fake ROOT so every word is a dependent of precisely 1 other node
The rise of annotated data: Universal Dependencies treebanks
[Universal Dependencies: http://universaldependencies.org/ ;
Earlier: Marcus et al. 1993, The Penn Treebank, Computational Linguistics]
The rise of annotated data

Starting off, building a treebank seems a lot slower and less useful than building a grammar

But a treebank gives us many things:
• Reusability of the labor
  • Many parsers, part-of-speech taggers, etc. can be built on it
  • Valuable resource for linguistics
• Broad coverage, not just a few intuitions
• Frequencies and distributional information
• A way to evaluate systems
Dependency Conditioning Preferences

What are the sources of information for dependency parsing?
1. Bilexical affinities: the dependency [discussion → issues] is plausible
2. Dependency distance: most dependencies are between nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?

ROOT Discussion of the outstanding issues was completed .
Dependency Parsing

• A sentence is parsed by choosing for each word what other word (including ROOT) it is a dependent of
• Usually some constraints:
  • Only one word is a dependent of ROOT
  • Don't want cycles A → B, B → A
• This makes the dependencies a tree
• Final issue is whether arrows can cross (be non-projective) or not

ROOT I 'll give a talk tomorrow on bootstrapping
Projectivity

• Defn: There are no crossing dependency arcs when the words are laid out in their linear order, with all arcs above the words (a code check of this condition is sketched below)
• Dependencies parallel to a CFG tree must be projective
  • Forming dependencies by taking 1 child of each category as head
• But dependency theory normally does allow non-projective structures to account for displaced constituents
  • You can't easily get the semantics of certain constructions right without these non-projective dependencies

Who did Bill buy the coffee from yesterday ?
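A minimal sketch (not from the lecture) of checking the definition above: a dependency tree, given as a list of head indices, is projective iff no two arcs cross.

```python
def is_projective(heads):
    """heads[i] is the head of word i+1 (words are 1-indexed, 0 = ROOT)."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            # Two arcs cross iff exactly one endpoint of one lies strictly inside the other
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

print(is_projective([2, 0, 2]))      # "I ate fish" (I <- ate, fish <- ate): True
print(is_projective([3, 4, 0, 3]))   # word 2 attaches to word 4 across the 1-3 arc: False
```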


Methods of Dependency Parsing

1. Dynamic programming
   Eisner (1996) gives a clever algorithm with complexity O(n³), by producing parse items with heads at the ends rather than in the middle
2. Graph algorithms
   You create a Minimum Spanning Tree for a sentence
   McDonald et al.'s (2005) MSTParser scores dependencies independently using an ML classifier (he uses MIRA, for online learning, but it can be something else)
   Neural graph-based parser: Dozat and Manning (2017)
3. Constraint Satisfaction
   Edges are eliminated that don't satisfy hard constraints. Karlsson (1990), etc.
4. "Transition-based parsing" or "deterministic dependency parsing"
   Greedy choice of attachments guided by good machine learning classifiers
   E.g., MaltParser (Nivre et al. 2008). Has proven highly effective.
3. Greedy transition-based parsing [Nivre 2003]

• A simple form of greedy discriminative dependency parser
• The parser does a sequence of bottom-up actions
  • Roughly like "shift" or "reduce" in a shift-reduce parser, but the "reduce" actions are specialized to create dependencies with head on left or right
• The parser has:
  • a stack σ, written with top to the right
    • which starts with the ROOT symbol
  • a buffer β, written with top to the left
    • which starts with the input sentence
  • a set of dependency arcs A
    • which starts off empty
  • a set of actions
Basic transition-based dependency parser

Start: σ = [ROOT], β = w1, …, wn, A = ∅
1. Shift:       σ, wi|β, A  ⇒  σ|wi, β, A
2. Left-Arcᵣ:   σ|wi|wj, β, A  ⇒  σ|wj, β, A ∪ {r(wj, wi)}
3. Right-Arcᵣ:  σ|wi|wj, β, A  ⇒  σ|wi, β, A ∪ {r(wi, wj)}
Finish: σ = [w], β = ∅
Arc-standard transition-based parser
(there are other transition schemes …)

Analysis of "I ate fish":

Start:      σ = [ROOT],            β = [I, ate, fish],  A = ∅
Shift:      σ = [ROOT, I],         β = [ate, fish]
Shift:      σ = [ROOT, I, ate],    β = [fish]
Left-Arc:   σ = [ROOT, ate],       β = [fish],          A += nsubj(ate → I)
Shift:      σ = [ROOT, ate, fish], β = []
Right-Arc:  σ = [ROOT, ate],       β = [],              A += obj(ate → fish)
Right-Arc:  σ = [ROOT],            β = [],              A += root([ROOT] → ate)
Finish
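A self-contained sketch of this transition sequence in Python (a toy illustration, not the assignment's code); the action list is the oracle sequence from the walkthrough above:

```python
def arc_standard_parse(words, actions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act.startswith("LEFT-ARC"):           # head = top of stack, dependent = second
            label = act.split(":")[1]
            dep = stack.pop(-2)
            arcs.append((label, stack[-1], dep))
        elif act.startswith("RIGHT-ARC"):          # head = second on stack, dependent = top
            label = act.split(":")[1]
            dep = stack.pop()
            arcs.append((label, stack[-1], dep))
    assert stack == ["ROOT"] and not buffer        # Finish condition
    return arcs

actions = ["SHIFT", "SHIFT", "LEFT-ARC:nsubj", "SHIFT",
           "RIGHT-ARC:obj", "RIGHT-ARC:root"]
print(arc_standard_parse(["I", "ate", "fish"], actions))
# [('nsubj', 'ate', 'I'), ('obj', 'ate', 'fish'), ('root', 'ROOT', 'ate')]
```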
MaltParser [Nivre and Hall 2005]

• We have yet to explain how we choose the next action
  • Answer: Stand back, I know machine learning!
• Each action is predicted by a discriminative classifier (e.g., softmax classifier) over each legal move
  • Max of 3 untyped choices; max of |R| × 2 + 1 when typed
  • Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
  • But you can profitably do a beam search if you wish (slower but better): you keep k good parse prefixes at each time step
• The model's accuracy is fractionally below the state of the art in dependency parsing, but
• It provides very fast linear time parsing, with great performance
Conventional Feature Representation

binary, sparse vector: 0 0 0 1 0 0 1 0 … 0 0 1 0
dim = 10⁶ ~ 10⁷

Feature templates: usually a combination of 1 ~ 3 elements from the configuration.

Indicator features
Evaluation of Dependency Parsing: (labeled) dependency accuracy

Acc = (# correct deps) / (# of deps)

ROOT She saw the video lecture
 0    1   2   3    4     5

UAS = 4 / 5 = 80%
LAS = 2 / 5 = 40%

Gold:                        Parsed:
1  2  She      nsubj         1  2  She      nsubj
2  0  saw      root          2  0  saw      root
3  5  the      det           3  4  the      det
4  5  video    nn            4  5  video    nsubj
5  2  lecture  obj           5  2  lecture  ccomp
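A few lines of Python (a sketch, not the standard CoNLL evaluator) reproduce the UAS/LAS numbers for this example; each entry is a (head, label) pair:

```python
gold   = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"),    (2, "obj")]
parsed = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]

uas = sum(g[0] == p[0] for g, p in zip(gold, parsed)) / len(gold)   # head correct
las = sum(g == p for g, p in zip(gold, parsed)) / len(gold)         # head and label correct
print(f"UAS = {uas:.0%}, LAS = {las:.0%}")   # UAS = 80%, LAS = 40%
```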
Handling non-projectivity

• The arc-standard algorithm we presented only builds projective dependency trees
• Possible directions to head:
  1. Just declare defeat on non-projective arcs
  2. Use a dependency formalism which only has projective representations
     • A CFG only allows projective structures; you promote the head of violations
  3. Use a postprocessor to a projective dependency parsing algorithm to identify and resolve non-projective links
  4. Add extra transitions that can model at least most non-projective structures (e.g., add an extra SWAP transition, cf. bubble sort)
  5. Move to a parsing mechanism that does not use or require any constraints on projectivity (e.g., the graph-based MSTParser)
4. Why train a neural dependency parser? Indicator Features Revisited

• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive computation
  • More than 95% of parsing time is consumed by feature computation.

Our Approach: learn a dense and compact feature representation
dense vector: 0.1 0.9 −0.2 0.3 … −0.1 −0.5
dim = ~1000
A neural dependency parser [Chen and Manning 2014]

• English parsing to Stanford Dependencies:
  • Unlabeled attachment score (UAS) = head
  • Labeled attachment score (LAS) = head and label

Parser        UAS   LAS   sent. / s
MaltParser    89.8  87.2  469
MSTParser     91.4  88.1  10
TurboParser   92.3  89.6  8
C & M 2014    92.0  89.7  654
Distributed Representations

• We represent each word as a d-dimensional dense vector (i.e., word embedding)
  • Similar words are expected to have close vectors.
• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as d-dimensional vectors.
  • The smaller discrete sets also exhibit many semantic similarities:
    • NNS (plural noun) should be close to NN (singular noun).
    • num (numerical modifier) should be close to amod (adjective modifier).

[Slide: embedding-space illustration with nearby words such as was/were, is, good, come/go]
Extracting Tokens and then vector representations from configuration

• We extract a set of tokens based on the stack / buffer positions:

          word     POS   dep.
  s1      good     JJ    ∅
  s2      has      VBZ   ∅
  b1      control  NN    ∅
  lc(s1)  ∅        ∅     ∅
  rc(s1)  ∅        ∅     ∅
  lc(s2)  He       PRP   nsubj
  rc(s2)  ∅        ∅     ∅
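A toy sketch of this extraction step (the function name and the reduced 7-token feature set are illustrative; Chen and Manning's parser extracts a larger set including more stack/buffer items and grandchildren, plus POS tags and arc labels):

```python
NULL = "<NULL>"   # padding symbol for missing positions

def extract_tokens(stack, buffer, left_child, right_child):
    """Return the words at s1, s2, b1, lc(s1), rc(s1), lc(s2), rc(s2)."""
    s1 = stack[-1] if len(stack) >= 1 else None
    s2 = stack[-2] if len(stack) >= 2 else None
    b1 = buffer[0] if buffer else None
    feats = [s1, s2, b1,
             left_child.get(s1), right_child.get(s1),
             left_child.get(s2), right_child.get(s2)]
    return [w if w is not None else NULL for w in feats]

# Configuration from the slide (sentence "He has good control ."): "He" has
# already been attached as the leftmost child of "has"; "good" is on top of
# the stack and "control" is next in the buffer.
print(extract_tokens(["ROOT", "has", "good"], ["control", "."],
                     left_child={"has": "He"}, right_child={}))
# ['good', 'has', 'control', '<NULL>', '<NULL>', 'He', '<NULL>']
```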
Model Architecture

Output layer: y = softmax(Uh + b2)
  Softmax probabilities; the cross-entropy error will be back-propagated to the embeddings.
Hidden layer: h = ReLU(Wx + b1)
Input layer: x = lookup + concat
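A minimal PyTorch sketch of this architecture (the class name, the 48-feature input, and the layer sizes are illustrative assumptions rather than the paper's exact hyperparameters; the published model also used a cube activation rather than ReLU):

```python
import torch
import torch.nn as nn

class NeuralDepParser(nn.Module):
    def __init__(self, vocab_size, n_features=48, embed_dim=50,
                 hidden_dim=200, n_transitions=79):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # word/POS/label embeddings
        self.hidden = nn.Linear(n_features * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, n_transitions)

    def forward(self, feature_ids):                # (batch, n_features) token ids
        x = self.embed(feature_ids).flatten(1)     # input layer: lookup + concat
        h = torch.relu(self.hidden(x))             # hidden layer: h = ReLU(Wx + b1)
        return self.output(h)                      # logits; softmax is folded into the loss

model = NeuralDepParser(vocab_size=10_000)
feature_ids = torch.randint(0, 10_000, (32, 48))   # a dummy batch of configurations
logits = model(feature_ids)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 79, (32,)))
loss.backward()   # cross-entropy error is back-propagated down to the embeddings
```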
Dependency parsing for sentence structure

• Neural networks can accurately determine the structure of sentences, supporting interpretation
• Chen and Manning (2014) was the first simple, successful neural dependency parser
• The dense representations let it outperform other greedy parsers in both accuracy and speed
Further developments in transition-based neural dependency parsing

This work was further developed and improved by others, including in particular at Google:
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence

Leading to SyntaxNet and the Parsey McParseFace model
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

Method                UAS    LAS (PTB WSJ SD 3.3)
Chen & Manning 2014   92.0   89.7
Weiss et al. 2015     93.99  92.05
Andor et al. 2016     94.61  92.79
Graph-based dependency parsers

• Compute a score for every possible dependency for each word
  • Doing this well requires good "contextual" representations of each word token, which we will develop in coming lectures

[Slide: candidate head scores for "big" in "ROOT The big cat sat" (e.g., 0.5, 0.3, 0.8, 2.0), i.e., picking the head for "big"]


Graph-based dependency parsers

• Compute a score for every possible dependency for each word
• Then add an edge from each word to its highest-scoring candidate head
• And repeat the same process for each other word

[Slide: the same example, picking the head for "big" in "ROOT The big cat sat"]
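A toy sketch of this scoring-and-selection idea (the scores below are made up for illustration):

```python
import numpy as np

# Row d-1 holds the candidate-head scores for word d; column h is head h (0 = ROOT).
words = ["ROOT", "The", "big", "cat", "sat"]
scores = np.array([
    [0.1, 0.0, 0.2, 3.1, 0.4],   # The  -> best head: cat
    [0.5, 0.3, 0.0, 2.0, 0.8],   # big  -> best head: cat
    [0.2, 0.1, 0.3, 0.0, 2.7],   # cat  -> best head: sat
    [3.5, 0.2, 0.1, 0.9, 0.0],   # sat  -> best head: ROOT
])
for d, h in enumerate(scores.argmax(axis=1), start=1):
    print(f"{words[d]} <- head {words[h]}")

# Note: a plain per-word argmax is not guaranteed to form a tree; real graph-based
# parsers decode a maximum spanning tree (e.g., with the Chu-Liu/Edmonds algorithm).
```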


A Neural graph-based dependency parser
[Dozat and Manning 2017; Dozat, Qi, and Manning 2017]

• Revived graph-based dependency parsing in a neural world
  • Design a biaffine scoring model for neural dependency parsing
  • Also using a neural sequence model, as we discuss next week
• Really great results!
  • But slower than simple neural transition-based parsers
  • There are n² possible dependencies in a sentence of length n

Method                 UAS    LAS (PTB WSJ SD 3.3)
Chen & Manning 2014    92.0   89.7
Weiss et al. 2015      93.99  92.05
Andor et al. 2016      94.61  92.79
Dozat & Manning 2017   95.74  94.08
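For intuition, here is a minimal PyTorch sketch of a biaffine arc scorer over contextual word vectors (the class name and dimensions are illustrative; the actual parser also applies separate head/dependent MLPs and a second biaffine classifier for labels):

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """score(head i, dependent j) = h_i^T U h_j + b^T h_i  (a simplified form)."""
    def __init__(self, dim):
        super().__init__()
        self.U = nn.Parameter(torch.empty(dim, dim))
        self.b = nn.Parameter(torch.zeros(dim))
        nn.init.xavier_uniform_(self.U)

    def forward(self, head_repr, dep_repr):
        # head_repr, dep_repr: (batch, n, dim) contextual encodings (e.g., from a BiLSTM)
        bilinear = torch.einsum("bid,de,bje->bji", head_repr, self.U, dep_repr)
        bias = head_repr @ self.b                  # (batch, n) head-only bias term
        return bilinear + bias.unsqueeze(1)        # (batch, n_dependents, n_heads)

scorer = BiaffineArcScorer(dim=128)
h = torch.randn(2, 5, 128)                         # pretend encodings for 5 tokens
arc_scores = scorer(h, h)                          # all n^2 candidate arcs at once
predicted_heads = arc_scores.argmax(dim=-1)        # greedy head choice per dependent
```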
