A Fast and Accurate Dependency Parser Using Neural Networks
Danqi Chen
Computer Science Department
Stanford University
[email protected]
Christopher D. Manning
Computer Science Department
Stanford University
[email protected]

Abstract
Almost all current dependency parsers classify based on millions of sparse indicator features. Not only do these features generalize poorly, but the cost of feature computation restricts parsing speed significantly. In this work, we propose a novel way of learning a neural network classifier for use in a greedy, transition-based dependency parser. Because this classifier learns and uses just a small number of dense features, it can work very fast, while achieving an improvement of about 2% in unlabeled and labeled attachment scores on both English and Chinese datasets. Concretely, our parser is able to parse more than 1000 sentences per second at 92.2% unlabeled attachment score on the English Penn Treebank.
Introduction
Transition-based dependency parsing aims to predict a transition sequence from an initial configuration to some terminal configuration, which derives a target dependency parse tree, as shown in Figure 1. In this paper, we examine only greedy parsing, which uses a classifier to predict the correct transition based on features extracted from the configuration. This class of parsers is of great interest because of their efficiency, although they tend to perform slightly worse than search-based parsers because of subsequent error propagation. However, our greedy parser achieves comparable accuracy at a very good speed.
As the basis of our parser, we employ the arc-standard system (Nivre, 2004), one of the most popular transition systems. In the arc-standard system, a configuration $c = (s, b, A)$ consists of a stack $s$, a buffer $b$, and a set of dependency arcs $A$. The initial configuration for a sentence $w_1, \ldots, w_n$ is $s = [\mathrm{ROOT}]$, $b = [w_1, \ldots, w_n]$, $A = \emptyset$. A configuration $c$ is terminal if the buffer is empty and the stack contains the single node ROOT, and the parse tree is given by $A_c$. Denoting $s_i$ ($i = 1, 2, \ldots$) as the $i$th top element on the stack, and $b_i$ ($i = 1, 2, \ldots$) as the $i$th element on the buffer, the arc-standard system defines three types of transitions:

LEFT-ARC($l$): adds an arc $s_1 \rightarrow s_2$ with label $l$ and removes $s_2$ from the stack. Precondition: $|s| \ge 2$.

RIGHT-ARC($l$): adds an arc $s_2 \rightarrow s_1$ with label $l$ and removes $s_1$ from the stack. Precondition: $|s| \ge 2$.

SHIFT: moves $b_1$ from the buffer to the stack. Precondition: $|b| \ge 1$.
In the labeled version of parsing, there are in total $|T| = 2 N_l + 1$ transitions, where $N_l$ is the number of different arc labels. Figure 1 illustrates an example of one transition sequence from the initial configuration to a terminal one.
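To make the mechanics of these transitions concrete, here is a minimal Python sketch of an arc-standard configuration; the class and method names are our own illustration, not an API defined in the paper.

```python
# Minimal sketch of the arc-standard transition system described above.
# Names and representation choices are illustrative only.

class Configuration:
    def __init__(self, words):
        self.stack = ["ROOT"]          # s1 is the top (last element)
        self.buffer = list(words)      # b1 is the first element
        self.arcs = []                 # set A of (head, label, dependent) triples

    def shift(self):                   # SHIFT: precondition |b| >= 1
        assert len(self.buffer) >= 1
        self.stack.append(self.buffer.pop(0))

    def left_arc(self, label):         # LEFT-ARC(l): adds s1 -> s2, removes s2
        assert len(self.stack) >= 2
        s1, s2 = self.stack[-1], self.stack[-2]
        self.arcs.append((s1, label, s2))
        del self.stack[-2]

    def right_arc(self, label):        # RIGHT-ARC(l): adds s2 -> s1, removes s1
        assert len(self.stack) >= 2
        s1, s2 = self.stack[-1], self.stack[-2]
        self.arcs.append((s2, label, s1))
        self.stack.pop()

    def is_terminal(self):
        return len(self.buffer) == 0 and self.stack == ["ROOT"]


# Replaying the first transitions of Figure 1:
c = Configuration(["He", "has", "good", "control", "."])
c.shift(); c.shift(); c.left_arc("nsubj")          # stack: [ROOT has]
c.shift(); c.shift(); c.left_arc("amod")           # stack: [ROOT has control]
c.right_arc("dobj")                                # stack: [ROOT has]
```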
The essential goal of a greedy parser is to predict a correct transition from $T$, based on one given configuration.
Transition          Stack                      Buffer                    A
                    [ROOT]                     [He has good control .]   ∅
SHIFT               [ROOT He]                  [has good control .]
SHIFT               [ROOT He has]              [good control .]
LEFT-ARC(nsubj)     [ROOT has]                 [good control .]          A ∪ nsubj(has,He)
SHIFT               [ROOT has good]            [control .]
SHIFT               [ROOT has good control]    [.]
LEFT-ARC(amod)      [ROOT has control]         [.]                       A ∪ amod(control,good)
RIGHT-ARC(dobj)     [ROOT has]                 [.]                       A ∪ dobj(has,control)
...                 ...                        ...                       ...
RIGHT-ARC(root)     [ROOT]                     []                        A ∪ root(ROOT,has)

Figure 1: An example of transition-based dependency parsing. Above left: a desired dependency tree; above right: an intermediate configuration; bottom: a transition sequence of the arc-standard system.
Features                              UAS
All features in Table 1               88.0
single-word & word-pair features      82.7
only single-word features             76.9
excluding all lexicalized features    81.5
Model
Figure 2 describes our neural network architecture. First, as usual in word embedding models, we represent each word as a $d$-dimensional vector $e^w_i \in \mathbb{R}^d$, and the full embedding matrix is $E^w \in \mathbb{R}^{d \times N_w}$, where $N_w$ is the dictionary size. Meanwhile, we also map POS tags and arc labels to a $d$-dimensional vector space, where $e^t_i, e^l_j \in \mathbb{R}^d$ are the representations of the $i$th POS tag and $j$th arc label respectively. Correspondingly, the POS and label embedding matrices are $E^t \in \mathbb{R}^{d \times N_t}$ and $E^l \in \mathbb{R}^{d \times N_l}$, where $N_t$ and $N_l$ are the number of distinct POS tags and arc labels.
Figure 2: Our neural network architecture. (Input layer: word, POS tag, and arc label embeddings chosen from the stack and buffer of the current configuration; hidden layer: $h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3$; softmax layer: $p = \mathrm{softmax}(W_2 h)$.)

We choose a set of elements based on the stack / buffer positions for each type of information (word, POS, or label) that might be useful for our predictions. We denote the sets as $S^w$, $S^t$, $S^l$ respectively. For example, given the configuration in Figure 2 and $S^t = \{lc_1(s_2).t,\ s_2.t,\ rc_1(s_2).t,\ s_1.t\}$, we will extract PRP, VBZ, NULL, JJ in order. Here we use a special token NULL to represent a non-existent element.
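As an illustration of this element selection, the following Python sketch (with our own token representation and helper functions, assuming the intermediate configuration of Figure 1 with stack [ROOT has good], buffer [control .], and the arc nsubj(has, He)) extracts the POS-tag elements of $S^t$, substituting NULL for non-existent elements:

```python
from collections import namedtuple

# Illustrative sketch of extracting the POS-tag elements in S^t for one configuration.
# Token layout and helpers are our own; the paper only specifies which positions
# (s1, s2, their leftmost/rightmost children, etc.) are read.
Token = namedtuple("Token", ["index", "word", "tag"])
NULL = Token(-1, "<NULL>", "<NULL>")   # special token for non-existent elements

def leftmost_child(arcs, head, tokens):
    """First (leftmost) child to the left of `head`, or NULL if none exists."""
    deps = [d for (h, d) in arcs if h == head.index and d < head.index]
    return tokens[min(deps)] if deps else NULL

def rightmost_child(arcs, head, tokens):
    """First (rightmost) child to the right of `head`, or NULL if none exists."""
    deps = [d for (h, d) in arcs if h == head.index and d > head.index]
    return tokens[max(deps)] if deps else NULL

tokens = [Token(0, "He", "PRP"), Token(1, "has", "VBZ"),
          Token(2, "good", "JJ"), Token(3, "control", "NN"), Token(4, ".", ".")]
arcs = [(1, 0)]                        # nsubj(has, He) as (head index, dependent index)
s1, s2 = tokens[2], tokens[1]          # stack top s1 = good/JJ, s2 = has/VBZ

# S^t = {lc1(s2).t, s2.t, rc1(s2).t, s1.t}
s_t = [leftmost_child(arcs, s2, tokens).tag, s2.tag,
       rightmost_child(arcs, s2, tokens).tag, s1.tag]
print(s_t)   # ['PRP', 'VBZ', '<NULL>', 'JJ']
```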
We build a standard neural network with one hidden layer, where the corresponding embeddings of our chosen elements from $S^w$, $S^t$, $S^l$ are added to the input layer. Denoting $n_w$, $n_t$, $n_l$ as the number of chosen elements of each type, we add $x^w = [e^w_{w_1}; e^w_{w_2}; \ldots; e^w_{w_{n_w}}]$ to the input layer, where $S^w = \{w_1, \ldots, w_{n_w}\}$. Similarly, we add the POS tag features $x^t$ and arc label features $x^l$ to the input layer.
We map the input layer to a hidden layer with $d_h$ nodes through a cube activation function:
$$h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3$$
where $W_1^w \in \mathbb{R}^{d_h \times (d \cdot n_w)}$, $W_1^t \in \mathbb{R}^{d_h \times (d \cdot n_t)}$, $W_1^l \in \mathbb{R}^{d_h \times (d \cdot n_l)}$, and $b_1 \in \mathbb{R}^{d_h}$ is the bias.
A softmax layer is finally added on top of the hidden layer for modeling multi-class probabilities $p = \mathrm{softmax}(W_2 h)$, where $W_2 \in \mathbb{R}^{|T| \times d_h}$.
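The following numpy sketch spells out the shape bookkeeping of this architecture for a single configuration; all dimensions follow the definitions above, but the element counts and random values are placeholders rather than the settings used in the experiments:

```python
import numpy as np

# Shapes follow the text: d-dimensional embeddings, n_w / n_t / n_l chosen elements,
# hidden size d_h, and |T| = 2 * N_l + 1 transitions. Sizes here are placeholders.
d, d_h = 50, 200
n_w, n_t, n_l = 18, 18, 12
N_w, N_t, N_l = 10000, 45, 45
T = 2 * N_l + 1

rng = np.random.default_rng(0)
E_w = rng.normal(scale=0.01, size=(N_w, d))      # word embeddings (one row per word)
E_t = rng.normal(scale=0.01, size=(N_t, d))      # POS tag embeddings
E_l = rng.normal(scale=0.01, size=(N_l, d))      # arc label embeddings
W1_w = rng.normal(scale=0.01, size=(d_h, d * n_w))
W1_t = rng.normal(scale=0.01, size=(d_h, d * n_t))
W1_l = rng.normal(scale=0.01, size=(d_h, d * n_l))
b1 = np.zeros(d_h)
W2 = rng.normal(scale=0.01, size=(T, d_h))

def forward(word_ids, tag_ids, label_ids):
    """Compute p = softmax(W2 h) with the cube activation h = (W1 x + b1)^3."""
    x_w = E_w[word_ids].reshape(-1)              # concatenated word embeddings
    x_t = E_t[tag_ids].reshape(-1)               # concatenated POS tag embeddings
    x_l = E_l[label_ids].reshape(-1)             # concatenated arc label embeddings
    h = (W1_w @ x_w + W1_t @ x_t + W1_l @ x_l + b1) ** 3
    scores = W2 @ h
    scores -= scores.max()                       # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return p

p = forward(rng.integers(N_w, size=n_w),
            rng.integers(N_t, size=n_t),
            rng.integers(N_l, size=n_l))
```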
POS and label embeddings
To the best of our knowledge, this is the first attempt to introduce POS tag and arc label embeddings instead of discrete representations.
Although the POS tags $P = \{$NN, NNP, NNS, DT, JJ, $\ldots\}$ (for English) and arc labels $L = \{$amod, tmod, nsubj, csubj, dobj, $\ldots\}$ (for Stanford Dependencies on English) are relatively small discrete sets, they still exhibit many semantic similarities like words. For example, NN (singular noun) should be closer to NNS (plural noun) than to DT (determiner).
Figure 3: Comparison of activation functions: cube, sigmoid, tanh, identity.
Training
We train the classifier to minimize the cross-entropy loss of the correct transitions $t_i$, plus an $\ell_2$-regularization term:
$$L(\theta) = -\sum_i \log p_{t_i} + \frac{\lambda}{2} \|\theta\|^2$$
285,791, coverage = 79.0%). We will also compare with random initialization of $E^w$ in Section 4. The training error derivatives will be back-propagated to these embeddings during the training process.
We use mini-batched AdaGrad (Duchi et al., 2011) for optimization and also apply dropout (Hinton et al., 2012) with a rate of 0.5. The parameters that achieve the best unlabeled attachment score on the development set are chosen for final evaluation.
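As a rough sketch of this optimization recipe (our own simplification, applied to a single parameter matrix rather than the full parser), one mini-batched AdaGrad step with $\ell_2$ regularization, together with dropout on the hidden layer, could look like:

```python
import numpy as np

# Simplified sketch of AdaGrad with L2 regularization and dropout.
# The real parser updates all parameters (W1, b1, W2, and the embedding
# matrices) the same way; shapes and values here are placeholders.
rng = np.random.default_rng(0)
alpha, lam, dropout_rate = 0.01, 1e-8, 0.5

W = rng.normal(scale=0.01, size=(200, 900))     # some parameter matrix
G = np.zeros_like(W)                            # accumulated squared gradients

def adagrad_step(W, G, grad):
    grad = grad + lam * W                       # gradient of (lambda/2) * ||theta||^2
    G += grad ** 2
    W -= alpha * grad / (np.sqrt(G) + 1e-6)     # per-coordinate AdaGrad update
    return W, G

grad = rng.normal(scale=0.01, size=W.shape)     # stand-in for a mini-batch gradient
W, G = adagrad_step(W, G, grad)

# Dropout on the hidden layer during training: zero out units with probability 0.5.
h = rng.normal(size=200)
mask = rng.random(200) >= dropout_rate
h_dropped = h * mask
```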
3.3 Parsing

4 Experiments

4.1 Datasets
Dataset    #Train   #Dev   #Test   #words (Nw)   #POS (Nt)   #labels (Nl)   projective (%)
PTB: CD    39,832   1,700  2,416   44,352        45          17             99.4
PTB: SD    39,832   1,700  2,416   44,389        45          45             99.9
CTB        16,091   803    1,910   34,577        35          12             100.0

Table 3: Data statistics. "Projective" is the percentage of projective trees on the training set.
and Nugues, 2007) using the LTH Constituent-to-Dependency Conversion Tool3 and Stanford Basic Dependencies (SD) (de Marneffe et al., 2006) using the Stanford parser v3.3.0.4 The POS tags are assigned using the Stanford POS tagger (Toutanova et al., 2003) with ten-way jackknifing of the training data (accuracy ≈ 97.3%).
For Chinese, we adopt the same split of CTB5 as described in (Zhang and Clark, 2008). Dependencies are converted using the Penn2Malt tool5 with the head-finding rules of (Zhang and Clark, 2008). Following (Zhang and Clark, 2008; Zhang and Nivre, 2011), we use gold segmentation and POS tags for the input.
Table 3 gives statistics of the three datasets.6 In particular, over 99% of the trees are projective in all datasets.
4.2 Results
The following hyper-parameters are used in all experiments: embedding size $d = 50$, hidden layer size $d_h = 200$, regularization parameter $\lambda = 10^{-8}$, and initial AdaGrad learning rate $\alpha = 0.01$.
To situate the performance of our parser, we first make a comparison with our own implementations of greedy arc-eager and arc-standard parsers. These parsers are trained with the structured averaged perceptron using the early-update strategy. The feature templates of (Zhang and Nivre, 2011) are used for the arc-eager system, and they are also adapted to the arc-standard system.7
Furthermore, we also compare our parser with two popular, off-the-shelf parsers: MaltParser, a greedy transition-based dependency parser (Nivre et al., 2006),8 and MSTParser, a graph-based dependency parser.
3 http://nlp.cs.lth.se/software/treebank_converter/
4 http://nlp.stanford.edu/software/lex-parser.shtml
5 http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
6 Pennconverter and Stanford dependencies generate slightly different tokenization, e.g., Pennconverter splits the token WCRS\/Boston NNP into three tokens WCRS NNP / CC Boston NNP.
7 Since arc-standard is bottom-up, we remove all features using the head of stack elements, and also add the right child features of the first stack element.
8 http://www.maltparser.org/
Results on PTB: CD
Parser        Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard      89.9     88.7     89.7      88.3      51
eager         90.3     89.2     89.9      88.6      63
Malt:sp       90.0     88.8     89.9      88.5      560
Malt:eager    90.1     88.9     90.1      88.7      535
MSTParser     92.1     90.8     92.0      90.5      12
Our parser    92.2     91.0     92.0      90.7      1013

Results on PTB: SD
Parser        Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard      90.2     87.8     89.4      87.3      26
eager         89.8     87.4     89.6      87.4      34
Malt:sp       89.8     87.2     89.3      86.9      469
Malt:eager    89.6     86.9     89.4      86.8      448
MSTParser     91.4     88.1     90.7      87.6      10
Our parser    92.0     89.7     91.8      89.6      654

Results on CTB
Parser        Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard      82.4     80.9     82.7      81.2      72
eager         81.1     79.7     80.3      78.7      80
Malt:sp       82.4     80.5     82.4      80.6      420
Malt:eager    81.2     79.3     80.2      78.4      393
MSTParser     84.0     82.1     83.0      81.2      6
Our parser    84.0     82.4     83.9      82.4      936
Model Analysis
Last but not least, we will examine the parameters we have learned, and hope to investigate what
these dense features capture. We use the weights
learned from the English Penn Treebank using
Stanford dependencies for analysis.
What do $E^t$, $E^l$ capture?
We first introduced $E^t$ and $E^l$ as dense representations of all POS tags and arc labels, and we wonder whether these embeddings carry some semantic information.
Figure 5 presents t-SNE visualizations (van der Maaten and Hinton, 2008) of these embeddings. It clearly shows that these embeddings effectively exhibit the similarities between POS tags or arc labels. For instance, the three adjective POS tags JJ, JJR, JJS have very close embeddings, and the three labels representing clausal complements, acomp, ccomp, xcomp, are grouped together.
Since these embeddings can effectively encode semantic regularities, we believe that they can also be used as alternative features for POS tags (or arc labels) in other NLP tasks, and help boost performance.
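A visualization of this kind can be produced with an off-the-shelf t-SNE implementation such as scikit-learn's; the sketch below uses placeholder embeddings and is not the exact setup behind Figure 5:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project POS-tag embeddings E^t (random placeholders here, in place of the learned
# ones) to 2D with t-SNE and label each point; the same recipe applies to E^l.
pos_tags = ["NN", "NNS", "NNP", "JJ", "JJR", "JJS", "VB", "VBD", "VBZ", "DT"]
E_t = np.random.default_rng(0).normal(size=(len(pos_tags), 50))

coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(E_t)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), tag in zip(coords, pos_tags):
    plt.annotate(tag, (x, y))
plt.savefig("pos_tag_tsne.png")
```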
Knowing that $E^t$ and $E^l$ (as well as the word embeddings $E^w$) can capture semantic information very well, next we hope to investigate what each feature in the hidden layer has really learned.
Since we currently only have $d_h = 200$ learned dense features, we wonder whether they are sufficient to capture the word conjunctions used as sparse indicator features, or even more. We examine the weights $W_1^w(k, \cdot) \in \mathbb{R}^{d \cdot n_w}$, $W_1^t(k, \cdot) \in \mathbb{R}^{d \cdot n_t}$, $W_1^l(k, \cdot) \in \mathbb{R}^{d \cdot n_l}$ for each hidden unit $k$, and reshape them to $d \times n_w$, $d \times n_t$, $d \times n_l$ matrices, such that each column corresponds to the embedding of one specific element (e.g., $s_1.t$).
We pick the weights with absolute value > 0.2, and visualize them for each feature. Figure 6 gives the visualization of three sampled features, and it exhibits many interesting phenomena.
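The reshaping and thresholding step described above amounts to something like the following numpy sketch (shapes follow the Model section; the weights and element counts here are random placeholders):

```python
import numpy as np

# For one hidden unit k, reshape its incoming word weights W1^w(k, :) into a d x n_w
# matrix whose columns align with the embeddings of the chosen elements, then keep
# only weights with absolute value > 0.2. Sizes and values are placeholders.
d, n_w, d_h = 50, 18, 200
rng = np.random.default_rng(0)
W1_w = rng.normal(scale=0.2, size=(d_h, d * n_w))
k = 0

row = W1_w[k]                            # W1^w(k, :), a vector of length d * n_w
per_element = row.reshape(n_w, d).T      # column j = weights on element j's embedding
strong = np.where(np.abs(per_element) > 0.2, per_element, 0.0)
print(np.count_nonzero(strong), "weights above the 0.2 threshold for hidden unit", k)
```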
Related Work

There have been several lines of earlier work in using neural networks for parsing which have points of overlap but also major differences from our work here. One big difference is that much early work uses localist one-hot word representations rather than the distributed representations of modern work. Mayberry III and Miikkulainen (1999) explored a shift-reduce constituency parser with one-hot word representations, and did subsequent parsing work in Mayberry III and Miikkulainen (2005).
Henderson (2004) was the first to attempt to use neural networks in a broad-coverage Penn Treebank parser, using a simple synchrony network to predict parse decisions in a constituency parser. More recently, Titov and Henderson (2007) applied Incremental Sigmoid Belief Networks to constituency parsing, and Garg and Henderson (2011) then extended this work to transition-based dependency parsers using a Temporal Restricted Boltzmann Machine. These are very different neural network architectures from ours, and are much less scalable; in practice a restricted vocabulary was used to make them practical.
There have been a number of recent uses of deep learning for constituency parsing (Collobert, 2011; Socher et al., 2013). Socher et al. (2014) have also built models over dependency representations, but this work has not attempted to learn neural networks for dependency parsing.
Most recently, Stenetorp (2013) attempted to build recursive neural networks for transition-based dependency parsing; however, the empirical performance of his model is still unsatisfactory.
Conclusion
We have presented a novel dependency parser using neural networks. Experimental evaluations show that our parser outperforms other greedy parsers using sparse indicator features in both accuracy and speed. This is achieved by representing all words, POS tags and arc labels as dense vectors, and modeling their interactions through a novel cube activation function. Our model only relies on dense features, and is able to automatically learn the most useful feature conjunctions for making predictions.
An interesting line of future work is to combine our neural network based classifier with search-based models to further improve accuracy. Also, there is still room for improvement in our architecture, such as better capturing word conjunctions, or adding richer features (e.g., distance, valency).
Figure 4: Effects of different parser components. Left: comparison of different activation functions.
Middle: comparison of pre-trained word vectors and random initialization. Right: effects of POS and
label embeddings.
Figure 5: t-SNE visualizations of the POS tag embeddings (clusters of noun, verb, adjective, adverb, punctuation, and miscellaneous tags) and the arc label embeddings (clusters such as subject, clausal complement, verbal auxiliaries, noun premodifier, noun postmodifier, and preposition complement).
Figure 6: Three sampled features. In each feature, each row denotes a dimension of the embeddings and each column denotes a chosen element, e.g., $s_1.t$ or $lc(s_1).w$, and the parameters are divided into 3 zones, corresponding to $W_1^w(k,:)$ (left), $W_1^t(k,:)$ (middle) and $W_1^l(k,:)$ (right). White and black dots denote the most positive and most negative weights respectively.
Acknowledgments
Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040 and the Defense Threat Reduction Agency (DTRA) under Air Force Research Laboratory (AFRL) contract no. FA8650-10-C-7020. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.
References
Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In COLING.
Ronan Collobert, Jason Weston, Leon Bottou, Michael
Karlen, Koray Kavukcuoglu, and Pavel Kuksa.
2011. Natural language processing (almost) from
scratch. Journal of Machine Learning Research.
Ronan Collobert. 2011. Deep learning for efficient
discriminative parsing. In AISTATS.
Marie-Catherine de Marneffe, Bill MacCartney, and
Christopher D. Manning. 2006. Generating typed
dependency parses from phrase structure parses. In
LREC.
Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas
Lamar, Richard Schwartz, and John Makhoul. 2014.
Fast and robust neural network joint models for statistical machine translation. In ACL.
John Duchi, Elad Hazan, and Yoram Singer. 2011.
Adaptive subgradient methods for online learning
and stochastic optimization. The Journal of Machine Learning Research.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A
library for large linear classification. The Journal of
Machine Learning Research.
Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014.
Grounded compositional semantics for finding and
describing images with sentences. TACL.
He He, Hal Daume III, and Jason Eisner. 2013. Dynamic feature selection for dependency parsing. In
EMNLP.
Ivan Titov and James Henderson. 2007. Fast and robust multilingual dependency parsing with a generative latent variable model. In EMNLP-CoNLL.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network.
In NAACL.
Laurens van der Maaten and Geoffrey Hinton. 2008.
Visualizing data using t-SNE. The Journal of Machine Learning Research.
Yue Zhang and Stephen Clark. 2008. A tale of
two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search. In EMNLP.
Yue Zhang and Joakim Nivre. 2011. Transition-based
dependency parsing with rich non-local features. In
ACL.