
A Fast and Accurate Dependency Parser using Neural Networks

Danqi Chen
Computer Science Department
Stanford University
[email protected]

Christopher D. Manning
Computer Science Department
Stanford University
[email protected]

Abstract
Almost all current dependency parsers
classify based on millions of sparse indicator features. Not only do these features
generalize poorly, but the cost of feature
computation restricts parsing speed significantly. In this work, we propose a novel
way of learning a neural network classifier
for use in a greedy, transition-based dependency parser. Because this classifier learns
and uses just a small number of dense features, it can work very fast, while achieving about a 2% improvement in unlabeled and labeled attachment scores on
both English and Chinese datasets. Concretely, our parser is able to parse more
than 1000 sentences per second at 92.2%
unlabeled attachment score on the English
Penn Treebank.

1 Introduction

In recent years, enormous parsing success has been achieved by the use of feature-based discriminative dependency parsers (Kubler et al., 2009).
In particular, for practical applications, the speed
of the subclass of transition-based dependency
parsers has been very appealing.
However, these parsers are not perfect. First,
from a statistical perspective, these parsers suffer
from the use of millions of mainly poorly estimated feature weights. While in aggregate both
lexicalized features and higher-order interaction
term features are very important in improving the
performance of these systems, nevertheless, there
is insufficient data to correctly weight most such
features. For this reason, techniques for introducing higher-support features such as word class features have also been very successful in improving
parsing performance (Koo et al., 2008). Second, almost all existing parsers rely on a manually designed set of feature templates, which require a lot of expertise and are usually incomplete. Third, the use of many feature templates causes a less studied problem: in modern dependency parsers, most of the runtime is consumed not by the core parsing algorithm but in the feature extraction step (He et al., 2013). For instance, Bohnet (2010) reports that his baseline parser spends 99% of its time doing feature extraction, despite that being done in standard efficient ways.
In this work, we address all of these problems
by using dense features in place of the sparse indicator features. This is inspired by the recent success of distributed word representations in many
NLP tasks, e.g., POS tagging (Collobert et al.,
2011), machine translation (Devlin et al., 2014),
and constituency parsing (Socher et al., 2013).
Low-dimensional, dense word embeddings can effectively alleviate sparsity by sharing statistical
strength between similar words, and can provide
us a good starting point to construct features of
words and their interactions.
Nevertheless, there remain challenging problems of how to encode all the available information from the configuration and how to model
higher-order features based on the dense representations. In this paper, we train a neural network classifier to make parsing decisions within
a transition-based dependency parser. The neural network learns compact dense vector representations of words, part-of-speech (POS) tags, and
dependency labels. This results in a fast, compact classifier, which uses only 200 learned dense
features while yielding good gains in parsing accuracy and speed on two languages (English and
Chinese) and two different dependency representations (CoNLL and Stanford dependencies). The
main contributions of this work are: (i) showing
the usefulness of dense representations that are
learned within the parsing task, (ii) developing a
neural network architecture that gives good accuracy and speed, and (iii) introducing a novel activation function for the neural network that better captures higher-order interaction features.

2 Transition-based Dependency Parsing

Transition-based dependency parsing aims to predict a transition sequence from an initial configuration to some terminal configuration, which derives a target dependency parse tree, as shown in
Figure 1. In this paper, we examine only greedy
parsing, which uses a classifier to predict the correct transition based on features extracted from the
configuration. This class of parsers is of great interest because of their efficiency, although they
tend to perform slightly worse than the search-based parsers because of subsequent error propagation. However, our greedy parser can achieve comparable accuracy with very good speed. [1]
As the basis of our parser, we employ the
arc-standard system (Nivre, 2004), one of the
most popular transition systems. In the arc-standard system, a configuration c = (s, b, A) consists of a stack s, a buffer b, and a set of dependency arcs A. The initial configuration for a sentence w_1, ..., w_n is s = [ROOT], b = [w_1, ..., w_n], A = ∅. A configuration c is terminal if the buffer is empty and the stack contains the single node ROOT, and the parse tree is given by A_c. Denoting s_i (i = 1, 2, ...) as the i-th top element on the stack, and b_i (i = 1, 2, ...) as the i-th element on the buffer, the arc-standard system defines three types of transitions:

LEFT-ARC(l): adds an arc s1 → s2 with label l and removes s2 from the stack. Precondition: |s| ≥ 2.

RIGHT-ARC(l): adds an arc s2 → s1 with label l and removes s1 from the stack. Precondition: |s| ≥ 2.

SHIFT: moves b1 from the buffer to the stack. Precondition: |b| ≥ 1.
In the labeled version of parsing, there are in total |T| = 2N_l + 1 transitions, where N_l is the number of different arc labels. Figure 1 illustrates an example of one transition sequence from the initial configuration to a terminal one.
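To make the arc-standard system concrete, the following is a minimal Python sketch of a configuration and the three transitions (our own illustration, not the authors' implementation; the class and function names are invented).

# Minimal sketch of the arc-standard system described above (illustrative only).
class Configuration:
    def __init__(self, n_words):
        self.stack = [0]                                # token 0 is ROOT
        self.buffer = list(range(1, n_words + 1))       # w_1 ... w_n
        self.arcs = []                                  # set A as a list of (head, dependent, label)

    def is_terminal(self):
        # terminal: empty buffer and only ROOT left on the stack
        return not self.buffer and self.stack == [0]

def apply(c, transition):
    """Apply 'SHIFT', ('LEFT-ARC', l) or ('RIGHT-ARC', l) to configuration c in place."""
    if transition == 'SHIFT':
        assert len(c.buffer) >= 1                       # precondition |b| >= 1
        c.stack.append(c.buffer.pop(0))
    else:
        name, label = transition
        assert len(c.stack) >= 2                        # precondition |s| >= 2
        s1, s2 = c.stack[-1], c.stack[-2]
        if name == 'LEFT-ARC':                          # add arc s1 -> s2, remove s2
            c.arcs.append((s1, s2, label))
            del c.stack[-2]
        else:                                           # RIGHT-ARC: add arc s2 -> s1, remove s1
            c.arcs.append((s2, s1, label))
            c.stack.pop()
    return c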
The essential goal of a greedy parser is to predict a correct transition from T, based on one given configuration. Information that can be obtained from one configuration includes: (1) all the words and their corresponding POS tags (e.g., has / VBZ); (2) the head of a word and its label (e.g., nsubj, dobj) if applicable; (3) the position of a word on the stack/buffer or whether it has already been removed from the stack.

[1] Additionally, our parser can be naturally incorporated with beam search, but we leave this to future work.

Single-word features (9)
s1.w; s1.t; s1.wt; s2.w; s2.t; s2.wt; b1.w; b1.t; b1.wt

Word-pair features (8)
s1.wt ◦ s2.wt; s1.wt ◦ s2.w; s1.wt ◦ s2.t; s1.w ◦ s2.wt; s1.t ◦ s2.wt; s1.w ◦ s2.w; s1.t ◦ s2.t; s1.t ◦ b1.t

Three-word features (8)
s2.t ◦ s1.t ◦ b1.t; s2.t ◦ s1.t ◦ lc1(s1).t; s2.t ◦ s1.t ◦ rc1(s1).t; s2.t ◦ s1.t ◦ lc1(s2).t; s2.t ◦ s1.t ◦ rc1(s2).t; s2.t ◦ s1.w ◦ rc1(s2).t; s2.t ◦ s1.w ◦ lc1(s1).t; s2.t ◦ s1.w ◦ b1.t

Table 1: The feature templates used for analysis. lc1(si) and rc1(si) denote the leftmost and rightmost children of si; w denotes word, t denotes POS tag.
Conventional approaches extract indicator features such as the conjunction of 1 to 3 elements from the stack/buffer using their words, POS tags or arc labels. Table 1 lists a typical set of feature templates chosen from the ones of (Huang et al., 2009; Zhang and Nivre, 2011). [2] These features suffer from the following problems:
Sparsity. The features, especially lexicalized features, are highly sparse, and this is a common problem in many NLP tasks. The situation is severe in dependency parsing, because it depends critically on word-to-word interactions and thus the high-order features. To give a better understanding, we perform a feature analysis using the features in Table 1 on the English Penn Treebank (CoNLL representations). The results given in Table 2 demonstrate that: (1) lexicalized features are indispensable; (2) not only are the word-pair features (especially s1 and s2) vital for predictions, the three-word conjunctions (e.g., {s2, s1, b1}, {s2, lc1(s1), s1}) are also very important.
[2] We exclude sophisticated features using labels, distance, valency and third-order features in this analysis, but we will include all of them in the final evaluation.

[Figure 1: An example of transition-based dependency parsing. Above left: a desired dependency tree for "He has good control ." (PRP VBZ JJ NN .) with arcs root(ROOT, has), nsubj(has, He), dobj(has, control), amod(control, good), punct(has, .). Above right: an intermediate configuration with stack [ROOT has_VBZ good_JJ], buffer [control_NN ._.], the arc nsubj(has, He) already built, and correct transition SHIFT. Bottom: a transition sequence of the arc-standard system, reproduced below.]

Transition          Stack                     Buffer                    A
                    [ROOT]                    [He has good control .]   ∅
SHIFT               [ROOT He]                 [has good control .]
SHIFT               [ROOT He has]             [good control .]
LEFT-ARC(nsubj)     [ROOT has]                [good control .]          A ∪ nsubj(has, He)
SHIFT               [ROOT has good]           [control .]
SHIFT               [ROOT has good control]   [.]
LEFT-ARC(amod)      [ROOT has control]        [.]                       A ∪ amod(control, good)
RIGHT-ARC(dobj)     [ROOT has]                [.]                       A ∪ dobj(has, control)
...                 ...                       ...
RIGHT-ARC(root)     [ROOT]                    []                        A ∪ root(ROOT, has)
Features                              UAS
All features in Table 1               88.0
single-word & word-pair features      82.7
only single-word features             76.9
excluding all lexicalized features    81.5

Table 2: Performance of different feature sets. UAS: unlabeled attachment score.
Incompleteness. Incompleteness is an unavoidable issue in all existing feature templates: even with expertise and manual handling involved, they still do not include the conjunction of every useful word combination. For example, the conjunction of s1 and b2 is omitted in almost all commonly used feature templates; however, it could indicate that we cannot perform a RIGHT-ARC action if there is an arc from s1 to b2.
Expensive feature computation. The feature generation of indicator features is generally expensive: we have to concatenate some words, POS tags, or arc labels to generate feature strings, and look them up in a huge table containing several million features. In our experiments, more than 95% of the time is consumed by feature computation during the parsing process.
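For contrast, the fragment below sketches, purely for illustration, how such indicator features are typically generated: conjunction strings are built by concatenation and then looked up in a very large feature-to-index table, which is exactly the step that dominates runtime. The function and table names are hypothetical.

# Illustrative sketch of indicator feature generation (not the authors' code).
def indicator_features(s1_w, s1_t, s2_w, s2_t, b1_t, feature_index):
    feats = [
        "s1.w=" + s1_w,
        "s1.wt=" + s1_w + "/" + s1_t,
        "s1.wt|s2.wt=" + s1_w + "/" + s1_t + "|" + s2_w + "/" + s2_t,
        "s2.t|s1.t|b1.t=" + s2_t + "|" + s1_t + "|" + b1_t,
        # ... dozens more templates in a realistic feature set ...
    ]
    # Building the strings and probing a table with millions of entries
    # is what consumes most of the parsing time.
    return [feature_index[f] for f in feats if f in feature_index]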
So far, we have discussed the preliminaries of transition-based dependency parsing and the existing problems of sparse indicator features. In the following sections, we will elaborate on our neural network model for learning dense features, along with experimental evaluations that prove its efficiency.

3 Neural Network Based Parser

In this section, we first present our neural network model and its main components. Later, we give details of training and of speeding up the parsing process.
3.1 Model

[Figure 2: Our neural network architecture. Input layer: [x^w, x^t, x^l]; hidden layer: h = (W^w_1 x^w + W^t_1 x^t + W^l_1 x^l + b_1)^3; softmax layer: p = softmax(W_2 h). The example configuration has stack [ROOT has_VBZ good_JJ], buffer [control_NN ._.], and the arc nsubj(has, He_PRP) already built.]

Figure 2 describes our neural network architecture. First, as usual word embeddings, we represent each word as a d-dimensional vector e^w_i ∈ R^d, and the full embedding matrix is E^w ∈ R^{d×N_w}, where N_w is the dictionary size. Meanwhile, we also map POS tags and arc labels to a d-dimensional vector space, where e^t_i, e^l_j ∈ R^d are the representations of the i-th POS tag and j-th arc label. Correspondingly, the POS and label embedding matrices are E^t ∈ R^{d×N_t} and E^l ∈ R^{d×N_l}, where N_t and N_l are the number of distinct POS tags and arc labels.

We choose a set of elements based on the stack / buffer positions for each type of information (word, POS or label), which might be useful for our predictions. We denote the sets as S^w, S^t, S^l respectively. For example, given the configuration in Figure 2 and S^t = {lc1(s2).t, s2.t, rc1(s2).t, s1.t}, we will extract PRP, VBZ, NULL, JJ in order. Here we use a special token NULL to represent a non-existent element.
We build a standard neural network with one hidden layer, where the corresponding embeddings of our chosen elements from S^w, S^t, S^l will be added to the input layer. Denoting n_w, n_t, n_l as the number of chosen elements of each type, we add x^w = [e^w_{w_1}; e^w_{w_2}; ...; e^w_{w_{n_w}}] to the input layer, where S^w = {w_1, ..., w_{n_w}}. Similarly, we add the POS tag features x^t and arc label features x^l to the input layer.

We map the input layer to a hidden layer with d_h nodes through a cube activation function:

h = (W^w_1 x^w + W^t_1 x^t + W^l_1 x^l + b_1)^3

where W^w_1 ∈ R^{d_h×(d·n_w)}, W^t_1 ∈ R^{d_h×(d·n_t)}, W^l_1 ∈ R^{d_h×(d·n_l)}, and b_1 ∈ R^{d_h} is the bias.

A softmax layer is finally added on top of the hidden layer for modeling multi-class probabilities: p = softmax(W_2 h), where W_2 ∈ R^{|T|×d_h}.
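A compact NumPy sketch of this forward pass is given below, assuming the embeddings of the chosen elements have already been looked up and concatenated into x^w, x^t, x^l; the shapes follow the definitions above, but the code is only an illustration, not the authors' implementation.

import numpy as np

def forward(x_w, x_t, x_l, W1_w, W1_t, W1_l, b1, W2):
    """x_w: (d*n_w,), x_t: (d*n_t,), x_l: (d*n_l,) concatenated embeddings;
    W1_*: (d_h, d*n_*), b1: (d_h,), W2: (|T|, d_h)."""
    a = W1_w @ x_w + W1_t @ x_t + W1_l @ x_l + b1       # weighted sum of the input layer
    h = a ** 3                                          # cube activation
    scores = W2 @ h                                     # one score per transition
    scores = scores - scores.max()                      # for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()           # softmax over transitions
    return h, p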
POS and label embeddings
To the best of our knowledge, this is the first attempt to introduce POS tag and arc label embeddings instead of discrete representations.
Although the POS tags P = {NN, NNP,
NNS, DT, JJ, . . .} (for English) and arc labels
L = {amod, tmod, nsubj, csubj, dobj, . . .}
(for Stanford Dependencies on English) are relatively small discrete sets, they still exhibit many
semantic similarities, like words. For example, NN (singular noun) should be closer to NNS (plural noun) than to DT (determiner), and amod (adjective modifier) should be closer to num (numeric modifier) than to nsubj (nominal subject). We expect these semantic meanings to be effectively captured by the dense representations.

[Figure 3: Different activation functions used in neural networks: cube, sigmoid, tanh, and identity.]
Cube activation function
As stated above, we introduce a novel activation function, cube: g(x) = x^3, in our model instead of the commonly used tanh or sigmoid functions (Figure 3).
Intuitively, every hidden unit is computed by a (non-linear) mapping on a weighted sum of input units plus a bias. Using g(x) = x^3 can model the product terms x_i x_j x_k for any three different elements at the input layer directly:

g(w_1 x_1 + ... + w_m x_m + b) = ∑_{i,j,k} (w_i w_j w_k) x_i x_j x_k + ∑_{i,j} b (w_i w_j) x_i x_j + ...

In our case, x_i, x_j, x_k could come from different dimensions of three embeddings. We believe that this better captures the interaction of three elements, which is a highly desirable property for dependency parsing.
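As a quick sanity check of the expansion above, the following SymPy snippet expands the cube of a small weighted sum and recovers the three-way product term explicitly (a toy verification, not part of the model).

import sympy as sp

w1, w2, w3, b, x1, x2, x3 = sp.symbols('w1 w2 w3 b x1 x2 x3')
expanded = sp.expand((w1*x1 + w2*x2 + w3*x3 + b)**3)
# The expansion contains 6*w1*w2*w3*x1*x2*x3 and terms like 3*b*w1*w2*x1*x2,
# i.e., exactly the three-way and two-way interaction products discussed above.
print(expanded.coeff(x1*x2*x3))   # prints 6*w1*w2*w3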


Experimental results also verify the success of the cube activation function empirically (see more comparisons in Section 4). However, the expressive power of this activation function remains to be investigated theoretically.
The choice of S^w, S^t, S^l

Following (Zhang and Nivre, 2011), we pick a rich set of elements for our final parser. In detail, S^w contains n_w = 18 elements: (1) the top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3; (2) the first and second leftmost / rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2; (3) the leftmost of leftmost / rightmost of rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2.

We use the corresponding POS tags for S^t (n_t = 18), and the corresponding arc labels of words, excluding those 6 words on the stack/buffer, for S^l (n_l = 12). A good advantage of our parser is that we can add a rich set of elements cheaply, instead of hand-crafting many more indicator features.
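The 18 positions of S^w can be spelled out programmatically as in the sketch below; s(i), b(i), lc(., k) and rc(., k) are assumed accessors on a configuration (S^t uses the same positions for POS tags, and S^l drops the 6 stack/buffer words).

# Illustrative enumeration of the n_w = 18 word positions (not the authors' code).
def word_positions(c):
    elems = [c.s(1), c.s(2), c.s(3), c.b(1), c.b(2), c.b(3)]      # top 3 of stack and buffer
    for i in (1, 2):                                              # 1st/2nd leftmost and rightmost children
        elems += [c.lc(c.s(i), 1), c.rc(c.s(i), 1),
                  c.lc(c.s(i), 2), c.rc(c.s(i), 2)]
    for i in (1, 2):                                              # leftmost-of-leftmost, rightmost-of-rightmost
        elems += [c.lc(c.lc(c.s(i), 1), 1), c.rc(c.rc(c.s(i), 1), 1)]
    return elems                                                  # 6 + 8 + 4 = 18 elements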
3.2 Training

We first generate training examples {(c_i, t_i)}_{i=1}^m from the training sentences and their gold parse trees, using a "shortest stack" oracle which always prefers LEFT-ARC(l) over SHIFT, where c_i is a configuration and t_i ∈ T is the oracle transition.
The final training objective is to minimize the cross-entropy loss, plus an l2-regularization term:

L(θ) = −∑_i log p_{t_i} + (λ/2) ‖θ‖^2

where θ is the set of all parameters {W^w_1, W^t_1, W^l_1, b_1, W_2, E^w, E^t, E^l}. A slight variation is that in practice we compute the softmax probabilities only among the feasible transitions.
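In code, the objective could be computed as in the sketch below (our own illustration): a sum of cross-entropy terms over oracle transitions, with the softmax restricted to feasible transitions, plus an l2 penalty over all parameters. hidden_and_scores stands in for the forward pass above and is an assumed helper.

import numpy as np

def training_loss(batch, params, lam):
    """batch: iterable of ((x_w, x_t, x_l), oracle_id, feasible_mask); illustrative only."""
    total = 0.0
    for (x_w, x_t, x_l), t_gold, feasible in batch:
        scores = hidden_and_scores(x_w, x_t, x_l, params)     # assumed helper returning W2 h
        scores = np.where(feasible, scores, -np.inf)          # softmax only over feasible transitions
        log_p = scores - np.logaddexp.reduce(scores[feasible])
        total -= log_p[t_gold]                                # cross-entropy for the oracle transition
    total += 0.5 * lam * sum(np.sum(p ** 2) for p in params.values())
    return total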
For initialization of parameters, we use pre-trained word embeddings to initialize E^w and use random initialization within (−0.01, 0.01) for E^t and E^l. Concretely, we use the pre-trained word embeddings from (Collobert et al., 2011) for English (#dictionary = 130,000, coverage = 72.7%), and our trained 50-dimensional word2vec embeddings (Mikolov et al., 2013) on the Wikipedia and Gigaword corpus for Chinese (#dictionary = 285,791, coverage = 79.0%). We will also compare with random initialization of E^w in Section 4. The training error derivatives will be back-propagated to these embeddings during the training process.
We use mini-batched AdaGrad (Duchi et al., 2011) for optimization and also apply dropout (Hinton et al., 2012) with a 0.5 rate. The parameters which achieve the best unlabeled attachment score on the development set will be chosen for final evaluation.
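A minimal sketch of the corresponding AdaGrad update and dropout mask is shown below, with the gradients assumed to be given; the epsilon constant and the absence of rescaling are our own simplifications.

import numpy as np

def adagrad_step(param, grad, cache, lr=0.01, eps=1e-6):
    """One AdaGrad update; 'cache' accumulates squared gradients per parameter."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache

def dropout(h, rate=0.5):
    """Randomly zero hidden units with probability 'rate' during training."""
    mask = np.random.rand(*h.shape) >= rate
    return h * mask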
3.3 Parsing

We perform greedy decoding in parsing. At each step, we extract all the corresponding word, POS and label embeddings from the current configuration c, compute the hidden layer h(c) ∈ R^{d_h}, and pick the transition with the highest score: t = argmax_{t feasible} W_2(t, ·) h(c), and then execute c → t(c).
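Putting the pieces together, greedy decoding is a simple loop over configurations, as sketched below; it reuses the Configuration, apply and forward sketches given earlier, and extract_embeddings, feasible_transitions and the model attributes are assumed helpers rather than anything defined in the paper.

def greedy_parse(sentence, model):
    """Greedily parse one sentence (illustrative only)."""
    c = Configuration(len(sentence))
    while not c.is_terminal():
        x_w, x_t, x_l = extract_embeddings(c, model)        # look up embeddings for S^w, S^t, S^l
        _, p = forward(x_w, x_t, x_l, model.W1_w, model.W1_t,
                       model.W1_l, model.b1, model.W2)
        feasible = feasible_transitions(c)                  # respects the preconditions of Section 2
        t = max(feasible, key=lambda tr: p[model.transition_id[tr]])
        apply(c, t)                                         # execute c -> t(c)
    return c.arcs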
Compared with indicator features, our parser does not need to compute conjunction features and look them up in a huge feature table, and thus greatly reduces feature generation time. Instead, it involves many matrix addition and multiplication operations. To further speed up parsing, we apply a pre-computation trick, similar to (Devlin et al., 2014). For each position chosen from S^w, we pre-compute the matrix multiplications for the 10,000 most frequent words. Thus, computing the hidden layer only requires looking up the table for these frequent words and adding the d_h-dimensional vectors. Similarly, we also pre-compute matrix computations for all positions and all POS tags and arc labels. We only use this optimization in the neural network parser, since it is only feasible for a parser which uses a small number of features. In practice, this pre-computation step increases the speed of our parser 8 to 10 times.
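The pre-computation trick can be sketched as follows: for each (position, frequent word) pair, the product of the corresponding slice of W^w_1 with that word's embedding is cached once, so at parse time the matrix-vector product reduces to a table lookup plus additions. Names and shapes follow Section 3.1, but the code is only an assumed illustration.

import numpy as np

def precompute(W1_w, E_w, frequent_ids, n_w, d):
    """Cache W1_w[:, pos*d:(pos+1)*d] @ E_w[wid] for frequent words at every position."""
    table = {}
    for pos in range(n_w):
        block = W1_w[:, pos * d:(pos + 1) * d]              # (d_h, d) slice for this position
        for wid in frequent_ids:                            # e.g., the 10,000 most frequent words
            table[(pos, wid)] = block @ E_w[wid]            # cached (d_h,) vector
    return table

def hidden_word_part(word_ids, table, W1_w, E_w, d):
    """Sum cached vectors, falling back to the matrix product for rare words."""
    total = 0
    for pos, wid in enumerate(word_ids):
        if (pos, wid) in table:
            total = total + table[(pos, wid)]
        else:
            total = total + W1_w[:, pos * d:(pos + 1) * d] @ E_w[wid]
    return total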

4 Experiments

4.1 Datasets

We conduct our experiments on the English Penn Treebank (PTB) and the Chinese Penn Treebank (CTB) datasets.
For English, we follow the standard splits of PTB3, using sections 2-21 for training, section 22 as the development set and section 23 as the test set. We adopt two different dependency representations: CoNLL Syntactic Dependencies (CD) (Johansson and Nugues, 2007), using the LTH Constituent-to-Dependency Conversion Tool [3], and Stanford Basic Dependencies (SD) (de Marneffe et al., 2006), using the Stanford parser v3.3.0. [4] The POS tags are assigned using the Stanford POS tagger (Toutanova et al., 2003) with ten-way jackknifing of the training data (accuracy 97.3%).

For Chinese, we adopt the same split of CTB5 as described in (Zhang and Clark, 2008). Dependencies are converted using the Penn2Malt tool [5] with the head-finding rules of (Zhang and Clark, 2008). Following (Zhang and Clark, 2008; Zhang and Nivre, 2011), we use gold segmentation and POS tags for the input.

Table 3 gives statistics of the three datasets. [6] In particular, over 99% of the trees are projective in all datasets.

Dataset   #Train   #Dev    #Test   #words (Nw)   #POS (Nt)   #labels (Nl)   projective (%)
PTB: CD   39,832   1,700   2,416   44,352        45          17             99.4
PTB: SD   39,832   1,700   2,416   44,389        45          45             99.9
CTB       16,091     803   1,910   34,577        35          12             100.0

Table 3: Data statistics. "Projective" is the percentage of projective trees on the training set.
4.2 Results

The following hyper-parameters are used in all experiments: embedding size d = 50, hidden layer size d_h = 200, regularization parameter λ = 10^{-8}, initial learning rate of AdaGrad α = 0.01.
To situate the performance of our parser, we first
make a comparison with our own implementation of greedy arc-eager and arc-standard parsers.
These parsers are trained with structured averaged
perceptron using the early-update strategy. The
feature templates of (Zhang and Nivre, 2011) are
used for the arc-eager system, and they are also
adapted to the arc-standard system. [7]
Furthermore, we also compare our parser with two popular, off-the-shelf parsers: MaltParser, a greedy transition-based dependency parser (Nivre et al., 2006) [8], and MSTParser, a first-order graph-based parser (McDonald and Pereira, 2006) [9]. In this comparison, for MaltParser, we select stackproj (arc-standard) and nivreeager (arc-eager) as parsing algorithms, and liblinear (Fan et al., 2008) for optimization. [10] For MSTParser, we use default options.

[3] http://nlp.cs.lth.se/software/treebank_converter/
[4] http://nlp.stanford.edu/software/lex-parser.shtml
[5] http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
[6] Pennconverter and Stanford dependencies generate slightly different tokenization, e.g., Pennconverter splits the token WCRS\/Boston NNP into three tokens WCRS NNP / CC Boston NNP.
[7] Since arc-standard is bottom-up, we remove all features using the head of stack elements, and also add the right child features of the first stack element.
[8] http://www.maltparser.org/
On all datasets, we report unlabeled attachment scores (UAS) and labeled attachment scores (LAS); punctuation is excluded in all evaluation metrics. [11] Our parser and the baseline arc-standard and arc-eager parsers are all implemented in Java. The parsing speeds are measured on an Intel Core i7 2.7GHz CPU with 16GB RAM, and the runtime does not include pre-computation or parameter loading time.
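For reference, UAS and LAS over a test set can be computed as in the sketch below, skipping punctuation tokens by their gold POS tags as in footnote 11 (our own illustration).

def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels, gold_tags,
                      punct_tags=("``", "''", ":", ",", ".")):
    """Return (UAS, LAS) in percent, excluding punctuation tokens."""
    total = uas = las = 0
    for gh, gl, ph, pl, tag in zip(gold_heads, gold_labels,
                                   pred_heads, pred_labels, gold_tags):
        if tag in punct_tags:          # punctuation is excluded from evaluation
            continue
        total += 1
        if gh == ph:
            uas += 1
            if gl == pl:
                las += 1
    return 100.0 * uas / total, 100.0 * las / total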
Table 4, Table 5 and Table 6 show the comparison of accuracy and parsing speed on PTB
(CoNLL dependencies), PTB (Stanford dependencies) and CTB respectively.
Parser        Dev UAS   Dev LAS   Test UAS   Test LAS   Speed (sent/s)
standard      89.9      88.7      89.7       88.3         51
eager         90.3      89.2      89.9       88.6         63
Malt:sp       90.0      88.8      89.9       88.5        560
Malt:eager    90.1      88.9      90.1       88.7        535
MSTParser     92.1      90.8      92.0       90.5         12
Our parser    92.2      91.0      92.0       90.7       1013

Table 4: Accuracy and parsing speed on PTB + CoNLL dependencies.
Clearly, our parser is superior in terms of both accuracy and speed. Compared with the baseline arc-eager and arc-standard parsers, our parser achieves around a 2% improvement in UAS and LAS on all datasets, while running about 20 times faster.

[9] http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html
[10] We do not compare with libsvm optimization, which is known to be slightly more accurate, but orders of magnitude slower (Kong and Smith, 2014).
[11] A token is punctuation if its gold POS tag is {`` '' : , .} for English and PU for Chinese.

Parser        Dev UAS   Dev LAS   Test UAS   Test LAS   Speed (sent/s)
standard      90.2      87.8      89.4       87.3         26
eager         89.8      87.4      89.6       87.4         34
Malt:sp       89.8      87.2      89.3       86.9        469
Malt:eager    89.6      86.9      89.4       86.8        448
MSTParser     91.4      88.1      90.7       87.6         10
Our parser    92.0      89.7      91.8       89.6        654

Table 5: Accuracy and parsing speed on PTB + Stanford dependencies.
Parser        Dev UAS   Dev LAS   Test UAS   Test LAS   Speed (sent/s)
standard      82.4      80.9      82.7       81.2         72
eager         81.1      79.7      80.3       78.7         80
Malt:sp       82.4      80.5      82.4       80.6        420
Malt:eager    81.2      79.3      80.2       78.4        393
MSTParser     84.0      82.1      83.0       81.2          6
Our parser    84.0      82.4      83.9       82.4        936

Table 6: Accuracy and parsing speed on CTB.


It is worth noting that the efficiency of our parser even surpasses MaltParser using liblinear, which is known to be highly optimized, while our parser achieves much better accuracy. Also, despite the fact that the graph-based MSTParser achieves a similar result to ours on PTB (CoNLL dependencies), our parser is nearly 100 times faster. In particular, our transition-based parser has a great advantage in LAS, especially for the fine-grained label set of Stanford dependencies.
4.3 Effects of Parser Components

Herein, we examine the components that account for the performance of our parser.
Cube activation function
We compare our cube activation function (x^3) with two widely used non-linear functions: tanh ((e^x − e^{−x}) / (e^x + e^{−x})), sigmoid (1 / (1 + e^{−x})), and also the identity function (x), as shown in Figure 4 (left).

In short, cube outperforms all other activation functions significantly, and identity works the worst. Concretely, cube achieves a 0.8% to 1.2% improvement in UAS over tanh and the other functions, thus verifying the effectiveness of the cube activation function empirically.

Initialization of pre-trained word embeddings

We further analyze the influence of using pre-trained word embeddings for initialization. Figure 4 (middle) shows that using pre-trained word embeddings obtains around a 0.7% improvement on PTB and a 1.7% improvement on CTB, compared with using random initialization within (−0.01, 0.01). On the one hand, the pre-trained word embeddings of Chinese appear more useful than those of English; on the other hand, our model is still able to achieve comparable accuracy without the help of pre-trained word embeddings.
POS tag and arc label embeddings
As shown in Figure 4 (right), POS embeddings yield around a 1.7% improvement on PTB and nearly a 10% improvement on CTB, and the label embeddings yield much smaller improvements of 0.3% and 1.4% respectively.
However, we can obtain little gain from label embeddings when the POS embeddings are
present. This may be because the POS tags of two
tokens already capture most of the label information between them.
4.4 Model Analysis

Last but not least, we will examine the parameters we have learned, and hope to investigate what
these dense features capture. We use the weights
learned from the English Penn Treebank using
Stanford dependencies for analysis.
What do E^t, E^l capture?

We first introduced E^t and E^l as the dense representations of all POS tags and arc labels, and we wonder whether these embeddings could carry some semantic information.

Figure 5 presents t-SNE visualizations (van der Maaten and Hinton, 2008) of these embeddings. It clearly shows that these embeddings effectively exhibit the similarities between POS tags or arc labels. For instance, the three adjective POS tags JJ, JJR, JJS have very close embeddings, and the three labels representing clausal complements, acomp, ccomp, xcomp, are grouped together.

Since these embeddings can effectively encode the semantic regularities, we believe that they can also be used as alternative features of POS tags (or arc labels) in other NLP tasks, and help boost the performance.
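A visualization along these lines can be reproduced with off-the-shelf t-SNE, for example with the scikit-learn sketch below (our illustration; the embedding matrix and tag names are assumed to be available as arrays).

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(E, names, perplexity=5.0):
    """Project d-dimensional tag/label embeddings to 2D and label each point."""
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(np.asarray(E))
    plt.scatter(coords[:, 0], coords[:, 1], s=10)
    for (x, y), name in zip(coords, names):
        plt.annotate(name, (x, y), fontsize=8)
    plt.show()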

What do W^w_1, W^t_1, W^l_1 capture?

Knowing that E^t and E^l (as well as the word embeddings E^w) can capture semantic information very well, next we hope to investigate what each feature in the hidden layer has really learned.

Since we currently only have d_h = 200 learned dense features, we wonder whether this is sufficient to learn the word conjunctions used as sparse indicator features, or even more. We examine the weights W^w_1(k, ·) ∈ R^{d·n_w}, W^t_1(k, ·) ∈ R^{d·n_t}, W^l_1(k, ·) ∈ R^{d·n_l} for each hidden unit k, and reshape them to d × n_w, d × n_t, d × n_l matrices, such that the weights of each column correspond to the embedding of one specific element (e.g., s1.t).

We pick the weights with absolute value > 0.2, and visualize them for each feature. Figure 6 gives the visualization of three sampled features, and it exhibits many interesting phenomena:

Different features have varied distributions of the weights. However, most of the discriminative weights come from W^t_1 (the middle zone in Figure 6), and this further justifies the importance of POS tags in dependency parsing.

We carefully examine many of the d_h = 200 features, and find that they actually encode very different views of information. For the three sampled features in Figure 6, the largest weights are dominated by:

Feature 1: s1.t, s2.t, lc(s1).t.
Feature 2: rc(s1).t, s1.t, b1.t.
Feature 3: s1.t, s1.w, lc(s1).t, lc(s1).l.

These features all seem very plausible, as observed in the experiments on indicator feature systems. Thus our model is able to automatically identify the most useful information for predictions, instead of hand-crafting it as indicator features.

More importantly, we can extract features regarding the conjunctions of more than 3 elements easily, and also those not present in the indicator feature systems. For example, the 3rd feature above captures the conjunction of the word and POS tag of s1, the POS tag of its leftmost child, and also the label between them, while this information is not encoded in the original feature templates of (Zhang and Nivre, 2011).

5 Related Work

There have been several lines of earlier work in using neural networks for parsing which have points of overlap but also major differences from our work here. One big difference is that much early work uses localist one-hot word representations rather than the distributed representations of modern work. (Mayberry III and Miikkulainen, 1999) explored a shift-reduce constituency parser with one-hot word representations and did subsequent parsing work in (Mayberry III and Miikkulainen, 2005).

(Henderson, 2004) was the first to attempt to use neural networks in a broad-coverage Penn Treebank parser, using a simple synchrony network to predict parse decisions in a constituency parser. More recently, (Titov and Henderson, 2007) applied Incremental Sigmoid Belief Networks to constituency parsing, and then (Garg and Henderson, 2011) extended this work to transition-based dependency parsers using a Temporal Restricted Boltzmann Machine. These are very different neural network architectures and are much less scalable; in practice a restricted vocabulary was used to make the architecture practical.

There have been a number of recent uses of deep learning for constituency parsing (Collobert, 2011; Socher et al., 2013). (Socher et al., 2014) has also built models over dependency representations, but this work has not attempted to learn neural networks for dependency parsing.

Most recently, (Stenetorp, 2013) attempted to build recursive neural networks for transition-based dependency parsing; however, the empirical performance of his model is still unsatisfactory.

6 Conclusion

We have presented a novel dependency parser using neural networks. Experimental evaluations
show that our parser outperforms other greedy
parsers using sparse indicator features in both accuracy and speed. This is achieved by representing all words, POS tags and arc labels as dense
vectors, and modeling their interactions through a
novel cube activation function. Our model only
relies on dense features, and is able to automatically learn the most useful feature conjunctions
for making predictions.
An interesting line of future work is to combine our neural network based classifier with search-based models to further improve accuracy. Also, there is still room for improvement in our architecture, such as better capturing word conjunctions, or adding richer features (e.g., distance, valency).

[Figure 4: Effects of different parser components (UAS scores on PTB:CD, PTB:SD and CTB). Left: comparison of different activation functions (cube, tanh, sigmoid, identity). Middle: comparison of pre-trained word vectors and random initialization. Right: effects of POS and label embeddings (word+POS+label, word+POS, word+label, word).]

[Figure 5: t-SNE visualization of POS and label embeddings.]

[Figure 6: Three sampled features. In each feature, each row denotes a dimension of embeddings and each column denotes a chosen element, e.g., s1.t or lc(s1).w, and the parameters are divided into 3 zones, corresponding to W^w_1(k, :) (left), W^t_1(k, :) (middle) and W^l_1(k, :) (right). White and black dots denote the most positive and most negative weights respectively.]

Acknowledgments
Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040 and the Defense Threat Reduction Agency (DTRA) under Air Force Research Laboratory (AFRL) contract no. FA8650-10-C-7020. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References
Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Coling.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.

Ronan Collobert. 2011. Deep learning for efficient discriminative parsing. In AISTATS.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research.

Nikhil Garg and James Henderson. 2011. Temporal restricted Boltzmann machines for dependency parsing. In ACL-HLT.

He He, Hal Daume III, and Jason Eisner. 2013. Dynamic feature selection for dependency parsing. In EMNLP.

James Henderson. 2004. Discriminative training of a neural network statistical parser. In ACL.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Liang Huang, Wenbin Jiang, and Qun Liu. 2009. Bilingually-constrained (monolingual) shift-reduce parsing. In EMNLP.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In Proceedings of NODALIDA, Tartu, Estonia.

Lingpeng Kong and Noah A. Smith. 2014. An empirical comparison of parsing methods for Stanford dependencies. CoRR, abs/1404.4314.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In ACL.

Sandra Kubler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.

Marshall R. Mayberry III and Risto Miikkulainen. 1999. SARDSRN: A neural network shift-reduce parser. In IJCAI.

Marshall R. Mayberry III and Risto Miikkulainen. 2005. Broad-coverage parsing with neural networks. Neural Processing Letters.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In LREC.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL.

Pontus Stenetorp. 2013. Transition-based dependency parsing using recursive neural networks. In NIPS Workshop on Deep Learning.

Ivan Titov and James Henderson. 2007. Fast and robust multilingual dependency parsing with a generative latent variable model. In EMNLP-CoNLL.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. The Journal of Machine Learning Research.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search. In EMNLP.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In ACL.
