
Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations

Eliyahu Kiperwasser and Yoav Goldberg
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel

Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00101/1567410/tacl_a_00101.pdf by Kirklareli University user on 12 December 2023


Abstract

We present a simple and effective scheme for dependency parsing which is based on bidirectional-LSTMs (BiLSTMs). Each sentence token is associated with a BiLSTM vector representing the token in its sentential context, and feature vectors are constructed by concatenating a few BiLSTM vectors. The BiLSTM is trained jointly with the parser objective, resulting in very effective feature extractors for parsing. We demonstrate the effectiveness of the approach by applying it to a greedy transition-based parser as well as to a globally optimized graph-based parser. The resulting parsers have very simple architectures, and match or surpass the state-of-the-art accuracies on English and Chinese.

1 Introduction

The focus of this paper is on feature representation for dependency parsing, using recent techniques from the neural-networks ("deep learning") literature. Modern approaches to dependency parsing can be broadly categorized into graph-based and transition-based parsers (Kübler et al., 2009). Graph-based parsers (McDonald, 2006) treat parsing as a search-based structured prediction problem in which the goal is learning a scoring function over dependency trees such that the correct tree is scored above all other trees. Transition-based parsers (Nivre, 2004; Nivre, 2008) treat parsing as a sequence of actions that produce a parse tree, and a classifier is trained to score the possible actions at each stage of the process and guide the parsing process. Perhaps the simplest graph-based parsers are arc-factored (first order) models (McDonald, 2006), in which the scoring function for a tree decomposes over the individual arcs of the tree. More elaborate models look at larger (overlapping) parts, requiring more sophisticated inference and training algorithms (Martins et al., 2009; Koo and Collins, 2010). The basic transition-based parsers work in a greedy manner, performing a series of locally-optimal decisions, and boast very fast parsing speeds. More advanced transition-based parsers introduce some search into the process using a beam (Zhang and Clark, 2008) or dynamic programming (Huang and Sagae, 2010).

Regardless of the details of the parsing framework being used, a crucial step in parser design is choosing the right feature function for the underlying statistical model. Recent work (see Section 2.2 for an overview) attempts to alleviate parts of the feature function design problem by moving from linear to non-linear models, enabling the modeler to focus on a small set of "core" features and leaving it up to the machine-learning machinery to come up with good feature combinations (Chen and Manning, 2014; Pei et al., 2015; Lei et al., 2014; Taub-Tabib et al., 2015). However, the need to carefully define a set of core features remains. For example, the work of Chen and Manning (2014) uses 18 different elements in its feature function, while the work of Pei et al. (2015) uses 21 different elements. Other works, notably Dyer et al. (2015) and Le and Zuidema (2014), propose more sophisticated feature representations, in which the feature engineering is replaced with architecture engineering.

In this work, we suggest an approach which is much simpler in terms of both feature engineering


Transactions of the Association for Computational Linguistics, vol. 4, pp. 313–327, 2016. Action Editor: Marco Kuhlmann.
Submission batch: 2/2016; Published 7/2016.
© 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
and architecture engineering. Our proposal (Section 3) is centered around BiRNNs (Irsoy and Cardie, 2014; Schuster and Paliwal, 1997), and more specifically BiLSTMs (Graves, 2008), which are strong and trainable sequence models (see Section 2.3). The BiLSTM excels at representing elements in a sequence (i.e., words) together with their contexts, capturing the element and an "infinite" window around it. We represent each word by its BiLSTM encoding, and use a concatenation of a minimal set of such BiLSTM encodings as our feature function, which is then passed to a non-linear scoring function (multi-layer perceptron). Crucially, the BiLSTM is trained with the rest of the parser in order to learn a good feature representation for the parsing problem. If we set aside the inherent complexity of the BiLSTM itself and treat it as a black box, our proposal results in a pleasingly simple feature extractor.

We demonstrate the effectiveness of the approach by using the BiLSTM feature extractor in two parsing architectures, transition-based (Section 4) as well as graph-based (Section 5). In the graph-based parser, we jointly train a structured-prediction model on top of a BiLSTM, propagating errors from the structured objective all the way back to the BiLSTM feature-encoder. To the best of our knowledge, we are the first to perform such end-to-end training of a structured prediction model and a recurrent feature extractor for non-sequential outputs.[1]

[1] Structured training of sequence tagging models over RNN-based representations was explored by Chiu and Nichols (2016) and Lample et al. (2016).

Aside from the novelty of the BiLSTM feature extractor and the end-to-end structured training, we rely on existing models and techniques from the parsing and structured prediction literature. We stick to the simplest parsers in each category – greedy inference for the transition-based architecture, and a first-order, arc-factored model for the graph-based architecture. Despite the simplicity of the parsing architectures and the feature functions, we achieve near state-of-the-art parsing accuracies in both English (93.1 UAS) and Chinese (86.6 UAS), using a first-order parser with two features and while training solely on Treebank data, without relying on semi-supervised signals such as pre-trained word embeddings (Chen and Manning, 2014), word-clusters (Koo et al., 2008), or techniques such as tri-training (Weiss et al., 2015). When also including pre-trained word embeddings, we obtain further improvements, with accuracies of 93.9 UAS (English) and 87.6 UAS (Chinese) for a greedy transition-based parser with 11 features, and 93.6 UAS (English) / 87.4 UAS (Chinese) for a greedy transition-based parser with 4 features.

2 Background and Notation

Notation. We use $x_{1:n}$ to denote a sequence of $n$ vectors $x_1, \ldots, x_n$. $F_\theta(\cdot)$ is a function parameterized with parameters $\theta$. We write $F_L(\cdot)$ as shorthand for $F_{\theta_L}$ – an instantiation of $F$ with a specific set of parameters $\theta_L$. We use $\circ$ to denote a vector concatenation operation, and $v[i]$ to denote an indexing operation taking the $i$th element of a vector $v$.

2.1 Feature Functions in Dependency Parsing

Traditionally, state-of-the-art parsers rely on linear models over hand-crafted feature functions. The feature functions look at core components (e.g. "word on top of stack", "leftmost child of the second-to-top word on the stack", "distance between the head and the modifier words"), and are comprised of several templates, where each template instantiates a binary indicator function over a conjunction of core elements (resulting in features of the form "word on top of stack is X and leftmost child is Y and ..."). The design of the feature function – which components to consider and which combinations of components to include – is a major challenge in parser design. Once a good feature function is proposed in a paper it is usually adopted in later works, and sometimes tweaked to improve performance. Examples of good feature functions are the feature-set proposed by Zhang and Nivre (2011) for transition-based parsing (including roughly 20 core components and 72 feature templates), and the feature-set proposed by McDonald et al. (2005) for graph-based parsing, with the paper listing 18 templates for a first-order parser, while the first-order feature-extractor in the actual implementation's code (MSTParser[2]) includes roughly a hundred feature templates.

[2] http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html
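
To make the notion of a feature template concrete, the following is a minimal sketch of how templates over core elements turn into binary indicator features that a linear model then scores. The core-element names (`s0_word`, `b0_pos`, ...), the three templates and the weights are hypothetical illustrations and are not taken from any of the feature sets cited above.

```python
from typing import Dict, List, Tuple

# Hypothetical core elements read off a parser state (names are illustrative).
CORE: Dict[str, str] = {
    "s0_word": "fox", "s0_pos": "NN",
    "b0_word": "jumped", "b0_pos": "VBD",
}

# Each template is a conjunction (tuple) of core-element names.
TEMPLATES: List[Tuple[str, ...]] = [
    ("s0_word",),
    ("s0_pos", "b0_pos"),
    ("s0_word", "b0_pos"),
]

def instantiate(core: Dict[str, str], templates: List[Tuple[str, ...]]) -> List[str]:
    """Return the names of the binary indicator features that fire (value 1)."""
    return ["&".join(f"{name}={core[name]}" for name in tmpl) for tmpl in templates]

def linear_score(features: List[str], weights: Dict[str, float]) -> float:
    """Linear model: sum the weights of the active indicator features."""
    return sum(weights.get(f, 0.0) for f in features)
```

Every new interaction between core elements needs its own template here, which is the manual effort that the approaches in Section 2.2 try to reduce.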

The core features in a transition-based parser usually look at information such as the word-identity and part-of-speech (POS) tags of a fixed number of words on top of the stack, a fixed number of words on the top of the buffer, the modifiers (usually left-most and right-most) of items on the stack and on the buffer, the number of modifiers of these elements, parents of words on the stack, and the length of the spans spanned by the words on the stack. The core features of a first-order graph-based parser usually take into account the word and POS of the head and modifier items, as well as POS-tags of the items around the head and modifier, POS tags of items between the head and modifier, and the distance and direction between the head and modifier.

2.2 Related Research Efforts

Coming up with a good feature-set for a parser is a hard and time-consuming task, and many researchers attempt to reduce the required manual effort. The work of Lei et al. (2014) suggests a low-rank tensor representation to automatically find good feature combinations. Taub-Tabib et al. (2015) suggest a kernel-based approach to implicitly consider all possible feature combinations over sets of core-features. The recent popularity of neural networks prompted a move from templates of sparse, binary indicator features to dense core feature encodings fed into non-linear classifiers. Chen and Manning (2014) encode each core feature of a greedy transition-based parser as a dense low-dimensional vector, and the vectors are then concatenated and fed into a non-linear classifier (multi-layer perceptron) which can potentially capture arbitrary feature combinations. Weiss et al. (2015) showed further gains using the same approach coupled with a somewhat improved set of core features, a more involved network architecture with skip-layers, beam search-decoding, and careful hyper-parameter tuning. Pei et al. (2015) apply a similar methodology to graph-based parsing. While the move to neural-network classifiers alleviates the need for hand-crafting feature-combinations, the need to carefully define a set of core features remains. For example, the feature representation in Chen and Manning (2014) is a concatenation of 18 word vectors, 18 POS vectors and 12 dependency-label vectors.[3]

[3] In all of these neural-network based approaches, the vector representations of words were initialized using pre-trained word-embeddings derived from a large corpus external to the training data. This puts the approaches in the semi-supervised category, making it hard to tease apart the contribution of the automatic feature-combination component from that of the semi-supervised component.

The above works tackle the effort in hand-crafting effective feature combinations. A different line of work attacks the feature-engineering problem by suggesting novel neural-network architectures for encoding the parser state, including intermediately-built subtrees, as vectors which are then fed to non-linear classifiers. Titov and Henderson (2007) encode the parser state using incremental sigmoid-belief networks. In the work of Dyer et al. (2015), the entire stack and buffer of a transition-based parser are encoded as stack-LSTMs, where each stack element is itself based on a compositional representation of parse trees. Le and Zuidema (2014) encode each tree node as two compositional representations capturing the inside and outside structures around the node, and feed the representations into a reranker. A similar reranking approach, this time based on convolutional neural networks, is taken by Zhu et al. (2015). Finally, in Kiperwasser and Goldberg (2016) we present an Easy-First parser based on a novel hierarchical-LSTM tree encoding.

In contrast to these, the approach we present in this work results in much simpler feature functions, without resorting to elaborate network architectures or compositional tree representations.

Work by Vinyals et al. (2015) employs a sequence-to-sequence with attention architecture for constituency parsing. Each token in the input sentence is encoded in a deep-BiLSTM representation, and then the tokens are fed as input to a deep-LSTM that predicts a sequence of bracketing actions based on the already predicted bracketing as well as the encoded BiLSTM vectors. A trainable attention mechanism is used to guide the parser to relevant BiLSTM vectors at each stage. This architecture shares with ours the use of BiLSTM encoding and end-to-end training. The sequence of bracketing actions can be interpreted as a sequence of Shift and Reduce operations of a transition-based parser. However, while the parser of Vinyals et al.

relies on a trainable attention mechanism for focusing on specific BiLSTM vectors, parsers in the transition-based family we use in Section 4 use a human designed stack and buffer mechanism to manually direct the parser's attention. While the effectiveness of the trainable attention approach is impressive, the stack-and-buffer guidance of transition-based parsers results in more robust learning. Indeed, work by Cross and Huang (2016), published while working on the camera-ready version of this paper, shows that the same methodology as ours is highly effective also for greedy, transition-based constituency parsing, surpassing the beam-based architecture of Vinyals et al. (88.3F vs. 89.8F points) when trained on the Penn Treebank dataset and without using orthogonal methods such as ensembling and up-training.

2.3 Bidirectional Recurrent Neural Networks

Recurrent neural networks (RNNs) are statistical learners for modeling sequential data. An RNN allows one to model the $i$th element in the sequence based on the past – the elements $x_{1:i}$ up to and including it. The RNN model provides a framework for conditioning on the entire history $x_{1:i}$ without resorting to the Markov assumption which is traditionally used for modeling sequences. RNNs were shown to be capable of learning to count, as well as to model line lengths and complex phenomena such as bracketing and code indentation (Karpathy et al., 2015). Our proposed feature extractors are based on a bidirectional recurrent neural network (BiRNN), an extension of RNNs that takes into account both the past $x_{1:i}$ and the future $x_{i:n}$. We use a specific flavor of RNN called a long short-term memory network (LSTM). For brevity, we treat RNN as an abstraction, without getting into the mathematical details of the implementation of the RNNs and LSTMs. For further details on RNNs and LSTMs, the reader is referred to Goldberg (2015) and Cho (2015).

The recurrent neural network (RNN) abstraction is a parameterized function $\mathrm{RNN}_\theta(x_{1:n})$ mapping a sequence of $n$ input vectors $x_{1:n}$, $x_i \in \mathbb{R}^{d_{in}}$, to a sequence of $n$ output vectors $h_{1:n}$, $h_i \in \mathbb{R}^{d_{out}}$. Each output vector $h_i$ is conditioned on all the input vectors $x_{1:i}$, and can be thought of as a summary of the prefix $x_{1:i}$ of $x_{1:n}$. In our notation, we ignore the intermediate vectors $h_{1:n-1}$ and take the output of $\mathrm{RNN}_\theta(x_{1:n})$ to be the vector $h_n$.

A bidirectional RNN is composed of two RNNs, $\mathrm{RNN}_F$ and $\mathrm{RNN}_R$, one reading the sequence in its regular order, and the other reading it in reverse. Concretely, given a sequence of vectors $x_{1:n}$ and a desired index $i$, the function $\mathrm{BiRNN}_\theta(x_{1:n}, i)$ is defined as:

$$\mathrm{BiRNN}_\theta(x_{1:n}, i) = \mathrm{RNN}_F(x_{1:i}) \circ \mathrm{RNN}_R(x_{n:i})$$

The vector $v_i = \mathrm{BiRNN}(x_{1:n}, i)$ is then a representation of the $i$th item in $x_{1:n}$, taking into account both the entire history $x_{1:i}$ and the entire future $x_{i:n}$ by concatenating the matching RNNs. We can view the BiRNN encoding of an item $i$ as representing the item $i$ together with a context of an infinite window around it.

Computational Complexity. Computing the BiRNN vector encoding of the $i$th element of a sequence $x_{1:n}$ requires $O(n)$ time for computing the two RNNs and concatenating their outputs. A naive approach of computing the bidirectional representation of all $n$ elements results in $O(n^2)$ computation. However, it is trivial to compute the BiRNN encoding of all sequence items in linear time by pre-computing $\mathrm{RNN}_F(x_{1:n})$ and $\mathrm{RNN}_R(x_{n:1})$, keeping the intermediate representations, and concatenating the required elements as needed.

BiRNN Training. Initially, the BiRNN encodings $v_i$ do not capture any particular information. During training, the encoded vectors $v_i$ are fed into further network layers, until at some point a prediction is made, and a loss is incurred. The back-propagation algorithm is used to compute the gradients of all the parameters in the network (including the BiRNN parameters) with respect to the loss, and an optimizer is used to update the parameters according to the gradients. The training procedure causes the BiRNN function to extract from the input sequence $x_{1:n}$ the relevant information for the task at hand.

Going deeper. We use a variant of deep bidirectional RNN (or $k$-layer BiRNN) which is composed of $k$ BiRNN functions $\mathrm{BiRNN}_1, \ldots, \mathrm{BiRNN}_k$ that feed into each other: the output $\mathrm{BiRNN}_\ell(x_{1:n}, 1), \ldots, \mathrm{BiRNN}_\ell(x_{1:n}, n)$ of $\mathrm{BiRNN}_\ell$ becomes the input of $\mathrm{BiRNN}_{\ell+1}$. Stacking

BiRNNs in this way has been empirically shown to be effective (Irsoy and Cardie, 2014). In this work, we use BiRNNs and deep-BiRNNs interchangeably, specifying the number of layers when needed.

Historical Notes. RNNs were introduced by Elman (1990), and extended to BiRNNs by Schuster and Paliwal (1997). The LSTM variant of RNNs is due to Hochreiter and Schmidhuber (1997). BiLSTMs were recently popularized by Graves (2008), and deep BiRNNs were introduced to NLP by Irsoy and Cardie (2014), who used them for sequence tagging. In the context of parsing, Lewis et al. (2016) and Vaswani et al. (2016) use a BiLSTM sequence tagging model to assign a CCG supertag for each token in the sentence. Lewis et al. (2016) feed the resulting supertag sequence into an A* CCG parser. Vaswani et al. (2016) add an additional layer of LSTM which receives the BiLSTM representation together with the k-best supertags for each word and outputs the most likely supertag given previous tags, and then feed the predicted supertags to a discriminatively trained parser. In both works, the BiLSTM is trained to produce accurate CCG supertags, and is not aware of the global parsing objective.

3 Our Approach

We propose to replace the hand-crafted feature functions with minimally-defined feature functions which make use of automatically learned Bidirectional LSTM representations.

Given an $n$-word input sentence $s$ with words $w_1, \ldots, w_n$ together with the corresponding POS tags $t_1, \ldots, t_n$,[4] we associate each word $w_i$ and POS $t_i$ with embedding vectors $e(w_i)$ and $e(t_i)$, and create a sequence of input vectors $x_{1:n}$ in which each $x_i$ is a concatenation of the corresponding word and POS vectors:

$$x_i = e(w_i) \circ e(t_i)$$

[4] In this work the tag sequence is assumed to be given, and in practice is predicted by an external model. Future work will address relaxing this assumption.

The embeddings are trained together with the model. This encodes each word in isolation, disregarding its context. We introduce context by representing each input element as its (deep) BiLSTM vector, $v_i$:

$$v_i = \mathrm{BiLSTM}(x_{1:n}, i)$$

Our feature function $\phi$ is then a concatenation of a small number of BiLSTM vectors. The exact feature function is parser dependent and will be discussed when discussing the corresponding parsers. The resulting feature vectors are then scored using a non-linear function, namely a multi-layer perceptron with one hidden layer (MLP):

$$\mathrm{MLP}_\theta(x) = W^2 \cdot \tanh(W^1 \cdot x + b^1) + b^2$$

where $\theta = \{W^1, W^2, b^1, b^2\}$ are the model parameters.

Besides using the BiLSTM-based feature functions, we make use of standard parsing techniques. Crucially, the BiLSTM is trained jointly with the rest of the parsing objective. This allows it to learn representations which are suitable for the parsing task.

Consider a concatenation of two BiLSTM vectors $(v_i \circ v_j)$ scored using an MLP. The scoring function has access to the words and POS-tags of $v_i$ and $v_j$, as well as the words and POS-tags of the words in an infinite window surrounding them. As LSTMs are known to capture length and sequence position information, it is very plausible that the scoring function can be sensitive also to the distance between $i$ and $j$, their ordering, and the sequential material between them.

Parsing-time Complexity. Once the BiLSTM is trained, parsing is performed by first computing the BiLSTM encoding $v_i$ for each word in the sentence (a linear time operation).[5] Then, parsing proceeds as usual, where the feature extraction involves a concatenation of a small number of the pre-computed $v_i$ vectors.

[5] While the BiLSTM computation is quite efficient as it is, as demonstrated by Lewis et al. (2016), if using a GPU implementation the BiLSTM encoding can be efficiently performed over many sentences in parallel, making its computation cost almost negligible.

4 Transition-based Parser

[Figure 1 diagram not reproduced. Recoverable content: a parser configuration (stack items s2, s1, s0 and buffer items b0, b1, b2, b3) over the sentence "the brown fox jumped over the lazy dog ROOT"; each input vector x_w feeds a forward LSTM (LSTM^f) and a backward LSTM (LSTM^b), whose outputs are concatenated into a vector V_w; an MLP over the selected V vectors outputs (Score_LeftArc, Score_RightArc, Score_Shift).]

Figure 1: Illustration of the neural model scheme of the transition-based parser when calculating the scores of the possible transitions in a given configuration. The configuration (stack and buffer) is depicted on the top. Each transition is scored using an MLP that is fed the BiLSTM encodings of the first word in the buffer and the three words at the top of the stack (the colors of the words correspond to colors of the MLP inputs above), and a transition is picked greedily. Each $x_i$ is a concatenation of a word and a POS vector, and possibly an additional external embedding vector for the word. The figure depicts a single-layer BiLSTM, while in practice we use two layers. When parsing a sentence, we iteratively compute scores for all possible transitions and apply the best scoring action until the final configuration is reached.

We begin by integrating the feature extractor in a transition-based parser (Nivre, 2008). We follow the notation in Goldberg and Nivre (2013). The

transition-based parsing framework assumes a transition system, an abstract machine that processes sentences and produces parse trees. The transition system has a set of configurations and a set of transitions which are applied to configurations. When parsing a sentence, the system is initialized to an initial configuration based on the input sentence, and transitions are repeatedly applied to this configuration. After a finite number of transitions, the system arrives at a terminal configuration, and a parse tree is read off the terminal configuration. In a greedy parser, a classifier is used to choose the transition to take in each configuration, based on features extracted from the configuration itself. The parsing algorithm is presented in Algorithm 1 below.

Algorithm 1 Greedy transition-based parsing
1: Input: sentence $s = w_1, \ldots, w_n, t_1, \ldots, t_n$, parameterized function $\mathrm{SCORE}_\theta(\cdot)$ with parameters $\theta$.
2: $c \leftarrow \mathrm{INITIAL}(s)$
3: while not $\mathrm{TERMINAL}(c)$ do
4:     $\hat{t} \leftarrow \arg\max_{t \in \mathrm{LEGAL}(c)} \mathrm{SCORE}_\theta(\phi(c), t)$
5:     $c \leftarrow \hat{t}(c)$
6: return tree($c$)

Given a sentence $s$, the parser is initialized with the configuration $c$ (line 2). Then, a feature function $\phi(c)$ represents the configuration $c$ as a vector, which is fed to a scoring function SCORE assigning scores to (configuration, transition) pairs. SCORE scores the possible transitions $t$, and the highest scoring transition $\hat{t}$ is chosen (line 4). The transition $\hat{t}$ is applied to the configuration, resulting in a new parser configuration. The process ends when reaching a final configuration, from which the resulting parse tree is read and returned (line 6).

Transition systems differ by the way they define configurations, and by the particular set of transitions available to them. A parser is determined by

the choice of a transition system, a feature function $\phi$ and a scoring function SCORE. Our choices are detailed below.

The Arc-Hybrid System. Many transition systems exist in the literature. In this work, we use the arc-hybrid transition system (Kuhlmann et al., 2011), which is similar to the more popular arc-standard system (Nivre, 2004), but for which an efficient dynamic oracle is available (Goldberg and Nivre, 2012; Goldberg and Nivre, 2013). In the arc-hybrid system, a configuration $c = (\sigma, \beta, T)$ consists of a stack $\sigma$, a buffer $\beta$, and a set $T$ of dependency arcs. Both the stack and the buffer hold integer indices pointing to sentence elements. Given a sentence $s = w_1, \ldots, w_n, t_1, \ldots, t_n$, the system is initialized with an empty stack, an empty arc set, and $\beta = 1, \ldots, n, \mathrm{ROOT}$, where ROOT is the special root index. Any configuration $c$ with an empty stack and a buffer containing only ROOT is terminal, and the parse tree is given by the arc set $T_c$ of $c$. The arc-hybrid system allows 3 possible transitions, SHIFT, $\mathrm{LEFT}_\ell$ and $\mathrm{RIGHT}_\ell$, defined as:

$$\mathrm{SHIFT}[(\sigma,\ b_0|\beta,\ T)] = (\sigma|b_0,\ \beta,\ T)$$
$$\mathrm{LEFT}_\ell[(\sigma|s_1|s_0,\ b_0|\beta,\ T)] = (\sigma|s_1,\ b_0|\beta,\ T \cup \{(b_0, s_0, \ell)\})$$
$$\mathrm{RIGHT}_\ell[(\sigma|s_1|s_0,\ \beta,\ T)] = (\sigma|s_1,\ \beta,\ T \cup \{(s_1, s_0, \ell)\})$$

The SHIFT transition moves the first item of the buffer ($b_0$) to the stack. The $\mathrm{LEFT}_\ell$ transition removes the first item on top of the stack ($s_0$) and attaches it as a modifier to $b_0$ with label $\ell$, adding the arc $(b_0, s_0, \ell)$. The $\mathrm{RIGHT}_\ell$ transition removes $s_0$ from the stack and attaches it as a modifier to the next item on the stack ($s_1$), adding the arc $(s_1, s_0, \ell)$.

Scoring Function. Traditionally, the scoring function $\mathrm{SCORE}_\theta(x, t)$ is a discriminative linear model of the form $\mathrm{SCORE}_W(x, t) = (W \cdot x)[t]$. The linearity of SCORE required the feature function $\phi(\cdot)$ to encode non-linearities in the form of combination features. We follow Chen and Manning (2014) and replace the linear scoring model with an MLP:

$$\mathrm{SCORE}_\theta(x, t) = \mathrm{MLP}_\theta(x)[t]$$

Simple Feature Function. The feature function $\phi(c)$ is typically complex (see Section 2.1). Our feature function is the concatenated BiLSTM vectors of the top 3 items on the stack and the first item on the buffer. I.e., for a configuration $c = (\ldots|s_2|s_1|s_0,\ b_0|\ldots,\ T)$ the feature extractor is defined as:

$$\phi(c) = v_{s_2} \circ v_{s_1} \circ v_{s_0} \circ v_{b_0}, \qquad v_i = \mathrm{BiLSTM}(x_{1:n}, i)$$

This feature function is rather minimal: it takes into account the BiLSTM representations of $s_1$, $s_0$ and $b_0$, which are the items affected by the possible transitions being scored, as well as one extra stack context $s_2$.[6] Figure 1 depicts transition scoring with our architecture and this feature function. Note that, unlike previous work, this feature function does not take into account $T$, the already built structure. The high parsing accuracies in the experimental sections suggest that the BiLSTM encoding is capable of estimating a lot of the missing information based on the provided stack and buffer elements and the sequential content between them.

[6] An additional buffer context is not needed, as $b_1$ is by definition adjacent to $b_0$, a fact that we expect the BiLSTM encoding of $b_0$ to capture. In contrast, $b_0$, $s_0$, $s_1$ and $s_2$ are not necessarily adjacent to each other in the original sentence.

While not explored in this work, relying on only four word indices for scoring an action results in very compact state signatures, making our proposed feature representation very appealing for use in transition-based parsers that employ dynamic-programming search (Huang and Sagae, 2010; Kuhlmann et al., 2011).

Extended Feature Function. One of the benefits of the greedy transition-based parsing framework is precisely its ability to look at arbitrary features from the already built tree. If we allow a somewhat less minimal feature function, we could add the BiLSTM vectors corresponding to the right-most and left-most modifiers of $s_0$, $s_1$ and $s_2$, as well as the left-most modifier of $b_0$, reaching a total of 11 BiLSTM vectors. We refer to this as the extended feature set. As we'll see in Section 6, using the extended set does indeed improve parsing accuracies when using pre-trained word embeddings, but has a minimal effect in the fully-supervised case.[7]

[7] We did not experiment with other feature configurations. It is quite possible that not all of the additional 7 child encodings are needed for the observed accuracy gains, and that a smaller feature set will yield similar or even better improvements.
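
As an illustration only (this is not the authors' implementation), the arc-hybrid transitions, the four-vector feature function, the one-hidden-layer MLP scorer and the greedy loop of Algorithm 1 can be sketched in plain Python as below. Several choices here are assumptions made for the sketch: ROOT is encoded as index 0, arc labels are dropped, ROOT is never shifted onto the stack, missing stack slots are filled with a `pad` vector, and the BiLSTM vectors are supplied as a ready-made dictionary `v` rather than computed.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

ROOT = 0                                   # assumed encoding of the root index
TRANSITIONS = ["SHIFT", "LEFT", "RIGHT"]   # unlabeled, for brevity
Arc = Tuple[int, int]                      # (head, modifier)
Config = Tuple[List[int], List[int], Set[Arc]]  # (stack, buffer, arcs)
Vec = List[float]

def initial(n: int) -> Config:
    """Empty stack, buffer 1..n followed by ROOT, empty arc set."""
    return [], list(range(1, n + 1)) + [ROOT], set()

def is_terminal(c: Config) -> bool:
    stack, buffer, _ = c
    return not stack and buffer == [ROOT]

def legal(c: Config) -> List[str]:
    stack, buffer, _ = c
    moves = []
    if buffer[0] != ROOT:       # assumption: ROOT stays in the buffer
        moves.append("SHIFT")
    if stack:                   # LEFT needs s0 (b0 always exists)
        moves.append("LEFT")
    if len(stack) >= 2:         # RIGHT needs s1 and s0
        moves.append("RIGHT")
    return moves

def apply_transition(c: Config, t: str) -> Config:
    stack, buffer, arcs = c
    if t == "SHIFT":
        return stack + [buffer[0]], buffer[1:], arcs
    if t == "LEFT":             # s0 becomes a modifier of b0
        return stack[:-1], buffer, arcs | {(buffer[0], stack[-1])}
    if t == "RIGHT":            # s0 becomes a modifier of s1
        return stack[:-1], buffer, arcs | {(stack[-2], stack[-1])}
    raise ValueError(f"unknown transition: {t}")

def phi(c: Config, v: Dict[int, Vec], pad: Vec) -> Vec:
    """Concatenate the vectors of s2, s1, s0 and b0 (pad where missing)."""
    stack, buffer, _ = c
    slots = [stack[-k] if len(stack) >= k else None for k in (3, 2, 1)]
    slots.append(buffer[0])
    feats: Vec = []
    for i in slots:
        feats.extend(pad if i is None else v[i])
    return feats

def mlp(x: Vec, W1: Sequence[Vec], b1: Vec, W2: Sequence[Vec], b2: Vec) -> Vec:
    """W2 . tanh(W1 . x + b1) + b2, one output score per transition."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def greedy_parse(n: int, v: Dict[int, Vec], pad: Vec, params) -> Set[Arc]:
    """Repeatedly apply the highest-scoring legal transition."""
    c = initial(n)
    while not is_terminal(c):
        scores = mlp(phi(c, v, pad), *params)
        best = max(legal(c), key=lambda t: scores[TRANSITIONS.index(t)])
        c = apply_transition(c, best)
    return c[2]
```

Because each SHIFT consumes one buffer item and each LEFT or RIGHT removes one stack item, the loop in this sketch performs exactly 2n transitions for a sentence of n words.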

4.1 Details of the Training Algorithm

The training objective is to set the score of correct transitions above the scores of incorrect transitions. We use a margin-based objective, aiming to maximize the margin between the highest scoring correct action and the highest scoring incorrect action. The hinge loss at each parsing configuration $c$ is defined as:

$$\max\Big(0,\; 1 - \max_{t_o \in G} \mathrm{MLP}\big(\phi(c)\big)[t_o] + \max_{t_p \in A \setminus G} \mathrm{MLP}\big(\phi(c)\big)[t_p]\Big)$$

where $A$ is the set of possible transitions and $G$ is the set of correct (gold) transitions at the current stage. At each stage of the training process the parser scores the possible transitions $A$, incurs a loss, selects a transition to follow, and moves to the next configuration based on it. The local losses are summed throughout the parsing process of a sentence, and the parameters are updated with respect to the sum of the losses at sentence boundaries.8 The gradients of the entire network (including the MLP and the BiLSTM) with respect to the sum of the losses are calculated using the backpropagation algorithm. As usual, we perform several training iterations over the training corpus, shuffling the order of sentences in each iteration.

8 To increase gradient stability and training speed, we simulate mini-batch updates by only updating the parameters when the sum of local losses contains at least 50 non-zero elements. Sums of fewer elements are carried across sentences. This assures us a sufficient number of gradient samples for every update, thus minimizing the effect of gradient instability.

Error-Exploration and Dynamic Oracle Training We follow Goldberg and Nivre (2013); Goldberg and Nivre (2012) in using error-exploration training with a dynamic oracle, which we briefly describe below.

At each stage in the training process, the parser assigns scores to all the possible transitions $t \in A$. It then selects a transition, applies it, and moves to the next step. Which transition should be followed? A common approach follows the highest scoring transition that can lead to the gold tree. However, when training in this way the parser sees only configurations that result from following correct actions, and as a result tends to suffer from error propagation at test time. Instead, in error-exploration training the parser follows the highest scoring action in $A$ during training even if this action is incorrect, exposing it to configurations that result from erroneous decisions. This strategy requires defining the set $G$ such that the correct actions to take are well-defined also for states that cannot lead to the gold tree. Such a set $G$ is called a dynamic oracle. We perform error-exploration training using the dynamic oracle defined by Goldberg and Nivre (2013).

Aggressive Exploration We found that even when using error-exploration, after one iteration the model remembers the training set quite well, and does not make enough errors to make error-exploration effective. In order to expose the parser to more errors, we follow an aggressive-exploration scheme: we sometimes follow incorrect transitions also if they score below correct transitions. Specifically, when the score of the correct transition is greater than that of the wrong transition but the difference is smaller than a margin constant, we choose to follow the incorrect action with probability $p_{agg}$ (we use $p_{agg} = 0.1$ in our experiments).

Summary The greedy transition-based parser follows standard techniques from the literature (margin-based objective, dynamic oracle training, error exploration, MLP-based non-linear scoring function). We depart from the literature by replacing the hand-crafted feature function over carefully selected components of the configuration with a concatenation of BiLSTM representations of a few prominent items on the stack and the buffer, and training the BiLSTM encoder jointly with the rest of the network.

5 Graph-based Parser

Graph-based parsing follows the common structured prediction paradigm (Taskar et al., 2005; McDonald et al., 2005):

$$\mathrm{predict}(s) = \arg\max_{y \in \mathcal{Y}(s)} \mathrm{score}_{global}(s, y)$$

$$\mathrm{score}_{global}(s, y) = \sum_{part \in y} \mathrm{score}_{local}(s, part)$$

Given an input sentence $s$ (and the corresponding sequence of vectors $x_{1:n}$) we look for the highest-

[Figure 2 diagram not reproduced. It shows the input vectors x_the, x_brown, x_fox, x_jumped and x_* feeding forward and backward LSTMs (LSTM^f, LSTM^b); their outputs are concatenated into V_the, V_brown, V_fox, V_jumped and V_*, pairs of which are passed to MLPs whose outputs are summed (+).]

Figure 2: Illustration of the neural model scheme of the graph-based parser when calculating the score of a given parse tree. The parse tree is depicted below the sentence. Each dependency arc in the sentence is scored using an MLP that is fed the BiLSTM encoding of the words at the arc's end points (the colors of the arcs correspond to colors of the MLP inputs above), and the individual arc scores are summed to produce the final score. All the MLPs share the same parameters. The figure depicts a single-layer BiLSTM, while in practice we use two layers. When parsing a sentence, we compute scores for all possible $n^2$ arcs, and find the best scoring tree using a dynamic-programming algorithm.
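The scoring scheme in the caption (one shared MLP per arc, arc scores summed into a tree score, a table of scores over all $n^2$ head-modifier pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: random lists stand in for the BiLSTM vectors, the MLP has one tanh hidden layer and a scalar output without an output bias, and the names `arc_score`, `all_arc_scores` and `tree_score` are hypothetical. The dynamic-programming search for the best tree is not shown.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def arc_score(v_h, v_m, W, b, u):
    """MLP over the concatenated head and modifier vectors, with a scalar output."""
    pre = matvec(W, v_h + v_m)  # list concatenation plays the role of the concat operator
    hidden = [math.tanh(p + bias) for p, bias in zip(pre, b)]
    return sum(ui * hi for ui, hi in zip(u, hidden))

def all_arc_scores(vectors, W, b, u):
    """Table S with S[h][m] = score of the arc from head h to modifier m."""
    n = len(vectors)
    return [[arc_score(vectors[h], vectors[m], W, b, u) for m in range(n)]
            for h in range(n)]

def tree_score(scores, arcs):
    """Arc-factored tree score: the sum of the scores of its (head, modifier) arcs."""
    return sum(scores[h][m] for h, m in arcs)

# Toy setup: 5 positions ("the brown fox jumped *"), 4-dim vectors, 3 hidden units.
rng = random.Random(0)
n, d, hidden_units = 5, 4, 3
vectors = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
W = [[rng.uniform(-1, 1) for _ in range(2 * d)] for _ in range(hidden_units)]
b = [rng.uniform(-1, 1) for _ in range(hidden_units)]
u = [rng.uniform(-1, 1) for _ in range(hidden_units)]

scores = all_arc_scores(vectors, W, b, u)
# One candidate tree: * -> jumped, jumped -> fox, fox -> the, fox -> brown.
arcs = [(4, 3), (3, 2), (2, 0), (2, 1)]
total = tree_score(scores, arcs)
```

Because every arc is scored by the same `W`, `b` and `u`, the table only needs to be computed once per sentence; any candidate tree is then scored by summing table entries.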

scoring parse tree $y$ in the space $\mathcal{Y}(s)$ of valid dependency trees over $s$. In order to make the search tractable, the scoring function is decomposed to the sum of local scores for each part independently.

In this work, we focus on the arc-factored graph-based approach presented in McDonald et al. (2005). Arc-factored parsing decomposes the score of a tree to the sum of the score of its head-modifier arcs $(h, m)$:

$$\mathrm{parse}(s) = \arg\max_{y \in \mathcal{Y}(s)} \sum_{(h,m) \in y} \mathrm{score}\big(\phi(s, h, m)\big)$$

Given the scores of the arcs, the highest scoring projective tree can be efficiently found using Eisner's decoding algorithm (1996). McDonald et al. and most subsequent work estimate the local score of an arc by a linear model parameterized by a weight vector $w$, and a feature function $\phi(s, h, m)$ assigning a sparse feature vector for an arc linking modifier $m$ to head $h$. We follow Pei et al. (2015) and replace the linear scoring function with an MLP.

The feature extractor $\phi(s, h, m)$ is usually complex, involving many elements (see Section 2.1). In contrast, our feature extractor uses merely the BiLSTM encoding of the head word and the modifier word:

$$\phi(s, h, m) = \mathrm{BiRNN}(x_{1:n}, h) \circ \mathrm{BiRNN}(x_{1:n}, m)$$

The final model is:

$$\begin{aligned}
\mathrm{parse}(s) &= \arg\max_{y \in \mathcal{Y}(s)} \mathrm{score}_{global}(s, y) \\
&= \arg\max_{y \in \mathcal{Y}(s)} \sum_{(h,m) \in y} \mathrm{score}\big(\phi(s, h, m)\big) \\
&= \arg\max_{y \in \mathcal{Y}(s)} \sum_{(h,m) \in y} \mathrm{MLP}(v_h \circ v_m) \\
v_i &= \mathrm{BiRNN}(x_{1:n}, i)
\end{aligned}$$

The architecture is illustrated in Figure 2.

Training The training objective is to set the score function such that the correct tree $y$ is scored above incorrect ones. We use a margin-based objective (McDonald et al., 2005; LeCun et al., 2006), aiming to maximize the margin between the score of the gold tree $y$ and the highest scoring incorrect tree $y'$. We define a hinge loss with respect to a gold tree $y$ as:

$$\max\Big(0,\; 1 - \sum_{(h,m) \in y} \mathrm{MLP}(v_h \circ v_m) + \max_{y' \neq y} \sum_{(h,m) \in y'} \mathrm{MLP}(v_h \circ v_m)\Big)$$

Each of the tree scores is then calculated by activating the MLP on the arc representations. The entire loss can be viewed as the sum of multiple neural networks, which is sub-differentiable. We calculate the gradients of the entire network (including the BiLSTM encoder and word embeddings).

Labeled Parsing Up to now, we described unlabeled parsing. A possible approach for adding labels is to score the combination of an unlabeled arc $(h, m)$ and its label $\ell$ by considering the label as part of the arc $(h, m, \ell)$. This results in $|Labels| \times |Arcs|$ parts that need to be scored, leading to slow parsing speeds and arguably a harder learning problem.

Instead, we chose to first predict the unlabeled structure using the model given above, and then predict the label of each resulting arc. Using this approach, the number of parts stays small, enabling fast parsing.

The labeling of an arc $(h, m)$ is performed using the same feature representation $\phi(s, h, m)$ fed into a different MLP predictor:

$$\mathrm{label}(h, m) = \arg\max_{\ell \in labels} \mathrm{MLP}_{LBL}(v_h \circ v_m)[\ell]$$

As before, we use a margin-based hinge loss. The labeler is trained on the gold trees.9 The BiLSTM encoder responsible for producing $v_h$ and $v_m$ is shared with the arc-factored parser: the same BiLSTM encoder is used in the parser and the labeler. This sharing of parameters can be seen as an instance of multi-task learning (Caruana, 1997). As we show in Section 6, the sharing is effective: training the BiLSTM feature encoder to be good at predicting arc-labels significantly improves the parser's unlabeled accuracy.

9 When training the labeled parser, we calculate the structure loss and the labeling loss for each training sentence, and sum the losses prior to computing the gradients.

Loss augmented inference In initial experiments, the network learned quickly and overfit the data. In order to remedy this, we found it useful to use loss augmented inference (Taskar et al., 2005). The intuition behind loss augmented inference is to update against trees which have high model scores and are also very wrong. This is done by augmenting the score of each part not belonging to the gold tree by adding a constant to its score. Formally, the loss transforms as follows:

$$\max\Big(0,\; 1 - \mathrm{score}(x, y) + \max_{y' \neq y} \sum_{part \in y'} \big(\mathrm{score}_{local}(x, part) + \mathbb{1}_{part \notin y}\big)\Big)$$

Speed improvements The arc-factored model requires the scoring of $n^2$ arcs. Scoring is performed using an MLP with one hidden layer, resulting in $n^2$ matrix-vector multiplications from the input to the hidden layer, and $n^2$ multiplications from the hidden to the output layer. The first $n^2$ multiplications involve larger dimensional input and output vectors, and are the most time consuming. Fortunately, these can be reduced to $2n$ multiplications and $n^2$ vector additions, by observing that the multiplication $W \cdot (v_h \circ v_m)$ can be written as $W^1 \cdot v_h + W^2 \cdot v_m$, where $W^1$ and $W^2$ are the first and second half of the matrix $W$, and reusing the products across different pairs.

Summary The graph-based parser is a straightforward first-order parser, trained with a margin-based hinge loss and loss-augmented inference. We depart from the literature by replacing the hand-crafted feature function with a concatenation of BiLSTM representations of the head and modifier words, and training the BiLSTM encoder jointly with the structured objective. We also introduce a novel multi-task learning approach for labeled parsing by training a second-stage arc-labeler sharing the same BiLSTM encoder with the unlabeled parser.

6 Experiments and Results

We evaluated our parsing model on English and Chinese data. For comparison purposes we follow the setup of Dyer et al. (2015).

Data For English, we used the Stanford Dependency (SD) (de Marneffe and Manning, 2008) conversion of the Penn Treebank (Marcus et al., 1993), using the standard train/dev/test splits with the

| System | Method | Representation | Emb | PTB-YM UAS | PTB-SD UAS | PTB-SD LAS | CTB UAS | CTB LAS |
|---|---|---|---|---|---|---|---|---|
| This work | graph, 1st order | 2 BiLSTM vectors | – | – | 93.1 | 91.0 | 86.6 | 85.1 |
| This work | transition (greedy, dyn-oracle) | 4 BiLSTM vectors | – | – | 93.1 | 91.0 | 86.2 | 85.0 |
| This work | transition (greedy, dyn-oracle) | 11 BiLSTM vectors | – | – | 93.2 | 91.2 | 86.5 | 84.9 |
| ZhangNivre11 | transition (beam) | large feature set (sparse) | – | 92.9 | – | – | 86.0 | 84.4 |
| Martins13 (TurboParser) | graph, 3rd order+ | large feature set (sparse) | – | 92.8 | 93.1 | – | – | – |
| Pei15 | graph, 2nd order | large feature set (dense) | – | 93.0 | – | – | – | – |
| Dyer15 | transition (greedy) | Stack-LSTM + composition | – | – | 92.4 | 90.0 | 85.7 | 84.1 |
| Ballesteros16 | transition (greedy, dyn-oracle) | Stack-LSTM + composition | – | – | 92.7 | 90.6 | 86.1 | 84.5 |
| This work | graph, 1st order | 2 BiLSTM vectors | YES | – | 93.0 | 90.9 | 86.5 | 84.9 |
| This work | transition (greedy, dyn-oracle) | 4 BiLSTM vectors | YES | – | 93.6 | 91.5 | 87.4 | 85.9 |
| This work | transition (greedy, dyn-oracle) | 11 BiLSTM vectors | YES | – | 93.9 | 91.9 | 87.6 | 86.1 |
| Weiss15 | transition (greedy) | large feature set (dense) | YES | – | 93.2 | 91.2 | – | – |
| Weiss15 | transition (beam) | large feature set (dense) | YES | – | 94.0 | 92.0 | – | – |
| Pei15 | graph, 2nd order | large feature set (dense) | YES | 93.3 | – | – | – | – |
| Dyer15 | transition (greedy) | Stack-LSTM + composition | YES | – | 93.1 | 90.9 | 87.1 | 85.5 |
| Ballesteros16 | transition (greedy, dyn-oracle) | Stack-LSTM + composition | YES | – | 93.6 | 91.4 | 87.6 | 86.2 |
| LeZuidema14 | reranking/blend | inside-outside recursive net | YES | 93.1 | 93.8 | 91.5 | – | – |
| Zhu15 | reranking/blend | recursive conv-net | YES | 93.8 | – | – | 85.7 | – |

Table 1: Test-set parsing results of various state-of-the-art parsing systems on the English (PTB) and Chinese (CTB) datasets. The systems that use embeddings may use different pre-trained embeddings. English results use predicted POS tags (different systems use different taggers), while Chinese results use gold POS tags. PTB-YM: English PTB, Yamada and Matsumoto head rules. PTB-SD: English PTB, Stanford Dependencies (different systems may use different versions of the Stanford converter). CTB: Chinese Treebank. reranking/blend in the Method column indicates a reranking system where the reranker score is interpolated with the base-parser's score. The different systems and the numbers reported from them are taken from: ZhangNivre11: (Zhang and Nivre, 2011); Martins13: (Martins et al., 2013); Weiss15: (Weiss et al., 2015); Pei15: (Pei et al., 2015); Dyer15: (Dyer et al., 2015); Ballesteros16: (Ballesteros et al., 2016); LeZuidema14: (Le and Zuidema, 2014); Zhu15: (Zhu et al., 2015).

same predicted POS-tags as used in Dyer et al. (2015); Chen and Manning (2014). This dataset contains a few non-projective trees. Punctuation symbols are excluded from the evaluation.

For Chinese, we use the Penn Chinese Treebank 5.1 (CTB5), using the train/test/dev splits of (Zhang and Clark, 2008; Dyer et al., 2015) with gold part-of-speech tags, also following (Dyer et al., 2015; Chen and Manning, 2014).

When using external word embeddings, we also use the same data as Dyer et al. (2015).10

10 We thank Dyer et al. for sharing their data with us.

Implementation Details The parsers are implemented in Python, using the PyCNN toolkit11 for neural network training. The code is available at the GitHub repository https://github.com/elikip/bist-parser. We use the LSTM variant implemented in PyCNN, and optimize using the Adam optimizer (Kingma and Ba, 2015). Unless otherwise noted, we use the default values provided by PyCNN (e.g. for random initialization, learning rates etc).

11 https://github.com/clab/cnn/tree/master/pycnn

The word and POS embeddings $e(w_i)$ and $e(p_i)$ are initialized to random values and trained together with the rest of the parsers' networks. In some experiments, we also introduce pre-trained word embeddings. In those cases, the vector representation of a word is a concatenation of its randomly-initialized vector embedding with its pre-trained word vector. Both are tuned during training. We use the same word vectors as in Dyer et al. (2015).

During training, we employ a variant of word dropout (Iyyer et al., 2015), and replace a word with the unknown-word symbol with probability that is inversely proportional to the frequency of the word. A word $w$ appearing $\#(w)$ times in the training corpus is replaced with the unknown symbol with probability $p_{unk}(w) = \frac{\alpha}{\#(w) + \alpha}$. If a word was dropped, the external embedding of the word is also dropped with probability 0.5.

We train the parsers for up to 30 iterations, and choose the best model according to the UAS accuracy on the development set.

Hyperparameter Tuning We performed a very minimal hyper-parameter search with the graph-
based parser, and use the same hyper-parameters for both parsers. The hyper-parameters of the final networks used for all the reported experiments are detailed in Table 2.

| Hyper-parameter | Value |
|---|---|
| Word embedding dimension | 100 |
| POS tag embedding dimension | 25 |
| Hidden units in MLP | 100 |
| Hidden units in MLP_LBL | 100 |
| BI-LSTM Layers | 2 |
| BI-LSTM Dimensions (hidden/output) | 125 / 125 |
| α (for word dropout) | 0.25 |
| p_agg (for exploration training) | 0.1 |

Table 2: Hyper-parameter values used in experiments

Main Results Table 1 lists the test-set accuracies of our best parsing models, compared to other state-of-the-art parsers from the literature.12

12 Unfortunately, many papers still report English parsing results on the deficient Yamada and Matsumoto head rules (PTB-YM) rather than the more modern Stanford-dependencies (PTB-SD). We note that the PTB-YM and PTB-SD results are not strictly comparable, and in our experience the PTB-YM results are usually about half a UAS point higher.

It is clear that our parsers are very competitive, despite using very simple parsing architectures and minimal feature extractors. When not using external embeddings, the first-order graph-based parser with 2 features outperforms all other systems that are not using external resources, including the third-order TurboParser. The greedy transition-based parser with 4 features also matches or outperforms most other parsers, including the beam-based transition parser with heavily engineered features of Zhang and Nivre (2011) and the Stack-LSTM parser of Dyer et al. (2015), as well as the same parser when trained using a dynamic oracle (Ballesteros et al., 2016). Moving from the simple (4 features) to the extended (11 features) feature set leads to some gains in accuracy for both English and Chinese.

Interestingly, when adding external word embeddings the accuracy of the graph-based parser degrades. We are not sure why this happens, and leave the exploration of effective semi-supervised parsing with the graph-based model for future work. The greedy parser does manage to benefit from the external embeddings, and using them we also see gains from moving from the simple to the extended feature set. Both feature sets result in very competitive results, with the extended feature set yielding the best reported results for Chinese, and ranked second for English, after the heavily-tuned beam-based parser of Weiss et al. (2015).

Additional Results We perform some ablation experiments in order to quantify the effect of the different components on our best models (Table 3).

| Model | PTB UAS | PTB LAS | CTB UAS | CTB LAS |
|---|---|---|---|---|
| Graph (no ext. emb) | 93.3 | 91.0 | 87.0 | 85.4 |
| –POS | 92.9 | 89.8 | 80.6 | 76.8 |
| –ArcLabeler | 92.7 | – | 86.2 | – |
| –Loss Aug. | 81.3 | 79.4 | 52.6 | 51.7 |
| Greedy (ext. emb) | 93.8 | 91.5 | 87.8 | 86.0 |
| –POS | 93.4 | 91.2 | 83.4 | 81.6 |
| –DynOracle | 93.5 | 91.4 | 87.5 | 85.9 |

Table 3: Ablation experiments results (dev set) for the graph-based parser without external embeddings and the greedy parser with external embeddings and extended feature set.

Loss augmented inference is crucial for the success of the graph-based parser, and the multi-task learning scheme for the arc-labeler contributes nicely to the unlabeled scores. Dynamic oracle training yields nice gains for both English and Chinese.

7 Conclusion

We presented a pleasingly effective approach for feature extraction for dependency parsing based on a BiLSTM encoder that is trained jointly with the parser, and demonstrated its effectiveness by integrating it into two simple parsing models: a greedy transition-based parser and a globally optimized first-order graph-based parser, yielding very competitive parsing accuracies in both cases.

Acknowledgements This research is supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and the Israeli Science Foundation (grant number 1555/15). We thank Lillian Lee for her important feedback and efforts invested in editing this paper. We also thank the reviewers for their valuable comments.

References

Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack-LSTM parser. CoRR, abs/1603.03793.
Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75, July.
Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. Association for Computational Linguistics.
Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4. To appear.
Kyunghyun Cho. 2015. Natural language understanding with distributed representation. CoRR, abs/1511.07916.
James Cross and Liang Huang. 2016. Incremental parsing with minimal features using bi-directional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August. Association for Computational Linguistics.
Marie-Catherine de Marneffe and Christopher D. Manning. 2008. Stanford dependencies manual. Technical report, Stanford University.
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.
Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In 16th International Conference on Computational Linguistics, Proceedings of the Conference, COLING 1996, Center for Sprogteknologi, Copenhagen, Denmark, August 5-9, 1996, pages 340–345.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING 2012, pages 959–976, Mumbai, India, December. The COLING 2012 Organizing Committee.
Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.
Yoav Goldberg. 2015. A primer on neural network models for natural language processing. CoRR, abs/1510.00726.
Alex Graves. 2008. Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1077–1086, Uppsala, Sweden, July. Association for Computational Linguistics.
Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 720–728, Doha, Qatar, October. Association for Computational Linguistics.
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, Beijing, China, July. Association for Computational Linguistics.
Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. CoRR, abs/1506.02078.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego, California.
Eliyahu Kiperwasser and Yoav Goldberg. 2016. Easy-first dependency parsing with hierarchical tree LSTMs. Transactions of the Association for Computational Linguistics, 4. To appear.
Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11, Uppsala, Sweden, July. Association for Computational Linguistics.
Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 595–603, Columbus, Ohio, June. Association for Computational Linguistics.
Sandra Kübler, Ryan T. McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Marco Kuhlmann, Carlos Gómez-Rodríguez, and Giorgio Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 673–682, Portland, Oregon, USA, June. Association for Computational Linguistics.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California, June. Association for Computational Linguistics.
Phong Le and Willem Zuidema. 2014. The inside-outside recursive neural network model for dependency parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 729–739, Doha, Qatar, October. Association for Computational Linguistics.
Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. 2006. A tutorial on energy-based learning. Predicting structured data, 1.
Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1381–1391, Baltimore, Maryland, June. Association for Computational Linguistics.
Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG parsing. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 221–231, San Diego, California, June. Association for Computational Linguistics.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Andre Martins, Noah A. Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 342–350, Suntec, Singapore, August. Association for Computational Linguistics.
Andre Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 617–622, Sofia, Bulgaria, August. Association for Computational Linguistics.
Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 91–98, Ann Arbor, Michigan, June. Association for Computational Linguistics.
Ryan McDonald. 2006. Discriminative Training and Spanning Tree Algorithms for Dependency Parsing. Ph.D. thesis, University of Pennsylvania.
Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Frank Keller, Stephen Clark, Matthew Crocker, and Mark Steedman, editors, Proceedings of the ACL Workshop Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain, July. Association for Computational Linguistics.
Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.
Wenzhe Pei, Tao Ge, and Baobao Chang. 2015. An effective neural network model for graph-based dependency parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 313–322, Beijing, China, July. Association for Computational Linguistics.
Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11):2673–2681.
Benjamin Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pages 896–903.
Hillel Taub-Tabib, Yoav Goldberg, and Amir Globerson. 2015. Template kernels for dependency parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1422–1427, Denver, Colorado, May–June. Association for Computational Linguistics.
Ivan Titov and James Henderson. 2007. A latent variable model for generative dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 144–155, Prague, Czech Republic, June. Association for Computational Linguistics.
Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (Short Papers), San Diego, California, June.
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2773–2781.
David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China, July. Association for Computational Linguistics.
Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571, Honolulu, Hawaii, October. Association for Computational Linguistics.
Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 188–193, Portland, Oregon, USA, June. Association for Computational Linguistics.
Chenxi Zhu, Xipeng Qiu, Xinchi Chen, and Xuanjing Huang. 2015. A re-ranking model for dependency parser with recursive convolutional neural network. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1159–1168, Beijing, China, July. Association for Computational Linguistics.
