Multi-Tagging For Transition-Based Dependency Parsing
Alexander Whillas
1 Introduction
In Natural Language Processing (NLP) two fundamental tasks are part-of-speech (POS)
tagging, which assigns lexical categories such as Verb or Noun to the words in a sentence, and
sentence parsing, which performs a grammatical analysis of the sentence, the output
of which is typically a parse tree representing the structure of the words. Typically the
POS tagger's output is fed into the parser as input, creating an NLP pipeline.
The parse trees that a parser produces depend on the grammar formalism that
is being used to model the language. These can be divided into two main categories:
phrase-structure grammars [Chomsky, 1957], which adhere to the constituency relation,
and dependency grammars [Tesnière, 1959], which are based on the dependency relation
(without intermediate constituents in the tree).
Dependency parsing produces a connected, directed acyclic graph (DAG)1 whose
nodes are all the words in a sentence, typically with the addition of an artificial
ROOT node, and whose arcs are the dependency relations. There are three types
of parsing method for producing these trees: transition-based methods, similar to finite
state automata; graph-based methods2, which find a maximum spanning tree (MST); and
grammar-based methods, which use a predefined projective dependency grammar similar
1. Also known as an arborescence.
2. Also called 'arc-factored' models.
to context-free grammars, and thus parsing becomes a constraint satisfaction problem
[Kübler et al., 2009].
To avoid over-fitting the training data, a regularisation prior term was added to the
objective function.
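The exact form of the objective is not given here; for a maximum-entropy (MEMM-style) model, the standard choice, assumed for illustration, is the log-likelihood with a Gaussian prior:

L(θ) = Σᵢ log p(tᵢ | hᵢ; θ) − (1/2σ²) Σⱼ θⱼ²

where the second term penalises large feature weights θⱼ and σ controls the strength of the regularisation.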
3 Data
The Penn Treebank (PTB) corpus has become the de facto standard data set
for comparing POS-tagging and parser performance. The problem with this data is
that it is not freely available to the general public, and the process of obtaining it via
official channels is also not straightforward. This is a problem for new and underfunded
4. See the Viterbi algorithm for the most probable state sequence.
5. Or tags in this case, of which there were 50.
researchers. Fortunately a free alternative corpus has just been released; albeit a third
of the size of the PTB, it was the corpus used for development.6
The corpus used was the English part of the Universal Dependencies version 1.1
[Agić et al., 2015], which was in turn built from the English Web Treebank [Silveira et al., 2014].
“The corpus comprises 254,830 words and 16,622 sentences, taken from various web
media including weblogs, newsgroups, emails, reviews, and Yahoo! answers”. Web
corpora usually have looser and more diverse grammar than commercial publications such as
the Wall Street Journal, and so are perhaps more challenging for NLP tasks.
Table 2: Pseudo-word classes that unseen words are mapped to, in order of precedence,
as per [Bikel et al., 1999]
4 Multi-tagging
The multi-tagging approach taken here is heavily influenced by [Curran et al., 2006],
in which a multi-tagging approach was used in a POS tagger and an intermediate
supertagger for a CCG7 parser, yielding a 1.6% word accuracy and almost 20% sentence
accuracy improvement by adding only an extra 1 tag in 20 words.
6. Not having access to the PTB means that the results are not comparable to previous work, but this
was not too important as the results only had to be compared with themselves.
We adopted their tag selection process, taking all tags for a word within a factor γ
of the most probable tag for that word.
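This selection rule is easy to state in code. A minimal sketch, assuming the tagger returns a per-word probability distribution over tags (the function and variable names are illustrative, not from the actual implementation):

```python
def select_tags(tag_probs, gamma):
    """Multi-tag selection: for each word, keep every tag whose
    probability is within a factor gamma of that word's most probable
    tag. gamma = 1.0 keeps only the best tag (plus ties); lower values
    admit progressively more ambiguity."""
    selected = []
    for dist in tag_probs:  # dist maps tag -> probability for one word
        best = max(dist.values())
        selected.append({t for t, p in dist.items()
                         if p > 0 and p >= gamma * best})
    return selected

# An ambiguous word keeps two tags at gamma = 0.1:
sentence_probs = [
    {"PRON": 0.99, "NOUN": 0.01},               # "I"
    {"VERB": 0.80, "NOUN": 0.15, "AUX": 0.05},  # "have"
]
print([sorted(tags) for tags in select_tags(sentence_probs, 0.1)])
# [['PRON'], ['NOUN', 'VERB']]
```

At γ = 1.0 this degenerates to ordinary single-tagging, which matches the 1.00 tags/word row of the results table.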
5 Dependency parsing
A dependency parser can be thought of as an abstract machine8 that takes as input a
sequence of words and maps them to a matching set of head indices (see Table 3).
Typically an artificial ROOT word is added, which takes the first (zero) index.
indices: 0    1 2    3     4     5 6   7      8  9    10  11       12   13
words:   ROOT I have never hated a man enough to give his diamonds back .
heads:   -    4 4    4     0     6 4   4      9  4    11  9        9    4
Projectivity is a property of a dependency tree that, put simply, means that the arcs of the
tree do not cross, as in Figure 1. We assume that all the dependency trees considered have
this property, and those that did not were filtered out of the training set. A non-projective
tree is shown in Figure 2. These made up about 4% of the corpus. Non-projectivity is
more prevalent in languages like Czech, but is not considered a problem in English.
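The no-crossing-arcs condition can be checked directly on the head-index representation of Table 3. A sketch (index 0 is the artificial ROOT; the names are illustrative):

```python
def is_projective(heads):
    """heads[i] is the head index of word i (heads[0] is a placeholder
    for ROOT). The tree is projective iff no two dependency arcs cross."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads[1:], start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first
            if l1 < l2 < r1 < r2:
                return False
    return True

# The sentence from Table 3 is projective:
print(is_projective([0, 4, 4, 4, 0, 6, 4, 4, 9, 4, 11, 9, 9, 4]))  # True
```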
[Figures 1 and 2: a projective and a non-projective dependency tree, with arc labels such as SBJ, ATT, PC, VC, TMP and PU.]
A transition-based dependency parser treats the parsing task as a search for an op-
timal transition sequence for a given sentence. The transition sequence is broken down
into a series of decisions from a small set of options (see Figure 3). At each step a classi-
fier is trained using an oracle that, given a gold parse tree, determines the optimal transition
sequence, and thus a correct choice can be known and trained against.
A transition system state is defined as a configuration c = (σ, β, A), where
σ is a stack, β is a buffer and A is a set of dependency arcs. Initially the system has
an empty stack and set of arcs, and the buffer is set to the words of the sentence to be
parsed, β = w1, w2, ..., wn, ROOT. The final terminal state is a configuration in
which the stack is empty and the buffer contains only ROOT9.
An Arc-Eager [Goldberg and Nivre, 2013] transition system has four transitions:
SHIFT[(σ, b|β, A)] = (σ|b, β, A)
RIGHT[(σ|s, b|β, A)] = (σ|s|b, β, A ∪ {(s, b)})
LEFT[(σ|s, b|β, A)] = (σ, b|β, A ∪ {(b, s)})
REDUCE[(σ|s, β, A)] = (σ, β, A)
9. And A contains the set of arcs of the parse tree.
There is a precondition on the RIGHT and LEFT transitions that b ≠ ROOT, and
also the stack σ must not be empty. LEFT is also only legal if s does not have a
parent in the existing arcs A, and REDUCE requires that s have a parent in A. This system
collects all of a word's left dependants first, then its right dependants, and is “eager” because it
adds arcs as early as possible10.
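The four transitions can be sketched with plain lists and a set of (head, dependent) arcs. This is an illustrative sketch, not the parser code used here; the legality preconditions are omitted for brevity:

```python
class ArcEager:
    """Minimal sketch of the arc-eager transitions. A configuration is
    (stack, buffer, arcs); ROOT sits at the end of the buffer."""

    @staticmethod
    def shift(stack, buffer, arcs):
        return stack + [buffer[0]], buffer[1:], arcs

    @staticmethod
    def right(stack, buffer, arcs):
        s, b = stack[-1], buffer[0]        # adds arc (s, b), pushes b
        return stack + [b], buffer[1:], arcs | {(s, b)}

    @staticmethod
    def left(stack, buffer, arcs):
        s, b = stack[-1], buffer[0]        # adds arc (b, s), pops s
        return stack[:-1], buffer, arcs | {(b, s)}

    @staticmethod
    def reduce(stack, buffer, arcs):
        return stack[:-1], buffer, arcs    # pops an attached word

ROOT = 0
# Parse a 3-word sentence whose root is word 2, with dependents 1 and 3:
config = ([], [1, 2, 3, ROOT], set())
config = ArcEager.shift(*config)    # stack: [1]
config = ArcEager.left(*config)     # arc (2, 1)
config = ArcEager.shift(*config)    # stack: [2]
config = ArcEager.right(*config)    # arc (2, 3)
config = ArcEager.reduce(*config)   # pop 3
config = ArcEager.left(*config)     # arc (ROOT, 2): root attachment
stack, buffer, arcs = config
print(stack, buffer, sorted(arcs))  # [] [0] [(0, 2), (2, 1), (2, 3)]
```

The final configuration is terminal: an empty stack and a buffer holding only ROOT.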
Dependency relations can also include labels to indicate the nature of the relation,
such as 'subject' or 'object'. The Arc-Eager system defined above can be extended to
include labels, which simply requires a new set of LEFT and RIGHT transitions
for each label. All the work here can be trivially extended to this case.
The averaged perceptron classifier was introduced to NLP in [Collins, 2002]. It has be-
come the dominant classifier for transition-based dependency parsing because it is a
simple design and yet very effective. It is also linear to train and predict with, while
maintaining a competitive accuracy. It works the same way as the classical perceptron
algorithm, updating the weights of features on bad predictions: incrementing those that
lead to the right prediction and decrementing those that gave the wrong answer. The
main difference is that at the end of the training cycle the average of each weight's history
over all the training examples is taken.
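A compact sketch of the idea follows (naive averaging for clarity; real implementations use lazy updates, and all names here are illustrative, not from the actual parser):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Sketch of an averaged perceptron. Weights change only on wrong
    predictions; a running total lets the final weights be the mean
    over all training steps, which reduces overfitting to late examples."""

    def __init__(self, classes):
        self.classes = classes
        self.weights = defaultdict(float)   # (feature, class) -> weight
        self.totals = defaultdict(float)    # running sums for averaging
        self.steps = 0

    def score(self, features, cls):
        return sum(self.weights[(f, cls)] for f in features)

    def predict(self, features):
        return max(self.classes, key=lambda c: self.score(features, c))

    def update(self, features, gold):
        self.steps += 1
        guess = self.predict(features)
        if guess != gold:
            for f in features:
                self.weights[(f, gold)] += 1.0
                self.weights[(f, guess)] -= 1.0
        # accumulate for the average (a real implementation would be lazy
        # rather than touching every weight on every step)
        for key, w in self.weights.items():
            self.totals[key] += w

    def average(self):
        for key in self.weights:
            self.weights[key] = self.totals[key] / self.steps

model = AveragedPerceptron(["VBD", "RB"])
for _ in range(5):
    model.update(["bias", "suffix=ed"], "VBD")
    model.update(["bias", "suffix=ly"], "RB")
model.average()
print(model.predict(["suffix=ed"]))  # VBD
```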
A dynamic oracle, introduced by [Goldberg and Nivre, 2012], is used at training time
to predict the optimal transition sequence given any previous transition history, includ-
ing those that deviate from the gold11 parse tree. This is done non-deterministically,
because a set of transitions is returned for a given parse state and gold tree. This allows
a greedy parser to learn how to recover from mistakes, reduces the effects of error
propagation and results in better parse accuracy.
6 Experiments
The data was split into three sets: a large training set, a cross-validation set and a testing
set from which the final accuracies are drawn.
The tag dictionary, which recorded words and their corresponding tag frequencies,
was used as a baseline, with unknown words using the highest-scoring class they fell
into. This gave a baseline tagging accuracy of 84%.
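The baseline amounts to a frequency lookup. A sketch, with a single hypothetical default class standing in for the pseudo-word classes of Table 2:

```python
from collections import Counter, defaultdict

def build_tag_dictionary(tagged_sentences):
    """Record how often each word takes each tag in training, then map
    each word to its most frequent tag."""
    freq = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            freq[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in freq.items()}

def baseline_tag(words, tag_dict, default="NOUN"):
    # unseen words fall back to a default class here; the full system
    # maps them to pseudo-word classes as in Table 2
    return [tag_dict.get(w, default) for w in words]

train = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
         [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
tag_dict = build_tag_dictionary(train)
print(baseline_tag(["the", "aardvark", "sleeps"], tag_dict))
# ['DET', 'NOUN', 'VERB']
```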
10. Unlike the Arc-Standard model, which lacks the REDUCE transition and builds its trees
bottom-up, i.e. each word collects its dependants before attaching itself to its head word.
11. “Gold” standard, i.e. the best, most reliable data.
The MEMM model was trained on the training set and then its regularisation pa-
rameter was tuned using the cross-validation set. A value of 0.66 was settled on (see Table 4).
The training data was multi-tagged using the MEMM model with 10-fold cross-
validation: each leave-one-out set was multi-tagged and saved for use in the parser
training. The dynamic oracle of the transition parser requires tagged data that has the
mistakes the tagger is likely to make at test time, so it can learn to recover from transition
histories that incorporate mistakes. This has been shown [Goldberg and Nivre, 2012] to
give a 1.5-3% improvement in accuracy.
Distributed processing was employed, as the MEMM tagger took over 26 hours to run
on 2000 sentences. Doing this ten times for the 10-fold cross-validation was thus not
feasible without distributed processing power. Training and tagging of the training set
was done on 49 computers, each with 4 cores12.
6.1 Parsing
The final test set was made up of 96.34% projective trees. The parses were scored using the
standard unlabelled attachment score (UAS), which simply counts the number of correct
head dependencies returned by the parser. The sentence accuracy was also counted,
which requires the parser to get the dependencies right for a whole sentence.
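Both metrics reduce to comparing head indices. A sketch (illustrative names, not the evaluation code used here; ROOT is excluded from the lists):

```python
def attachment_scores(gold_heads, pred_heads):
    """Unlabelled attachment score (UAS): the fraction of words assigned
    the correct head, plus whole-sentence accuracy. Inputs are lists of
    head-index lists, one per sentence."""
    correct = total = exact = 0
    for gold, pred in zip(gold_heads, pred_heads):
        hits = sum(g == p for g, p in zip(gold, pred))
        correct += hits
        total += len(gold)
        exact += (hits == len(gold))
    return correct / total, exact / len(gold_heads)

gold = [[2, 0, 2], [2, 0]]
pred = [[2, 0, 1], [2, 0]]   # one wrong head in the first sentence
uas, sent = attachment_scores(gold, pred)
print(round(uas, 2), sent)  # 0.8 0.5
```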
The parser used a MEMM POS tagger trained on the full training set, which
achieved 89.94% word accuracy and 44.63% sentence accuracy on the testing set. The
values for the parser accuracy here are well under the state of the art and have not been
tuned.
12. This was possible thanks to the dispy Python module: http://dispy.sourceforge.net/
         Ambiguity    Baseline           Multi-tagging      Difference
γ    tags/word    UAS    Sent. UAS    UAS    Sent. UAS    Word    Sent.
0.0 1.92 78.86 46.41 79.34 47.28 0.48 0.87
0.1 1.40 79.08 46.75 79.38 47.21 0.30 0.46
0.2 1.31 79.09 46.88 79.35 46.97 0.26 0.09
0.3 1.27 79.11 46.92 79.38 46.89 0.27 -0.03
0.4 1.24 79.13 46.94 79.31 46.81 0.18 -0.13
0.5 1.22 79.04 46.82 79.28 46.79 0.24 -0.03
0.6 1.21 79.06 46.81 79.28 46.81 0.22 0.00
0.7 1.19 79.07 46.78 79.26 46.74 0.19 -0.04
0.8 1.18 79.07 46.83 79.24 46.69 0.17 -0.14
0.9 1.17 79.07 46.73 79.21 46.66 0.14 -0.07
1.0 1.00 79.08 46.73 79.21 46.70 0.13 -0.03
Table 5: Unlabelled attachment score (UAS) per word and per sentence, both for the base-
line without multi-tags and with multi-tagging. The difference columns give the difference
between the two sets.
7 Conclusion
The results listed in Table 5 indicate that as the number of tags per word increases (i.e.
as γ approaches zero and ambiguity increases), the UAS increases as well.
The MEMM tagger was also below the published 96% accuracy. Some of this is
the result of the data used here being web data and only about a third the size of the
Penn Wall Street Journal treebank. Due to the sparsity of the features in the MEMM
model, they tend to benefit from more data.
Another problem with the MEMM tagger used here is that its run time is much
slower than that of the parser it generates multiple tags for, i.e. polynomial versus lin-
ear. Regardless, the aim of the experiment was to see if multi-tagging features have
some benefit for a feature-based dependency parser. Other, faster methods of generating
probability distributions could be explored in future work, such as the SoftMax layer of
[Weiss et al., 2015], for example.
The results were not as significant as those reported by [Curran et al., 2006], which
inspired this work. There, a probabilistic chart parser, the CYK algorithm with a
Probabilistic Context-Free Grammar (PCFG), generates a packed chart
of all possible parses over the given tagged sentences [Steedman, 2000]. These are then
ranked by a log-linear model similar to a MEMM. Since that parser used prob-
abilities in its calculations, the effect of injecting real-valued features would have been
more prominent.
This small experiment shows that there is potential along this line of enquiry if
multi-tagging can be done in a more efficient manner.
Acknowledgements
I would like to thank Alan Blair for supervising this project, and also Matthew Honnibal
for the initial idea behind this line of enquiry and the use of his dependency parser code.
References
Agić et al., 2015. Agić, Ž., Aranzabe, M. J., Atutxa, A., Bosco, C., Choi, J., de Marneffe, M.-C.,
Dozat, T., Farkas, R., Foster, J., Ginter, F., Goenaga, I., Gojenola, K., Goldberg, Y., Hajič, J.,
Johannsen, A. T., Kanerva, J., Kuokkala, J., Laippala, V., Lenci, A., Lindén, K., Ljubešić, N.,
Lynn, T., Manning, C., Martínez, H. A., McDonald, R., Missilä, A., Montemagni, S., Nivre,
J., Nurmi, H., Osenova, P., Petrov, S., Piitulainen, J., Plank, B., Prokopidis, P., Pyysalo, S.,
Seeker, W., Seraji, M., Silveira, N., Simi, M., Simov, K., Smith, A., Tsarfaty, R., Vincze, V.,
and Zeman, D. (2015). Universal dependencies 1.1.
Bikel et al., 1999. Bikel, D., Schwartz, R., and Weischedel, R. (1999). An algorithm that learns
what’s in a name. Machine Learning, 34(1-3):211–231.
Charniak et al., 1996. Charniak, E., Carroll, G., Adcock, J., Cassandra, A., Gotoh, Y., Katz, J.,
Littman, M., and McCann, J. (1996). Taggers for parsers. Artificial Intelligence, 85(1):45–57.
Chomsky, 1957. Chomsky, N. (1957). Syntactic structures. Walter de Gruyter.
Collins, 2002. Collins, M. (2002). Discriminative training methods for hidden markov models:
Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference
on Empirical methods in natural language processing-Volume 10, pages 1–8. Association for
Computational Linguistics.
Curran et al., 2006. Curran, J. R., Clark, S., and Vadas, D. (2006). Multi-tagging for lexicalized-
grammar parsing. In Proceedings of the 21st International Conference on Computational Lin-
guistics and the 44th annual meeting of the Association for Computational Linguistics, pages
697–704. Association for Computational Linguistics.
Goldberg and Nivre, 2012. Goldberg, Y. and Nivre, J. (2012). A dynamic oracle for arc-eager
dependency parsing. In COLING, pages 959–976.
Goldberg and Nivre, 2013. Goldberg, Y. and Nivre, J. (2013). Training deterministic parsers
with non-deterministic oracles. Transactions of the Association for Computational Linguistics,
1:403–414.
Kübler et al., 2009. Kübler, S., McDonald, R., and Nivre, J. (2009). Dependency parsing. Syn-
thesis Lectures on Human Language Technologies, 1(1):1–127.
Manning, 2011. Manning, C. D. (2011). Part-of-speech tagging from 97% to 100%: is it time
for some linguistics? In Computational Linguistics and Intelligent Text Processing, pages 171–
189. Springer.
Rabiner, 1989. Rabiner, L. (1989). A tutorial on hidden markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Ratnaparkhi et al., 1996. Ratnaparkhi, A. et al. (1996). A maximum entropy model for part-of-
speech tagging. In Proceedings of the conference on empirical methods in natural language
processing, volume 1, pages 133–142. Philadelphia, USA.
Silveira et al., 2014. Silveira, N., Dozat, T., de Marneffe, M.-C., Bowman, S., Connor, M., Bauer,
J., and Manning, C. D. (2014). A gold standard dependency corpus for English. In Proceedings
of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).
Steedman, 2000. Steedman, M. (2000). The syntactic process, volume 24. MIT Press.
Tesnière, 1959. Tesnière, L. (1959). Eléments de syntaxe structurale. Librairie C. Klincksieck.
Weiss et al., 2015. Weiss, D., Alberti, C., Collins, M., and Petrov, S. (2015). Structured training
for neural network transition-based parsing. arXiv preprint arXiv:1506.06158.