Learning Morphological Normalization
Franck Burlot, François Yvon
Abstract
When translating between a morphologically rich language (MRL) and English, word forms
in the MRL often encode grammatical information that is irrelevant with respect to English,
leading to data sparsity issues. This problem can be mitigated by removing the irrelevant
information from the MRL through normalization. Such preprocessing is usually performed in a
deterministic fashion, using hand-crafted rules and yielding suboptimal representations. We
introduce here a simple way to automatically compute an appropriate normalization of the
MRL and show that it can improve machine translation in both directions.
1. Introduction

Translating between a morphologically rich language (MRL) and English is made difficult by
the large number of word forms on the MRL side, which encode grammatical information that is
often irrelevant with respect to English. This causes two main problems:
• The increase of word forms in the MRL means that each form has a smaller
occurrence count than its English counterpart(s), yielding poor probability es-
timates for infrequent words;
• An even more extreme case is the translation of word forms unseen in train-
ing. Even if other forms of the same lemma are known, the MT system cannot
generalize and will produce an erroneous output.
A well-known way to mitigate this problem is to “simplify” the MRL by remov-
ing information that is deemed redundant with respect to English. This solution has
been repeatedly used to translate from the MRL (e.g. in (Ney and Popovic, 2004; Dur-
gar El-Kahlout and Yvon, 2010) for German, (Goldwater and McClosky, 2005) for
Czech), and is adopted in recent systems competing at WMT (e.g. (Allauzen et al.,
2016; Lo et al., 2016) for Russian), as well as in the reverse direction (Minkov et al.,
2007; Toutanova et al., 2008; Fraser et al., 2012) with the additional complexity that
the simplified MT output needs to be augmented with the missing information (“re-
inflected” in the MT jargon). One downside of these procedures is that they are en-
tirely dependent on the language pairs under study, and rely on hand-crafted rules
that need to be adapted for each new language. It is also likely that rule-based nor-
malization is suboptimal with respect to the task, as it does not take the peculiarities
of the training data into account.
We introduce (Section 3) a new way to automatically perform such normalization,
by clustering together MRL forms. Clustering is performed on a per-lemma basis
and groups together morphological variants that tend to translate into the same tar-
get word(s). We show in Section 4 that this normalization helps when translating into
English. A second contribution is a new neural reinflection system, which is crucially
able to also take advantage of source-side information, yielding significant improve-
ments when translating into a MRL (Section 5).
2. Related Work
The normalization of the vocabulary on the MRL side mostly consists in remov-
ing word information that is deemed redundant with respect to English. Most of the
time, normalization relies on expert knowledge specifying which MRL words can be
merged without generating confusion in English (see e.g. (Ney and Popovic, 2004;
Goldwater and McClosky, 2005; Durgar El-Kahlout and Yvon, 2010)). An alternative,
which does not require user expertise, is introduced by Talbot and Osborne (2006),
who propose to use model selection techniques to identify useful clusters in the
MRL vocabulary. Even though we start from the same intuition (to cluster forms hav-
ing similar translation distributions), our model is much simpler and more explicitly
oriented toward morphological variation, which also makes it easier to invert.
The same kind of solution is also useful when translating in the reverse direc-
tion; it additionally requires a two-step MT architecture addressing morphology as
a post-processing step. Minkov et al. (2007) and Toutanova et al. (2008) translate from
English into Russian and Arabic stems, which are used to generate full paradigms,
then disambiguated using a classifier. Similarly, Chahuneau et al. (2013) augment the
translation model with synthetic phrases obtained by re-inflecting target stems. Bojar
(2007) cascades two statistical MT systems: the first one translates from English into
Czech lemmas decorated with source-side information and the second one performs
a monotone translation into fully inflected Czech.
Fraser et al. (2012) represent German words as lemmas followed by a sequence of
tags and introduce a linguistically motivated selection of these in order to translate
from English. The second step consists in predicting the tags that have been previ-
ously removed, using a dedicated model for each morphological attribute. Finally,
word forms are produced by looking them up in a morphological dictionary. El Kholy and
Habash (2012a; 2012b) propose a similar approach for Arabic.
3. Source-side Clustering
Our goal is to cluster together MRL forms that translate into the same target word(s).
We assume that each MRL form f is a combination of a lemma, a part of speech (PoS)
and a sequence of morphological tags,2 and that a word aligned parallel corpus is
available, from which lexical translation probabilities p(e|f) and unigram probabili-
ties p(f) can be readily computed. We first consider the simple case where the corpus
contains a single lemma for each PoS. We denote by f the set of word forms
(or, equivalently, of positions in the paradigm) for this lemma, and by E the complete En-
glish vocabulary. The conditional entropy (CE) of the translation model is:
H(E | \mathbf{f}) = \sum_{f \in \mathbf{f}} p(f)\, H(E | f) = - \sum_{f \in \mathbf{f}} \frac{p(f)}{\log_2 |E_{a_f}|} \sum_{e \in E_{a_f}} p(e | f) \log_2 p(e | f),    (1)
where E_{a_f} is the set of words aligned with f. The normalizer log_2 |E_{a_f}| ensures that
all the entropy values are comparable, no matter the number of aligned target words.
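For concreteness, the per-form entropy term of Equation (1) can be computed directly from the lexical translation table, as in the following Python sketch; the function name and the input format are illustrative assumptions, not part of the described setup.

import math

def normalized_entropy(p_e_given_f):
    # p_e_given_f maps each aligned English word e to p(e|f) (assumed input format);
    # dividing by log2 |E_{a_f}| makes values comparable across forms, as in Eq. (1)
    probs = [p for p in p_e_given_f.values() if p > 0]
    if len(probs) <= 1:
        return 0.0                      # a single aligned word: no uncertainty
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(probs))

# The corpus-level CE of Eq. (1) is then the sum of p(f) * normalized_entropy(...) over all forms f.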
From an initial state where each form is a singleton cluster, and proceeding bottom-
up, we repeatedly try to merge cluster pairs f1 and f2 so as to reduce the CE. We
therefore compute the information gain (IG) of the merge operation, given in Equation (2).
2 For instance, the Czech autem (by car) is represented as: auto + Noun + neutral + singular + instrumental.
In Equation (2), f′ denotes the resulting aggregate. IG (∈ [−1, +1]) measures the difference between
the combined CEs of clusters f1 and f2 before and after merging in f′ . If the corre-
sponding forms have similar translation distributions, the information gain is pos-
itive; conversely when their translations are different, it is negative and the merge
leads to a loss of information. Note that the total entropy H(E|f) of the translation
model can be recomputed incrementally after merging (f1, f2), as expressed in Equation (3).
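Since Equation (1) defines H(E | f) as a sum of per-form contributions, this update simply replaces the contributions of f1 and f2 by that of the merged cluster f′; a sketch of this update (whose exact presentation in Equation (3) may differ) is:

H_{new}(E | \mathbf{f}) = H(E | \mathbf{f}) - p(f_1)\, H(E | f_1) - p(f_2)\, H(E | f_2) + p(f')\, H(E | f'),

with p(f') = p(f_1) + p(f_2) and p(e | f') = \frac{p(f_1)\, p(e | f_1) + p(f_2)\, p(e | f_2)}{p(f')}.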
IG can also be interpreted as a measure of similarity between two word forms and
can be readily used in any clustering model, such as k-means. Doing so would however
require fixing the total number of clusters in advance, which we would rather determine
based on the available data. We have therefore opted for an agglomerative clustering
procedure, which we now fully describe.
In practice, our algorithm is applied at the level of PoS, rather than individual
lemmas: we therefore assume that for a given PoS p, all lemmas have the same number
n_p of possible morphological variants (cells in their paradigm). This means that IG
computations will be aggregated over all lemmas of a given PoS, based on statistics
maintained on a per lemma basis. For each lemma of PoS p, the starting point is a
matrix L_l ∈ [−1, +1]^{n_p × n_p}, with L_l(i, j) the IG resulting from merging forms l_i and l_j
of lemma l. The average of these matrices over all lemmas defines the PoS-level matrix
M_p ∈ [−1, +1]^{n_p × n_p}, containing the average information gain resulting from merging
two cells.
Algorithm 1: A bottom-up clustering algorithm
 1  C(p) ← {1, ..., n_p}
 2  i, j ← arg max_{i', j' ∈ C(p)^2} M_p(i', j')
 3  repeat
 4      Merge i and j in C(p)
 5      for l ∈ V_lem do
 6          Remove L_l(i, j), create L_l(ij)
 7          Compute p(ij), p(E|ij) and H(E|ij)
 8          Compute L_l(ij, k) for k ∈ C(p)
 9      M_p ← Σ_{l ∈ V_lem} L_l
10      i, j ← arg max_{i', j' ∈ C(p)^2} M_p(i', j')
11  until M_p(i, j) < m or |C(p)| = 1
After each merge (Algorithm 1, lines 5–8), the statistics for the new cluster (unigram probability,
translation probability and entropy) are recomputed for all lemmas and used to update the PoS-level IG matrix M_p.
When the procedure halts, a clustering C(p) is obtained for PoS p, which can then be
applied to normalize the source data in various ways (see Section 4.3).
In practice, we obtained slightly better results and a much shorter runtime than with the
exact computation of Algorithm 1 by using an alternative update regime for the IG matrix
M_p, which dispenses with the costly update of all the matrices L_l (lines 5–8). Once
initialized, M_p is treated like a similarity matrix and updated using a procedure rem-
iniscent of the linkage clustering algorithm. The aggregated matrix cell for clusters c_1
and c_2 is thus computed as the average IG of all possible 2-way merging operations:
M_p(c_1, c_2) = \frac{\sum_{f_1 \in c_1} \sum_{f_2 \in c_2} M_p(f_1, f_2)}{|c_1| \times |c_2|}.    (4)
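To make the procedure concrete, a minimal Python sketch of this approximate variant is given below; the function and variable names, the use of numpy and details such as tie-breaking are illustrative choices rather than the exact implementation.

import numpy as np

def cluster_cells(M, m=0.0):
    # M: initial n_p x n_p PoS-level IG matrix; m: minimum IG threshold of Algorithm 1
    clusters = [[i] for i in range(M.shape[0])]          # C(p) starts as singleton cells

    def sim(c1, c2):                                     # Eq. (4): average IG over all pairs
        return np.mean([M[i, j] for i in c1 for j in c2])

    while len(clusters) > 1:
        best, (a, b) = max(
            (sim(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        if best < m:                                     # the best remaining merge loses information
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters                                      # each cluster groups paradigm cells

Each word form of the PoS can then be rewritten as its lemma plus the identifier of the cluster containing its paradigm cell.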
We assess the normalization model on MT tasks for three language pairs in both
directions: Czech-English, Russian-English and Czech-French; note that the latter in-
volves two MRLs.
Tokenization of English and French uses in-house tools; we used the tokenizer script from
the Moses toolkit (Koehn et al., 2007) for Czech and TreeTagger (Schmid, 1994) for
Russian. The MT models are trained using Moses with various datasets from WMT
20163 (Table 1). 4-gram language models were trained with KenLM (Heafield, 2011)
over the monolingual datasets. These systems are optimized with KB-MIRA (Cherry
and Foster, 2012) using WMT Newstest-2015 and tested on Newstest-2016. The Czech-
French systems were tuned on Newstest-2014 and tested on Newstest-2013.
3 http://www.statmt.org/wmt16
Lemmas and morphological tags were obtained with Morphodita (Straková et al., 2014) for Czech
and with TreeTagger (Schmid, 1994; Sharoff et al., 2008) for Russian. Filtering the MRL lemmas
when performing clustering yields better results: we excluded lemmas appearing fewer than 100
times, as well as word forms occurring fewer than 10 times in the training set, in order
to mitigate the noise in the initial alignments. When clustering paradigm cells (see
Section 3.2), we set the minimum IG value m = 0.
4.3. MT experiments
The results for all Czech systems are in Table 2, which reports several ways of applying
the normalization model. Normalization can be used to train the alignments (ali cx),
the full system (cx2en), or both, the latter option yielding a total improvement of
1.36 BLEU in the Small setup. Using it only for the alignments or only for the MT sys-
tem gives worse results, while still outperforming the baseline (cs2en). This shows that both
tasks take advantage of the source normalization. Another way to apply the cluster-
ing model is to exclude the 100 most frequent lemmas from normalization (100 freq),
which gives the best result for this setup. For the other direction (en2cs), the Czech
normalization was used to train the alignments and gives only a slight improvement
over the baseline. Results for the translation into normalized Czech (en2cx) after a
reinflection step are reported in Section 5.2.
The same tendency holds for the Larger Czech-English system, even though the
contrasts in BLEU scores are slightly less visible, due to the larger amount of training
data, which reduces sparsity. For this setup, we have also tried different values of the
minimum IG m (see Section 3.2). Our results suggest that the optimal value for m is
close to 0. Indeed, higher values produce more clusters, which leads to more OOVs
(1761 OOVs for m = 10^{-4} vs. 1604 for m = -10^{-4}), thus hurting the overall performance.
In the Largest Czech-English setup, using normalization to train both the align-
ments and the translation system hurts the performance (-0.43 BLEU). On the other
hand, using it only to train the alignments does give a small improvement. In the re-
verse direction (en2cs), training the alignments over normalized Czech does not give
any significant improvement.
Results for a manual normalization (manual) are also reported. The normalization
rules are close to the ones used in (Burlot et al., 2016), where nouns are distinguished by
number and negation, adjectives by negation and degree of comparison, etc. We also
applied rules for verbs, which are clustered according to tense and negation, except for the
third person singular present tense, which is kept as a separate form. This manual normalization
improves over the baseline (+0.61), but not as much as our best system (+1.00).
The results for Russian-English follow the same tendency as Czech-English, except
that keeping the word forms for the 100 most frequent lemmas did not improve over
the full normalization of the training set. Finally, we note in Table 3 that the Czech nor-
malization towards French also helps to improve the translation, even though the tar-
get language is morphologically richer than English. The improvements are smaller,
though, than when translating into English. We assume that this is due to the degree of
normalization being lower when the source shares certain properties with the target,
such as adjective inflection, which leads our model to create more classes. Indeed, the
model distinguishes nouns by their number, just as with English, but additionally cre-
ates separate clusters for each adjective gender. This reduced degree of normalization
did not help the training of the alignments when translating into Czech (fr2cs).
5. Morphological Reinflection
When translating into a MRL, using normalization to train just the alignments did
not prove very helpful (Section 4.3). We now consider using it for the complete trans-
lation system. Translating from English into fully inflected Czech however requires
a non-trivial post-processing step for reinflection. In this section, we introduce our
solution to this problem and provide results for several English-Czech systems.
We cast reinflection as a sequence labeling task, in which each normalized Czech word
is assigned a fine-grained morphological tag from which the fully inflected form is then
generated. For this task, we used a bidirectional recurrent neural network
(RNN) that considers both the normalized Czech words and the source-side English
tokens to make its predictions (see Figure 1). It computes a probability distribution
over the output PoS tags y_t at time t, given both the Czech (f) and the English (e)
sentences, as well as the previous prediction: p(y_t | f, e, y_{t-1}).
For each word f_t in the Czech sentence, we need to encode the English words that
generated f_t during the translation process. As there can be an arbitrary number of
them (denoted I_t below), we used a RNN layer,4 where each state S_i inputs a source to-
ken representation a_{t,i} and the previous hidden state S_{i-1}. The last state (at time I_t) of
that layer is used to represent the sequence of aligned tokens: S_{t,I_t} = A(S_{t,I_t-1}, a_{t,I_t}).
Each normalized Czech word representation is decomposed into a lemma em-
bedding l_t and a cluster embedding c_t, which are represented in distinct continuous
spaces. These vectors are concatenated with the source representation S_{t,I_t}, defining
the input to the bidirectional RNN performing PoS tagging. A forward-layer hidden
state at time t is therefore computed as \overrightarrow{H}_t = R(\overrightarrow{H}_{t-1}, [S_{t,I_t}; l_t; c_t]). Finally, both
forward and backward layers are concatenated with the representation of the preced-
ing PoS tag y_{t-1}, and the result is passed through a last feed-forward layer to which
a softmax is applied. All the model parameters, including embeddings, are trained
jointly in order to maximize the log-likelihood of the training data.
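For illustration, a minimal PyTorch sketch of such an architecture is given below; the cell type (GRU), the dimensions and all names are illustrative assumptions rather than the exact configuration used here, and padding or masking of the aligned source sequences is omitted for brevity.

import torch
import torch.nn as nn

class ReinflectionTagger(nn.Module):
    def __init__(self, n_lemmas, n_clusters, n_src, n_tags, dim=128):
        super().__init__()
        self.lemma_emb = nn.Embedding(n_lemmas, dim)      # lemma embedding l_t
        self.cluster_emb = nn.Embedding(n_clusters, dim)  # cluster embedding c_t
        self.src_emb = nn.Embedding(n_src, dim)           # English token embedding a_{t,i}
        self.tag_emb = nn.Embedding(n_tags, dim)          # previous-tag embedding y_{t-1}
        self.src_rnn = nn.GRU(dim, dim, batch_first=True)            # encodes aligned tokens
        self.bi_rnn = nn.GRU(3 * dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim + dim, n_tags)       # final feed-forward layer

    def forward(self, lemmas, clusters, aligned_src, prev_tags):
        # lemmas, clusters, prev_tags: (B, T); aligned_src: (B, T, I) source tokens per position
        B, T, I = aligned_src.shape
        src = self.src_emb(aligned_src).view(B * T, I, -1)
        _, s_last = self.src_rnn(src)                     # last state S_{t,I_t} of the source RNN
        s_last = s_last.squeeze(0).view(B, T, -1)
        x = torch.cat([s_last, self.lemma_emb(lemmas), self.cluster_emb(clusters)], dim=-1)
        h, _ = self.bi_rnn(x)                             # forward and backward hidden states
        feats = torch.cat([h, self.tag_emb(prev_tags)], dim=-1)
        return self.out(feats).log_softmax(dim=-1)        # log p(y_t | f, e, y_{t-1})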
The reinflection systems introduced in this section were trained with the parallel
English-Czech data used for the Small setup (News-Commentary). The fine-grained
PoS tags are the same as the ones used to train the normalization in Section 4 (Straková
et al., 2014). The word alignments used for the training and validation sets were ob-
tained with fast_align (Dyer et al., 2013). At test time, we use the alignments produced
by the MT decoder. Since the Czech side of the parallel data must be normalized prior
to training, the results below were obtained with two versions of the RNN model:
with the Small data normalization and with the Larger data one (see Section 4).
Each normalized Czech word is associated with a sequence of source English words
that we collect as follows: using word alignments, we take the English words that are
linked to the current position, as well as surrounding unaligned words. These un-
aligned words often contain essential information: as shown in (Burlot and Yvon,
2015), many of them have a grammatical content that is helpful to predict the correct
inflection on the target side. For instance, the English preposition of is an important
predictor of the Czech genitive case. This type of grammatical information is the only
one that matters for this task, since the lexical content of the Czech words is already
computed by the MT system and cannot be changed. In fact, replacing the English
content words by their PoS and keeping only the words in a list of stopwords proved to
work better than keeping all the words. Decoding uses a beam search of size 5, and
the final lookup uses the Morphodita morphological generator.

4 Encoding the sequence of aligned tokens with a "bag of words" model, where we just sum the embeddings, …
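As an illustration, the collection of source tokens described above can be sketched as follows; the data structures and the exact definition of "surrounding" unaligned words (here, the immediate neighbors of aligned positions) are assumptions on our side.

def source_context(t, alignment, src_tokens, src_pos, stopwords):
    # alignment: dict mapping each target position to the set of aligned source positions
    aligned = sorted(alignment.get(t, set()))
    covered = set().union(*alignment.values()) if alignment else set()
    keep = set(aligned)
    for i in aligned:                      # also take unaligned words next to the aligned ones
        for j in (i - 1, i + 1):
            if 0 <= j < len(src_tokens) and j not in covered:
                keep.add(j)
    # keep stopwords as such, abstract content words into their PoS tag
    return [src_tokens[j] if src_tokens[j].lower() in stopwords else src_pos[j]
            for j in sorted(keep)]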
We consider here three English-Czech MT systems with reinflection. The training
data is the same as the Small, Larger and Largest systems described in Section 4,
except that the Czech target side is now normalized. The reinflection model can also
be used in different ways. One can use it to process the one-best hypothesis of the
MT system, or the n-best hypotheses (n = 300 in our experiments). A third approach
reinflects n-best lists and outputs k-best hypotheses from the reinflection model (k = 5
in our experiments). These are finally scored by a language model trained on the same
data as the one used in the MT system – albeit with fully inflected words. This score
is added to the ones given by the MT system. With nk-best reinflection, we also add
the scores given by the reinflection model (log-probability of the predicted sequence).
All these scores are finally interpolated with weights tuned by MIRA on the Newstest-2015
set, so as to produce a single best output sentence.
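For concreteness, the score combination over an nk-best list can be sketched as follows; the names, data structures and weight vector are illustrative, and the weights stand for those tuned with MIRA.

def rescore_nk_best(nbest, reinflect, lm_score, w, k=5):
    # nbest: MT hypotheses with their model score; reinflect returns k (sentence, logprob) pairs
    candidates = []
    for hyp_tokens, mt_score in nbest:
        for sentence, rnn_logprob in reinflect(hyp_tokens, k=k):
            total = (w["mt"] * mt_score
                     + w["rnn"] * rnn_logprob            # reinflection model score (nk-best only)
                     + w["lm"] * lm_score(sentence))     # LM over fully inflected words
            candidates.append((total, sentence))
    best_score, best_sentence = max(candidates)
    return best_sentence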
6. Conclusion
We have introduced a simple, language-agnostic way to automatically infer the nor-
malization of a morphologically rich language with respect to the target language, which
consists in clustering together words that share the same translation, and have shown
that it improves machine translation in both directions. Future work will consist in
testing our model on neural machine translation systems.
Acknowledgments
This work has been partly funded by the European Union’s Horizon 2020 research
and innovation programme under grant agreement No. 645452 (QT21).
Bibliography
Allauzen, Alexandre, Lauriane Aufrant, Franck Burlot, Ophélie Lacroix, Elena Knyazeva,
Thomas Lavergne, Guillaume Wisniewski, and François Yvon. LIMSI@WMT16: Machine
Translation of News. In Proc. WMT, pages 239–245, Berlin, Germany, 2016.
Bojar, Ondřej. English-to-Czech Factored Machine Translation. In Proc. WMT, pages
232–239, Prague, Czech Republic, 2007.
Bojar, Ondřej, Yvette Graham, Amir Kamran, and Miloš Stanojević. Results of the WMT16
Metrics Shared Task. In Proc. WMT, pages 199–231, Berlin, Germany, 2016.
Burlot, Franck and François Yvon. Morphology-Aware Alignments for Translation to and from
a Synthetic Language. In Proc. IWSLT, pages 188–195, Da Nang, Vietnam, 2015.
Burlot, Franck, Elena Knyazeva, Thomas Lavergne, and François Yvon. Two-Step MT: Predict-
ing Target Morphology. In Proc. IWSLT, Seattle, USA, 2016.
Chahuneau, Victor, Eva Schlinger, Noah A. Smith, and Chris Dyer. Translating into Morpho-
logically Rich Languages with Synthetic Phrases. In Proc. EMNLP, pages 1677–1687, 2013.
Cherry, Colin and George Foster. Batch Tuning Strategies for Statistical Machine Translation.
In Proc. NAACL-HLT, pages 427–436, Montreal, Canada, 2012.
Cho, Kyunghyun, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On
the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proc.
SSST@EMNLP, pages 103–111, Doha, Qatar, 2014.
Durgar El-Kahlout, Ilknur and François Yvon. The pay-offs of preprocessing for German-
English Statistical Machine Translation. In Proc. IWSLT, pages 251–258, Paris, France, 2010.
Dyer, Chris, Victor Chahuneau, and Noah A. Smith. A Simple, Fast, and Effective Reparame-
terization of IBM Model 2. In Proc. NAACL, pages 644–648, Atlanta, Georgia, 2013.
El Kholy, Ahmed and Nizar Habash. Translate, Predict or Generate: Modeling Rich Morphol-
ogy in Statistical Machine Translation. In Proc. EAMT, pages 27–34, Trento, Italy, 2012a.
El Kholy, Ahmed and Nizar Habash. Rich Morphology Generation Using Statistical Machine
Translation. In Proc. INLG, pages 90–94, 2012b.
Fraser, Alexander, Marion Weller, Aoife Cahill, and Fabienne Cap. Modeling Inflection and
Word-Formation in SMT. In Proc. EACL, pages 664–674, Avignon, France, 2012.
Goldwater, Sharon and David McClosky. Improving Statistical MT through Morphological
Analysis. In Proc. HLT–EMNLP, pages 676–683, Vancouver, Canada, 2005.
Heafield, Kenneth. KenLM: Faster and Smaller Language Model Queries. In Proc. WMT, pages
187–197, Edinburgh, Scotland, 2011.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola
Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej
Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open Source Toolkit for Statistical
MT. In Proc. ACL:Systems Demos, pages 177–180, Prague, Czech Republic, 2007.
Lo, Chi-kiu, Colin Cherry, George Foster, Darlene Stewart, Rabib Islam, Anna Kazantseva, and
Roland Kuhn. NRC Russian-English Machine Translation System for WMT 2016. In Proc.
WMT, pages 326–332, Berlin, Germany, 2016.
Minkov, Einat, Kristina Toutanova, and Hisami Suzuki. Generating Complex Morphology for
Machine Translation. In Proc. ACL, pages 128–135, Prague, Czech Republic, 2007.
Ney, Hermann and Maja Popovic. Improving Word Alignment Quality using Morpho-syntactic
Information. In Proc. COLING, pages 310–314, Geneva, Switzerland, 2004.
Rosa, Rudolf. Automatic post-editing of phrase-based machine translation outputs. Master’s
thesis, Institute of Formal and Applied Linguistics, Charles University, 2013.
Schmid, Helmut. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of
International Conference on New Methods in Language Processing, Manchester, UK, 1994.
Sharoff, Serge, Mikhail Kopotev, Tomaz Erjavec, Anna Feldman, and Dagmar Divjak. Design-
ing and Evaluating a Russian Tagset. In Proc. LREC, pages 279–285, Marrakech, Morocco,
2008.
Stanojević, Miloš and Khalil Sima’an. Fitting Sentence Level Translation Evaluation with Many
Dense Features. In Proc. EMNLP, pages 202–206, Doha, Qatar, 2014.
Straková, Jana, Milan Straka, and Jan Hajič. Open-Source Tools for Morphology, Lemmatiza-
tion, POS Tagging and Named Entity Recognition. In Proc. ACL: System Demos, pages 13–18,
Baltimore, MD, 2014.
Talbot, David and Miles Osborne. Modelling Lexical Redundancy for Machine Translation. In
Proc. ACL, pages 969–976, Sydney, Australia, 2006.
Toutanova, Kristina, Hisami Suzuki, and Achim Ruopp. Applying Morphology Generation
Models to Machine Translation. In Proc. ACL-08: HLT, pages 514–522, Columbus, OH, 2008.
Wang, Weiyue, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. CharacTer: Trans-
lation Edit Rate on Character Level. In Proc. WMT, pages 505–510, Berlin, Germany, 2016.