Mārcis Pinnis, Andrejs Vasiļjevs, Rihards Kalniņš, Roberts Rozis, Raivis Skadiņš, Valters Šics
Vienības gatve 75A, Riga, Latvia, LV-1004
{marcis.pinnis, rihards.kalnins, roberts.rozis, raivis.skadins, valters.sics, andrejs}@tilde.lv
In this paper, we present Tilde MT, a custom machine translation (MT) platform that provides linguistic data storage (parallel and
monolingual corpora, multilingual term collections), data cleaning and normalisation, statistical and neural machine translation system
training and hosting functionality, as well as wide integration capabilities (a machine user API and popular computer-assisted translation
tool plugins). We provide details on the most important features of the platform and describe typical MT system training
workflows for client-specific MT solution development.
[Figure fragment. Client layer: Web browser, CAT tools, mobile applications, Web widget. Tilde MT interface layer: Web site, External API.]

2.5. CAT Tool Plugins and External API
Professional translators often use specific computer-assisted translation (CAT) tools in their professional duties. Depending on
specific projects or customers, translators may have to use different CAT tools. Therefore, it is crucial to the success of an MT
platform to provide integration capabilities for at least the most popular CAT tools. MT systems trained on Tilde MT can be accessed
from at least four popular CAT tools5: MateCat (Federico et al., 2014), SDL Trados Studio, Memsource6, and memoQ7.
to the filters used for SMT systems, for NMT systems Tilde MT also performs the following filtering steps to ensure that
the parallel corpora are filtered more strictly (Pinnis et al., 2017c):

1. Incorrect language filtering using a language detection tool (Shuyo, 2010).

2. Low content overlap filtering using the cross-lingual alignment tool MPAligner (Pinnis, 2013).

3. Digit mismatch filtering, which has proven effective in identifying sentence segmentation issues in parallel corpora.

After filtering, each valid segment is cleaned in order to further reduce noise and to remove potential non-translatable
text fragments, which will be processed by the formatting tag handling method in the translation workflow prior to decoding
but are not necessary during training. The following are the main cleaning steps:

1. Removal of HTML and XML tags

2. Removal of the byte order mark

3. Removal of escaped characters (e.g., "\n")

4. Decoding of XML entities, normalisation of whitespace characters

5. Removal of empty braces and curly tags (specific to parallel corpora extracted from some CAT tools)

6. Separation of ligatures into letters (specific to parallel corpora extracted using OCR methods)

6. Morphology-driven word splitting (MWS) (Pinnis et al., 2017b) or byte-pair encoding (BPE) (Sennrich et al., 2015). For NMT
systems, tokens can be split using a morphological analyser8 and processed with BPE.

7. Source side factorisation. Tilde MT supports NMT models that use linguistic input features (Sennrich and Haddow, 2016).
Therefore, the source side can be factored using language-specific factorisation tools (depending on the source language,
either part-of-speech taggers, morphological taggers, or syntactic parsers).

3.3. SMT and NMT Model Training
After the data is pre-processed, SMT or NMT models are trained. During configuration of an MT system, users can freely select
whether to train an SMT or an NMT model.
For SMT models, word alignment is performed using fast-align (Dyer et al., 2013), after which a 7-gram translation model and the
wbe-msd-bidirectional-fe-allff 9 reordering models are built. For language modelling, the KenLM (Heafield, 2011) toolkit is used
(the n-gram order can be specified by the users). SMT systems are tuned using MERT (Bertoldi et al., 2009).
For NMT models, training data is further pre-processed by introducing unknown phenomena (i.e., unknown word tokens) within
training data following the methodology by Pinnis et al. (2017b). Then, an NMT model is trained using the configuration specified
by the user (parameters such as the vocabulary size, embedding and hidden layer dimensions, whether to use dropout, the learning
rate, and gradient clipping can be freely configured). To ensure the stability of the system, a maximum value restriction is
applied for each configuration parameter that influences the hardware resources to be consumed.
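To make the digit mismatch filter and the six cleaning steps listed for parallel data more concrete, the following is a minimal Python sketch. It is not the Tilde MT implementation: the function names (digits_match, clean_segment), the regular expressions, the assumed shape of CAT-tool curly tags (e.g. "{1}"), and the small ligature table are all illustrative assumptions, and whitespace normalisation is applied last for convenience rather than strictly in the listed order.

```python
import html
import re

# All patterns below are illustrative assumptions, not Tilde MT's actual rules.
TAG_RE = re.compile(r"<[^<>]+>")                          # crude HTML/XML tag matcher
ESCAPE_RE = re.compile(r"\\[nrt]")                        # literal escapes such as a backslash followed by n
CURLY_TAG_RE = re.compile(r"\{/?\d+\}")                   # assumed CAT-tool placeholder shape, e.g. {1}
EMPTY_BRACES_RE = re.compile(r"\{\s*\}|\(\s*\)|\[\s*\]")  # empty braces left after tag removal
WHITESPACE_RE = re.compile(r"\s+")
DIGIT_RE = re.compile(r"\d")

# A few common Latin ligatures that OCR output may contain.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def digits_match(source: str, target: str) -> bool:
    """Return True if both sides contain the same multiset of digits.

    One simple reading of a digit mismatch filter: a sentence pair is
    kept only when this check passes.
    """
    return sorted(DIGIT_RE.findall(source)) == sorted(DIGIT_RE.findall(target))


def clean_segment(text: str) -> str:
    """Apply the cleaning steps to a single segment."""
    text = TAG_RE.sub(" ", text)             # 1. remove HTML and XML tags
    text = text.replace("\ufeff", "")        # 2. remove the byte order mark
    text = ESCAPE_RE.sub(" ", text)          # 3. remove escaped characters
    text = html.unescape(text)               # 4. decode entities (superset of XML entities)
    text = CURLY_TAG_RE.sub(" ", text)       # 5. remove curly tags ...
    text = EMPTY_BRACES_RE.sub(" ", text)    #    ... and empty braces
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)  # 6. split ligatures into letters
    return WHITESPACE_RE.sub(" ", text).strip()  # 4. normalise whitespace


if __name__ == "__main__":
    raw = "\ufeff<p>The o\ufb03ce &amp; lab\\n{1} opened {}</p>"
    print(clean_segment(raw))
    print(digits_match("Pay 25 EUR by 2016.", "Samaksāt 25 EUR līdz 2016."))
```

In such a sketch, a sentence pair would first be checked with digits_match (together with the language and content overlap filters, which are not shown), and only pairs that pass would have clean_segment applied to each side.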
[Figure 2: A broad overview of the translation workflow. Stages shown: INPUT document; Handling of tags (text extraction); Data pre-processing; Translation with SMT or NMT engines; Data post-processing; Tag re-insertion (text insertion); OUTPUT document.]

noise, development. The platform also allows training more robust NMT models by preparing data so that it contains unknown
phenomena in common contexts. We also described the translation workflow and its ability to handle unknown phenomena and to
facilitate customer-specific customisation needs. Finally, we also briefly discussed the various integration possibilities offered
by Tilde MT, such as the CAT tool plugins, the external API, and the Tilde Terminology integration.

Acknowledgements
In accordance with the contract No. between the “Forest Sector Competence Centre” Ltd. and the Central Finance and Contracting
Agency, concluded on 13 October 2016, the study is conducted by Tilde Ltd. with support from the European Regional Development
Fund (ERDF) within the framework of the project “Forest Sector Competence Centre”.

MWS). For NMT systems, rare and unknown words are
identified (based on word part unigram and bigram statis-
