Tilde MT Platform For Developing Client Specific MT Solutions
Mārcis Pinnis, Andrejs Vasiļjevs, Rihards Kalniņš, Roberts Rozis, Raivis Skadiņš, Valters Šics
Tilde
Vienības gatve 75A, Riga, Latvia, LV-1004
{marcis.pinnis, rihards.kalnins, roberts.rozis, raivis.skadins, valters.sics, andrejs}@tilde.lv
Abstract
In this paper, we present Tilde MT, a custom machine translation (MT) platform that provides linguistic data storage (parallel and
monolingual corpora, multilingual term collections), data cleaning and normalisation, statistical and neural machine translation system
training and hosting functionality, as well as wide integration capabilities (a machine user API and popular computer-assisted translation
tool plugins). We provide details for the most important features of the platform and elaborate on typical MT system training
workflows for client-specific MT solution development.
[Figure: Client layer (Web browser, CAT tools, Mobile applications, Web widget); Tilde MT Interface layer (Web site, External API)]
2.5. CAT Tool Plugins and External API
Professional translators often use specific computer-assisted translation tools in their professional duties. Depending on specific projects or customers, translators may have to use different CAT tools. Therefore, it is crucial to the success of an MT platform to provide integration capabilities for at least the most popular CAT tools. MT systems trained on Tilde MT can be accessed from at least four popular CAT tools: MateCat (Federico et al., 2014), SDL Trados Studio, Memsource, and memoQ.
to the filters used for SMT systems, for NMT systems Tilde MT also performs the following filtering steps to ensure that the parallel corpora are filtered more strictly (Pinnis et al., 2017c):
1. Incorrect language filtering using a language detection tool (Shuyo, 2010).
2. Low content overlap filtering using the cross-lingual alignment tool MPAligner (Pinnis, 2013).
3. Digit mismatch filtering, which proved effective in identifying sentence segmentation issues in parallel corpora.
After filtering, each valid segment is cleaned in order to further reduce noise and to remove potentially non-translatable text fragments, which are processed by the formatting tag handling method in the translation workflow prior to decoding but are not necessary during training. The following are the main cleaning steps:
1. Removal of HTML and XML tags.
2. Removal of the byte order mark.
3. Removal of escaped characters (e.g., "\n").
4. Decoding of XML entities and normalisation of whitespace characters.
5. Removal of empty braces and curly tags (specific to parallel corpora extracted from some CAT tools).
6. Separation of ligatures into letters (specific to parallel corpora extracted using OCR methods).

6. Morphology-driven word splitting (MWS) (Pinnis et al., 2017b) or byte-pair encoding (BPE) (Sennrich et al., 2015). For NMT systems, tokens can be split using a morphological analyser and processed with BPE.
7. Source side factorisation. Tilde MT supports NMT models that use linguistic input features (Sennrich and Haddow, 2016). Therefore, the source side can be factored using language-specific factorisation tools (depending on the source language, either part-of-speech or morphological taggers or syntactic parsers).
3.3. SMT and NMT Model Training
After the data is pre-processed, SMT or NMT models are trained. During configuration of an MT system, users can freely select whether to train an SMT or an NMT model.
For SMT models, word alignment is performed using fast-align (Dyer et al., 2013), after which a 7-gram translation and the wbe-msd-bidirectional-fe-allff reordering models are built. For language modelling, the KenLM (Heafield, 2011) toolkit is used (the n-gram order can be specified by the users). SMT systems are tuned using MERT (Bertoldi et al., 2009).
For NMT models, training data is further pre-processed by introducing unknown phenomena (i.e., unknown word tokens) within training data following the methodology by Pinnis et al. (2017b). Then, an NMT model is trained using the configuration specified by the user (the vocabulary size, embedding and hidden layer dimensions, whether to use dropout, the learning rate, gradient clipping, and other parameters can be freely configured). To ensure the stability of the system, a maximum value restriction is applied for each configuration parameter that influences the hardware resources to be consumed.
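As a minimal sketch of the maximum value restriction described above, the following Python fragment clamps the resource-related parameters of a user-specified configuration. The class `NmtConfig`, the function `apply_limits`, and all limit values are hypothetical and are not taken from Tilde MT. Whether the platform clamps or rejects out-of-range values is not stated, so clamping is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NmtConfig:
    """Hypothetical user-facing NMT training configuration."""
    vocabulary_size: int
    embedding_dim: int
    hidden_dim: int
    use_dropout: bool
    learning_rate: float
    gradient_clip: float


# Assumed upper bounds for parameters that drive memory and compute usage.
# The numbers are illustrative only.
MAX_LIMITS = {
    "vocabulary_size": 100_000,
    "embedding_dim": 1024,
    "hidden_dim": 2048,
}


def apply_limits(config, limits=MAX_LIMITS):
    """Return a copy of config with each limited parameter capped at its maximum."""
    capped = {
        name: min(getattr(config, name), maximum)
        for name, maximum in limits.items()
    }
    return replace(config, **capped)
```

Parameters that do not affect hardware consumption (here the learning rate, dropout, and gradient clipping) are left untouched, which mirrors the statement that only resource-related parameters are restricted.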
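The main cleaning steps listed earlier for valid segments can be illustrated with a short Python sketch. It is not the Tilde MT implementation: the function name `clean_segment`, the regular expressions, the ligature table, and the assumed `{1}`-style form of CAT tool curly tags are illustrative assumptions.

```python
import html
import re

TAG_RE = re.compile(r"<[^<>]+>")                     # HTML and XML tags (naive pattern)
ESCAPE_RE = re.compile(r"\\[nrt]")                   # literal escaped characters such as "\n"
CURLY_TAG_RE = re.compile(r"\{\d+\}")                # assumed CAT placeholder form, e.g. {1}
EMPTY_BRACES_RE = re.compile(r"\(\s*\)|\[\s*\]|\{\s*\}")
WHITESPACE_RE = re.compile(r"\s+")

# A few common Latin ligatures that OCR output may contain.
LIGATURE_TABLE = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def clean_segment(text):
    """Apply the cleaning steps to one side of a parallel segment."""
    text = TAG_RE.sub(" ", text)            # 1. remove HTML and XML tags
    text = text.replace("\ufeff", "")       # 2. remove the byte order mark
    text = ESCAPE_RE.sub(" ", text)         # 3. remove escaped characters
    text = html.unescape(text)              # 4a. decode entities
    text = CURLY_TAG_RE.sub(" ", text)      # 5a. remove curly tags
    text = EMPTY_BRACES_RE.sub(" ", text)   # 5b. remove empty braces
    text = text.translate(LIGATURE_TABLE)   # 6. split ligatures into letters
    # 4b. whitespace normalisation is done last so that gaps left by the
    # removals above are collapsed as well.
    return WHITESPACE_RE.sub(" ", text).strip()
```

The tag pattern is deliberately simple and would also strip text such as `a < b and c > d`; a production system would need a more careful treatment.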
[Figure 2 elements: INPUT document; Handling of tags (text extraction); Data pre-processing; Translation with SMT or NMT engines; Data post-processing; Tag re-insertion (text insertion); OUTPUT document]
Figure 2: A broad overview of the translation workflow
MWS). For NMT systems, rare and unknown words are identified (based on word part unigram and bigram statistics in the training data) and replaced with unknown word tokens in order to assist the NMT model in the handling of rare and unknown phenomena (Pinnis et al., 2017c). Each pre-processed sentence is then translated using the MT engine that was used during training.
After translation, unknown word tokens are replaced back with their source language tokens, and the text is recased and detokenised using language-specific detokenisation rules. Tilde MT also provides functionality for rule-based localisation, i.e., the transformation of certain types of tokens according to customer-specified localisation rules. For instance, a customer can pre-set the desired styles of quotation marks and apostrophes, number formats (i.e., decimal point and thousand separators), and even conversion rules for different units of measurement (e.g., imperial to metric units). Finally, if the source was a translation segment, formatting tags are re-inserted in the translated text and (in case of document and Web page translation) the translated segment is inserted in the final document.
In order for the whole translation workflow to work, it is important to keep track of word and phrase alignments and the changes of the word and phrase alignments at each step. When using SMT models, the word alignments are provided by the phrase-based translation model, and when using NMT models, the word alignments are extracted (Pinnis et al., 2017c) from the alignment matrices produced by the attention mechanism of the NMT model.
To address scalability requirements, Tilde MT allows starting multiple translation server instances of each MT system. To save computing resources, translation servers can be set to fall asleep after a certain time without any translation requests.
5. Conclusion
In this paper, we presented Tilde MT, a distributed cloud-based custom machine translation platform that is capable of supporting SMT and NMT systems. Tilde MT with its feature-rich MT system training workflow alleviates a lot of manual work necessary for data preparation prior to training. The workflows have been specifically adjusted to cater for the development of NMT systems, which are more sensitive to systematic noise. The platform also allows training more robust NMT models by preparing data in a way that it contains unknown phenomena in common contexts. We also described the translation workflow and its abilities to handle unknown phenomena and to facilitate customer-specific customisation needs. Finally, we also briefly discussed the various integration possibilities offered by Tilde MT, such as the CAT tool plugins, the external API, and the Tilde Terminology integration.
6. Acknowledgements
In accordance with the contract No. 1.2.1.1/16/A/009 between the "Forest Sector Competence Centre" Ltd. and the Central Finance and Contracting Agency, concluded on 13th of October, 2016, the study is conducted by Tilde Ltd. with support from the European Regional Development Fund (ERDF) within the framework of the project "Forest Sector Competence Centre".
7. Bibliographical References
Bertoldi, N., Haddow, B., and Fouet, J.-B. (2009). Improved Minimum Error Rate Training in Moses. The Prague Bulletin of Mathematical Linguistics, 91(1):7–16.
Chen, Y. and Eisele, A. (2012). MultiUN v2: UN Documents with Multilingual Alignments. In LREC, pages 2500–2504.
Dyer, C., Chahuneau, V., and Smith, N. A. (2013). A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 644–648, Atlanta, USA.
Federico, M., Bertoldi, N., Cettolo, M., Negri, M., Turchi, M., Trombetti, M., Cattelan, A., Farina, A., Lupinetti, D., Martines, A., et al. (2014). The MateCat Tool. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 129–132.
Hajlaoui, N., Kolovratnik, D., Väyrynen, J., Steinberger, R., and Varga, D. (2014). DCEP-Digital Corpus of the European Parliament. In LREC, pages 3164–3171.
Heafield, K. (2011). KenLM: Faster and Smaller Language Model Queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
Junczys-Dowmunt, M., Dwojak, T., and Hoang, H. (2016). Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions. arXiv preprint arXiv:1610.01108.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
Koehn, P. (2005). Europarl: a Parallel Corpus for Statistical Machine Translation. In MT summit, volume 5, pages 79–86.
Lommel, A. R. and DePalma, D. A. (2016). Europe's Leading Role in Machine Translation. Technical report, Common Sense Advisory.
Microsoft. (2015). Translation and UI Strings Glossaries.
Pinnis, M., Gornostay, T., Skadiņš, R., and Vasiļjevs, A. (2013). Online Platform for Extracting, Managing, and Utilising Multilingual Terminology. In Proceedings of the Third Biennial Conference on Electronic Lexicography, eLex 2013, pages 122–131, Tallinn, Estonia. Trojina, Institute for Applied Slovene Studies (Ljubljana, Slovenia) / Eesti Keele Instituut (Tallinn, Estonia).
Pinnis, M., Krišlauks, R., Deksne, D., and Miks, T. (2017a). Evaluation of Neural Machine Translation for Highly Inflected and Small Languages. In Proceedings of the 18th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2017), Budapest, Hungary.
Pinnis, M., Krišlauks, R., Deksne, D., and Miks, T. (2017b). Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data. In Proceedings of the 20th International Conference of Text, Speech and Dialogue (TSD2017), Prague, Czechia.
Pinnis, M., Krišlauks, R., Miks, T., Deksne, D., and Šics, V. (2017c). Tilde's Machine Translation Systems for WMT 2017. In Proceedings of the Second Conference on Machine Translation (WMT 2017), Volume 2: Shared Task Papers, pages 374–381, Copenhagen, Denmark. Association for Computational Linguistics.
Pinnis, M. (2013). Context Independent Term Mapper for European Languages. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2013), pages 562–570, Hissar, Bulgaria.
Pinnis, M. (2015). Dynamic Terminology Integration Methods in Statistical Machine Translation. In Proceedings of the Eighteenth Annual Conference of the European Association for Machine Translation (EAMT 2015), pages 89–96, Antalya, Turkey. European Association for Machine Translation.
Rozis, R. and Skadiņš, R. (2017). Tilde MODEL - Multilingual Open Data for EU Languages. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 263–265.
Sennrich, R. and Haddow, B. (2016). Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation (WMT 2016) - Volume 1: Research Papers, pages 83–91.
Sennrich, R., Haddow, B., and Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2015), Berlin, Germany. Association for Computational Linguistics.
Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Barone, A. V. M., Mokry, J., et al. (2017). Nematus: a Toolkit for Neural Machine Translation. arXiv preprint arXiv:1703.04357.
Shuyo, N. (2010). Language Detection Library for Java.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D. (2006). The JRC-Acquis: a Multilingual Aligned Parallel Corpus with 20+ Languages. arXiv preprint cs/0609058.
Steinberger, R., Eisele, A., Klocek, S., Pilos, S., and Schlüter, P. (2012). DGT-TM: a Freely Available Translation Memory in 22 Languages. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), pages 454–459.
Tiedemann, J. (2009). News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In Recent advances in natural language processing, volume 5, pages 237–248.
Vasiļjevs, A., Skadiņš, R., and Tiedemann, J. (2012). LetsMT!: A Cloud-Based Platform for Do-It-Yourself Machine Translation. In Min Zhang, editor, Proceedings of the ACL 2012 System Demonstrations, pages 43–48, Jeju Island, Korea. Association for Computational Linguistics.
Zariņa, I., Ņikiforovs, P., and Skadiņš, R. (2015). Word alignment based parallel corpora evaluation and cleaning using machine learning techniques. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation.