Tilde MT Platform For Developing Client Specific MT Solutions


Mārcis Pinnis, Andrejs Vasiļjevs, Rihards Kalniņš, Roberts Rozis, Raivis Skadiņš, Valters Šics
Tilde
Vienības gatve 75A, Riga, Latvia, LV-1004
{marcis.pinnis, rihards.kalnins, roberts.rozis, raivis.skadins, valters.sics, andrejs}@tilde.lv

Abstract
In this paper, we present Tilde MT, a custom machine translation (MT) platform that provides linguistic data storage (parallel,
monolingual corpora, multilingual term collections), data cleaning and normalisation, statistical and neural machine translation system
training and hosting functionality, as well as wide integration capabilities (a machine user API and popular computer-assisted translation
tool plugins). We provide details on the most important features of the platform and elaborate on typical MT system training
workflows for client-specific MT solution development.

Keywords: machine translation, cloud-based platform, data processing workflows

1. Introduction

Today, as globalisation and cross-border trade infuse more sectors of the economy, the need for translations and multilingual content is continuously increasing. The demand for translations is surpassing the supply that professional translation services can handle. As Common Sense Advisory reported in a 2016 market study, "enterprises intend to increase their translation volumes by 67% over current levels by 2020" (Lommel and DePalma, 2016). An obvious alternative for customers who cannot afford professional translation services or who have too much content to translate, and also an obvious choice for translation and localisation service providers who see the potential in increasing their productivity, is machine translation (MT). General MT providers (such as Google¹ and Microsoft²) cater for the masses with general-domain MT systems. However, customers who require purpose-built and highly customised MT systems, particularly for complex or low-resourced languages, turn to MT service providers that offer MT system development and customisation capabilities as a full-service package.

One such custom MT platform is Tilde MT³, the successor of the LetsMT! platform (Vasiļjevs et al., 2012) for Statistical Machine Translation (SMT) system development, first launched in 2012. Tilde MT builds upon the LetsMT! platform by providing greater customisation capabilities, smarter data processing workflows, and current state-of-the-art Neural Machine Translation (NMT) system support.

The paper is further structured as follows: Section 2 provides an overview of the main features of the Tilde MT platform, Section 3 describes the MT system training workflow with the different tools used for data processing, Section 4 describes the MT system translation workflow, and Section 5 concludes the paper.

2. Overview of Tilde MT

Tilde MT is a cloud-based custom MT platform that allows users to store linguistic resources (such as parallel and monolingual corpora and multilingual term collections); train SMT and NMT systems; integrate the systems through computer-assisted translation (CAT) tool plugins or the Tilde MT external API in users' translation and multilingual content creation workflows; and perform translation of text snippets, documents of various popular formats, and websites directly in the Tilde MT graphical user interface or using a widget that can be integrated in any website.

2.1. SMT and NMT System Support

Tilde MT supports two MT paradigms: statistical machine translation and neural machine translation. The platform allows users to train Moses phrase-based SMT systems (Koehn et al., 2007) and attention-based encoder-decoder NMT systems with multiplicative long short-term memory units using the Nematus toolkit (Sennrich et al., 2017). The platform has been designed to allow switching to different NMT engines easily. For translation, Nematus NMT models are converted to Marian (formerly AmuNMT) NMT models (Junczys-Dowmunt et al., 2016), which reach a much higher translation speed (up to 10 times faster compared to Nematus in non-batched translation scenarios).

2.2. Cloud-based Infrastructure

To facilitate on-demand training and deployment of MT systems, it is important for the platform to be highly scalable, available, and reliable. To address these requirements, Tilde MT has been developed as a distributed cloud-based platform (see Figure 1) that is able to dynamically start and turn off computing nodes depending on current workloads. The computing nodes are responsible for running MT system training tasks and translation servers. To provide MT services also to customers with security concerns or customers whose data is not allowed to leave the customers' infrastructure, Tilde MT can also be deployed as an enterprise solution in customer infrastructure.

2.3. Tilde Data Library

The platform features a resource-rich data repository, the Tilde Data Library. The library is used as the central data repository for Tilde MT, as well as an open facility for registered users to upload their own corpora (both parallel and

¹ https://translate.google.com
² https://www.bing.com/translator
³ https://tilde.com/mt

monolingual) for training MT engines. It stores publicly available and proprietary parallel and monolingual corpora as well as multilingual term collections that users can use to train their MT systems within the Tilde MT platform. Several of the largest (publicly available) parallel corpora in the Tilde Data Library are the DGT-TM (Steinberger et al., 2012), Tilde MODEL (Rozis and Skadiņš, 2017), Open Subtitles (Tiedemann, 2009), MultiUN (Chen and Eisele, 2012), DCEP (Hajlaoui et al., 2014), JRC-Acquis (Steinberger et al., 2006), Europarl (Koehn, 2005), and Microsoft Translation Memories and UI Strings Glossaries (Microsoft, 2015). The Tilde Data Library stores approximately 12.35 billion parallel segments for 58 languages and over 4 million terms for more than 125 languages.

[Figure 1 shows four layers. Client layer: web browser, CAT tools, mobile applications, web widget. Interface layer: Tilde MT web site, external API. Logic layer: authentication and authorisation, translation service, data / system / user management. Data layer: Tilde Data Library and a high performance computing (HPC) cluster with an HPC task scheduler and CPU and GPU nodes.]
Figure 1: Tilde MT infrastructure design

2.4. Tilde Terminology Integration

Term collections are important linguistic resources that are often used by translators who work on domain-specific translation tasks. To alleviate the need for users to store and manage their terminological resources on multiple platforms, the Tilde MT platform allows users to access their Tilde Terminology (Pinnis et al., 2013) resources. Users can add their term collections to MT system training tasks for static terminology integration as well as to running SMT⁴ systems for dynamic terminology integration (Pinnis, 2015). Integration of terminology has been shown to improve term translation accuracy by up to 52.6% (Pinnis, 2015).

2.5. CAT Tool Plugins and External API

Professional translators often use specific computer-assisted translation tools in their professional duties. Depending on specific projects or customers, translators may have to use different CAT tools. Therefore, it is crucial to the success of an MT platform to provide integration capabilities for at least the most popular CAT tools. MT systems trained on Tilde MT can be accessed from at least four popular CAT tools⁵: MateCat (Federico et al., 2014), SDL Trados Studio, Memsource⁶, and memoQ⁷.

3. MT System Training Workflow

Tilde MT provides users with rich customisation possibilities when training MT systems. Users can specify which filtering and cleaning steps to take, which data pre-processing tools to use, and which MT models to train and with which configurations. Below, we describe in more detail the main MT system training capabilities of Tilde MT.

3.1. Data Filtering and Cleaning

Not all data that users upload in the Tilde Data Library as parallel data is actually parallel. For instance, the data may contain misalignment issues, formatting issues, encoding corruption issues, sentence breaking issues, etc. Therefore, the Tilde MT platform performs data filtering before MT system training. The following issues are addressed by various filters in Tilde MT:

1. Source-source or target-target entries in parallel data (equal source/target entries are filtered out).

2. Sentence splitting issues (segments >1000 symbols or >400 tokens are filtered out; the numerical thresholds here and further can be adjusted for each individual training task).

3. Data corruption through optical character recognition (OCR), e.g., when processing PDF documents (segments containing tokens with >50 symbols are filtered out).

4. Redundancy issues (duplicate entries are filtered out).

5. Partial translation (also sentence splitting) issues (entries where the length ratio between the source and target segments is too small (e.g., <0.3) are filtered out).

6. Foreign language data issues (entries containing letters from neither source nor target languages are filtered out).

7. Sentence misalignment issues (sentences failing a cross-lingual alignment test using c-eval (Zariņa et al., 2015) are filtered out).

In our previous research, we identified that NMT systems are sensitive to systematic noise that can be found in the parallel data (Pinnis et al., 2017a). Therefore, additionally

⁴ For NMT systems, dynamic integration is not yet available.
⁵ For more details, refer to: https://www.tilde.com/products-and-services/machine-translation/features/integration-with-cat
⁶ https://memsource.com
⁷ https://www.memoq.com
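
As an illustration only, filters 1 to 5 of Section 3.1 can be approximated in a few lines of Python. This is a minimal sketch and not the Tilde MT implementation: the function name `filter_pairs` and its parameter names are hypothetical, lengths are assumed to be measured in characters and whitespace-separated tokens, the length ratio is assumed to be symmetric (shorter segment over longer segment), and the guard against empty segments is an added assumption. Filters 6 and 7 depend on language-specific alphabets and on the c-eval tool and are therefore left out.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def filter_pairs(
    pairs: Iterable[Pair],
    max_chars: int = 1000,      # filter 2: maximum segment length in symbols
    max_tokens: int = 400,      # filter 2: maximum segment length in tokens
    max_token_chars: int = 50,  # filter 3: maximum length of a single token
    min_ratio: float = 0.3,     # filter 5: minimum source/target length ratio
) -> Iterator[Pair]:
    """Yield only the segment pairs that pass filters 1-5 (hypothetical sketch)."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # assumption: empty segments are never useful
        if src == tgt:
            continue  # filter 1: identical source and target
        segments = (src, tgt)
        if any(len(s) > max_chars or len(s.split()) > max_tokens for s in segments):
            continue  # filter 2: overly long segments
        if any(len(tok) > max_token_chars for s in segments for tok in s.split()):
            continue  # filter 3: implausibly long tokens (e.g. OCR debris)
        if min(len(src), len(tgt)) / max(len(src), len(tgt)) < min_ratio:
            continue  # filter 5: suspicious length ratio
        if (src, tgt) in seen:
            continue  # filter 4: duplicate entries
        seen.add((src, tgt))
        yield src, tgt
```

Because the thresholds are keyword arguments, they can be changed per call, which mirrors the statement above that the numerical thresholds can be adjusted for each individual training task.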

to the filters used for SMT systems, for NMT systems Tilde MT also performs the following filtering steps to ensure that the parallel corpora are filtered more strictly (Pinnis et al., 2017c):

1. Incorrect language filtering using a language detection tool (Shuyo, 2010).

2. Low content overlap filtering using the cross-lingual alignment tool MPAligner (Pinnis, 2013).

3. Digit mismatch filtering, which has been shown to be effective in identifying parallel corpora sentence segmentation issues.

After filtering, each valid segment is cleaned in order to further reduce noise and to remove potential non-translatable text fragments, which will be processed by the formatting tag handling method in the translation workflow prior to decoding but are not necessary during training. The following are the main cleaning steps:

1. Removal of HTML and XML tags

2. Removal of the byte order mark

3. Removal of escaped characters (e.g., "\n")

4. Decoding of XML entities, normalisation of whitespace characters

5. Removal of empty braces and curly tags (specific to parallel corpora extracted from some CAT tools)

6. Separation of ligatures into letters (specific to parallel corpora extracted using OCR methods)

3.2. Data Pre-processing

Next, the filtered and cleaned corpora are pre-processed using standard and custom tools. This step is identical for both the training and translation workflows. The following pre-processing steps are performed:

1. Normalisation of punctuation. Tilde MT allows limiting MT models to one standard of quotation marks and apostrophes.

2. Identification of terminology. For SMT systems, dynamic terminology integration support ensures that terms can be identified in the source text and possible translation equivalents can be provided to the SMT engine before the actual translation.

3. Identification of non-translatable entities. E-mail addresses, URLs, file addresses and XML tags can be identified and replaced with place-holders.

4. Tokenisation. Tilde MT uses a regular expression-based tokeniser that allows applying customised tokenisation rules for each language and customer.

5. Truecasing. The standard Moses truecasing tool truecase.perl can be used to truecase the first or all words of each sentence.

6. Morphology-driven word splitting (MWS) (Pinnis et al., 2017b) or byte-pair encoding (BPE) (Sennrich et al., 2015). For NMT systems, tokens can be split using a morphological analyser⁸ and processed with BPE.

7. Source side factorisation. Tilde MT supports NMT models that use linguistic input features (Sennrich and Haddow, 2016). Therefore, the source side can be factored using language-specific factorisation tools (depending on the source language, either part-of-speech or morphological taggers or syntactic parsers).

3.3. SMT and NMT Model Training

After the data is pre-processed, SMT or NMT models are trained. During configuration of an MT system, users can freely select whether to train an SMT or an NMT model.

For SMT models, word alignment is performed using fast-align (Dyer et al., 2013), after which a 7-gram translation model and the wbe-msd-bidirectional-fe-allff⁹ reordering models are built. For language modelling, the KenLM (Heafield, 2011) toolkit is used (the n-gram order can be specified by the users). SMT systems are tuned using MERT (Bertoldi et al., 2009).

For NMT models, training data is further pre-processed by introducing unknown phenomena (i.e., unknown word tokens) within training data following the methodology by Pinnis et al. (2017b). Then, an NMT model is trained using the configuration specified by the user (parameters such as the vocabulary size, embedding and hidden layer dimensions, whether to use dropout, the learning rate, and gradient clipping can be freely configured). To ensure the stability of the system, a maximum value restriction is applied for each configuration parameter that influences the hardware resources to be consumed.

4. Translation Workflow

When an MT model is trained, a translation server can be started. A typical translation server (see Figure 2 for a broad overview) allows users to translate text snippets, translation segments (i.e., content that includes tags), documents (e.g., various OpenDocument¹⁰ and Office Open XML¹¹ formats, translation and localisation formats, such as XML Localisation Interchange File Format (XLIFF)¹² (with different variations), Translation Memory eXchange (TMX)¹³, etc.), and web sites (the latter two basically consist of zero to many translation segments).

When translating translation segments, first, formatting tags are removed from the segments and the tag positions are remembered for reinsertion after translation. Then, the text is pre-processed using the same steps that were used for the training data. In addition to the previous pre-processing steps, the text is also split into sentences (before

⁸ Currently, for Latvian and English only.
⁹ For more information, see http://www.statmt.org/moses/?n=FactoredTraining.BuildReorderingModel
¹⁰ http://opendocumentformat.org/
¹¹ See ISO/IEC 29500 at http://standards.iso.org/ittf/PubliclyAvailableStandards
¹² http://docs.oasis-open.org/xliff
¹³ http://www.ttt.org/oscarstandards/tmx/tmx13.htm
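
The six cleaning steps of Section 3.1 lend themselves to a short illustration. The following Python sketch is not the Tilde MT code: the function name `clean_segment`, the regular expressions, the choice of escape sequences, the interpretation of "curly tags" as numbered placeholders such as `{1}`, and the ligature table are all assumptions made for the example. The steps are applied in the order in which they are listed.

```python
import html
import re

# All patterns below are illustrative assumptions, not Tilde MT's actual rules.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")                # step 1: HTML/XML tags
ESCAPED_RE = re.compile(r"\\[nrt]")                       # step 3: literal "\n", "\r", "\t"
CURLY_TAG_RE = re.compile(r"\{\d+\}")                     # step 5: assumed CAT-tool tags, e.g. {1}
EMPTY_BRACES_RE = re.compile(r"\(\s*\)|\[\s*\]|\{\s*\}")  # step 5: empty braces
LIGATURES = {                                             # step 6: common Latin ligatures
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}


def clean_segment(text: str) -> str:
    """Apply the cleaning steps to one segment (hypothetical sketch)."""
    text = TAG_RE.sub(" ", text)           # 1. remove HTML and XML tags
    text = text.replace("\ufeff", "")      # 2. remove the byte order mark
    text = ESCAPED_RE.sub(" ", text)       # 3. remove escaped characters
    text = html.unescape(text)             # 4a. decode XML/HTML entities
    text = CURLY_TAG_RE.sub(" ", text)     # 5a. remove curly tags
    text = EMPTY_BRACES_RE.sub(" ", text)  # 5b. remove empty braces
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)  # 6. split ligatures into letters
    return " ".join(text.split())          # 4b. normalise whitespace
```

Whitespace normalisation is done last so that the gaps left by removed fragments collapse into single spaces.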

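
The tag handling mentioned in Section 4 (tags are removed, their positions are remembered, and they are reinserted after translation) can be pictured as follows. This is a hedged sketch under simplifying assumptions, not the platform's method: the names `strip_tags` and `reinsert_tags` are hypothetical, tokens are split on whitespace, and the word alignment is assumed to be given as a mapping from source token index to target token index. Opening tags are attached to the token that follows them and closing tags to the token that precedes them.

```python
import re
from typing import Dict, List, Tuple

TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")  # assumed tag syntax


def strip_tags(segment: str) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Return the plain tokens and (source token index, tag) pairs."""
    tokens: List[str] = []
    tags: List[Tuple[int, str]] = []
    pos = 0
    for match in TAG_RE.finditer(segment):
        tokens.extend(segment[pos:match.start()].split())
        tags.append((len(tokens), match.group()))  # tag sits before token len(tokens)
        pos = match.end()
    tokens.extend(segment[pos:].split())
    return tokens, tags


def reinsert_tags(
    target_tokens: List[str],
    tags: List[Tuple[int, str]],
    alignment: Dict[int, int],
) -> str:
    """Place each tag next to the target token aligned to its source neighbour."""
    n = len(target_tokens)
    before: Dict[int, List[str]] = {}
    after: Dict[int, List[str]] = {}
    for src_idx, tag in tags:
        if tag.startswith("</"):
            # closing tag: follows the token aligned to the preceding source token
            after.setdefault(alignment.get(src_idx - 1, n - 1), []).append(tag)
        else:
            # opening tag: precedes the token aligned to the following source token
            before.setdefault(alignment.get(src_idx, n), []).append(tag)
    out: List[str] = []
    for i, token in enumerate(target_tokens):
        out.extend(before.get(i, []))
        out.append(token)
        out.extend(after.get(i, []))
    out.extend(before.get(n, []))  # tags that could not be aligned go to the end
    return " ".join(out)
```

For example, stripping `Click <b>Save</b> now` yields the tokens `Click`, `Save`, `now` with tags recorded at positions 1 and 2; given a translation whose third token is aligned to `Save`, the tags are wrapped around that third token. Detokenisation, which would remove the spaces around the tags, is a separate step.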
MWS). For NMT systems, rare and unknown words are identified (based on word part unigram and bigram statistics in the training data) and replaced with unknown word tokens in order to assist the NMT model in the handling of rare and unknown phenomena (Pinnis et al., 2017c). Each pre-processed sentence is then translated using the MT engine that was used during training.

[Figure 2 shows the pipeline: input document → handling of tags (text extraction) → data pre-processing → translation with SMT or NMT engines → data post-processing → tag re-insertion (text insertion) → output document.]
Figure 2: A broad overview of the translation workflow

After translation, unknown word tokens are replaced back with their source language tokens, and the text is recased and detokenised using language-specific detokenisation rules. Tilde MT also provides functionality for rule-based localisation, i.e., the transformation of certain types of tokens according to customer-specified localisation rules. For instance, a customer can pre-set the desired styles of quotation marks and apostrophes, number formats (i.e., decimal point and thousand separators), and even conversion rules for different units of measurement (e.g., imperial to metric units, etc.). Finally, if the source was a translation segment, formatting tags are re-inserted in the translated text and (in case of document and Web page translation) the translated segment is inserted in the final document.

In order for the whole translation workflow to work, it is important to keep track of word and phrase alignments and the changes of the word and phrase alignments at each step. When using SMT models, the word alignments are provided by the phrase-based translation model, and when using NMT models, the word alignments are extracted (Pinnis et al., 2017c) from the alignment matrices produced by the attention mechanism of the NMT model.

To address scalability requirements, Tilde MT allows starting multiple translation server instances of each MT system. To save computing resources, translation servers can be set to fall asleep after a certain time without any translation requests.

5. Conclusion

In this paper, we presented Tilde MT, a distributed cloud-based custom machine translation platform that is capable of supporting SMT and NMT systems. Tilde MT with its feature-rich MT system training workflow alleviates a lot of manual work necessary for data preparation prior to training. The workflows have been specifically adjusted to cater for the development of NMT systems, which are more sensitive to systematic noise. The platform also allows training more robust NMT models by preparing data in a way that it contains unknown phenomena in common contexts. We also described the translation workflow and its abilities to handle unknown phenomena and to facilitate customer-specific customisation needs. Finally, we also briefly discussed the various integration possibilities offered by Tilde MT, such as the CAT tool plugins, the external API, and the Tilde Terminology integration.

6. Acknowledgements

In accordance with the contract No. 1.2.1.1/16/A/009 between the "Forest Sector Competence Centre" Ltd. and the Central Finance and Contracting Agency, concluded on 13th of October, 2016, the study is conducted by Tilde Ltd. with support from the European Regional Development Fund (ERDF) within the framework of the project "Forest Sector Competence Centre".

7. Bibliographical References

Bertoldi, N., Haddow, B., and Fouet, J.-B. (2009). Improved Minimum Error Rate Training in Moses. The Prague Bulletin of Mathematical Linguistics, 91(1):7–16.
Chen, Y. and Eisele, A. (2012). MultiUN v2: UN Documents with Multilingual Alignments. In LREC, pages 2500–2504.
Dyer, C., Chahuneau, V., and Smith, N. A. (2013). A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), number June, pages 644–648, Atlanta, USA.
Federico, M., Bertoldi, N., Cettolo, M., Negri, M., Turchi, M., Trombetti, M., Cattelan, A., Farina, A., Lupinetti, D., Martines, A., et al. (2014). The MateCat Tool. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 129–132.
Hajlaoui, N., Kolovratnik, D., Väyrynen, J., Steinberger, R., and Varga, D. (2014). DCEP - Digital Corpus of the European Parliament. In LREC, pages 3164–3171.
Heafield, K. (2011). KenLM: Faster and Smaller Language Model Queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, number 2009, pages 187–197. Association for Computational Linguistics.
Junczys-Dowmunt, M., Dwojak, T., and Hoang, H. (2016). Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions. arXiv preprint arXiv:1610.01108.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Koehn, P. (2005). Europarl: a Parallel Corpus for Statistical Machine Translation. In MT summit, volume 5, pages 79–86.
Lommel, A. R. and DePalma, D. A. (2016). Europe's Leading Role in Machine Translation. Technical report, Common Sense Advisory.
Microsoft. (2015). Translation and UI Strings Glossaries.
Pinnis, M., Gornostay, T., Skadiņš, R., and Vasiļjevs, A. (2013). Online Platform for Extracting, Managing, and Utilising Multilingual Terminology. In Proceedings of the Third Biennial Conference on Electronic Lexicography, eLex 2013, pages 122–131, Tallinn, Estonia. Trojina, Institute for Applied Slovene Studies (Ljubljana, Slovenia) / Eesti Keele Instituut (Tallinn, Estonia).
Pinnis, M., Krišlauks, R., Deksne, D., and Miks, T. (2017a). Evaluation of Neural Machine Translation for Highly Inflected and Small Languages. In Proceedings of the 18th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2017), Budapest, Hungary.
Pinnis, M., Krišlauks, R., Deksne, D., and Miks, T. (2017b). Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data. In Proceedings of the 20th International Conference of Text, Speech and Dialogue (TSD2017), Prague, Czechia.
Pinnis, M., Krišlauks, R., Miks, T., Deksne, D., and Šics, V. (2017c). Tilde's Machine Translation Systems for WMT 2017. In Proceedings of the Second Conference on Machine Translation (WMT 2017), Volume 2: Shared Task Papers, pages 374–381, Copenhagen, Denmark. Association for Computational Linguistics.
Pinnis, M. (2013). Context Independent Term Mapper for European Languages. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2013), pages 562–570, Hissar, Bulgaria.
Pinnis, M. (2015). Dynamic Terminology Integration Methods in Statistical Machine Translation. In Proceedings of the Eighteenth Annual Conference of the European Association for Machine Translation (EAMT 2015), pages 89–96, Antalya, Turkey. European Association for Machine Translation.
Rozis, R. and Skadiņš, R. (2017). Tilde MODEL - Multilingual Open Data for EU Languages. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 263–265.
Sennrich, R. and Haddow, B. (2016). Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation (WMT 2016) - Volume 1: Research Papers, pages 83–91.
Sennrich, R., Haddow, B., and Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2015), Berlin, Germany. Association for Computational Linguistics.
Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Barone, A. V. M., Mokry, J., et al. (2017). Nematus: a Toolkit for Neural Machine Translation. arXiv preprint arXiv:1703.04357.
Shuyo, N. (2010). Language Detection Library for Java.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D. (2006). The JRC-Acquis: a Multilingual Aligned Parallel Corpus with 20+ Languages. arXiv preprint cs/0609058.
Steinberger, R., Eisele, A., Klocek, S., Pilos, S., and Schlüter, P. (2012). DGT-TM: a Freely Available Translation Memory in 22 Languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 454–459.
Tiedemann, J. (2009). News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In Recent advances in natural language processing, volume 5, pages 237–248.
Vasiļjevs, A., Skadiņš, R., and Tiedemann, J. (2012). LetsMT!: A Cloud-Based Platform for Do-It-Yourself Machine Translation. In Min Zhang, editor, Proceedings of the ACL 2012 System Demonstrations, number July, pages 43–48, Jeju Island, Korea. Association for Computational Linguistics.
Zariņa, I., Ņikiforovs, P., and Skadiņš, R. (2015). Word alignment based parallel corpora evaluation and cleaning using machine learning techniques. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation.
