A Study of Indonesian-To-Malaysian MT System
Abstract—The paper presents ongoing work on the implementation of an MT system between Indonesian and Malaysian. The system uses a method of almost direct translation that exploits the similarity of the two languages. This method was previously used on a number of European language pairs. The paper also gives an overview of linguistic phenomena which can negatively influence the translation quality and suggests a solution for some of them.

Keywords-machine translation; related languages; direct translation; morphology; hybrid method

I. INTRODUCTION
Probably no other linguistic application area has attracted as much research effort as the automatic translation of texts between natural languages (a field usually called Machine Translation, MT). After more than fifty years of research, during which periods of uncritical expectations were followed by long periods of bitter despair, the application of stochastic methods brought new hope into a field which had notoriously failed to provide acceptable results. The stochastic methods rejected traditional rule-based approaches and replaced them by the exploitation of ever larger amounts of data. The lack of large-coverage grammars was replaced by a lack of parallel data.

Although nowadays the expectations are yet again very high, it is clear that not even the current breakthrough caused by stochastic or hybrid approaches, e.g., the factored translation model described in [17], will solve all the problems, especially the problems of less represented languages.

One property which makes the translation task easier is the relatedness of the source and target languages. Relatedness usually means a great deal of similarity at all levels, but the experiments carried out in the past (cf. the references further in the text) have shown that the most important level is the level of syntax, closely followed by morphology.

This article describes an experiment with the application of an existing model for MT between related languages to a new language pair from a very different language group. The architecture of the system is based primarily on a rule-based approach which allows for a great deal of ambiguity in all steps. This ambiguity is then resolved by a simple stochastic ranking of all translation hypotheses. The simple architecture was originally developed for European languages, and one of the main goals of this paper is to describe the issues encountered when applying the method to a pair of Asian languages which are typologically different from the European languages for which the method was originally developed (Slavic and Romance languages).

If we look at the experiments made so far for related languages, we find numerous experiments performed recently for various language groups:
for Slavic languages in [12] and [16],
for Scandinavian languages in [3], [6], and [13],
for Turkic languages in [10],
and for languages of Spain in [1].

The close relatedness of natural languages from one typological group (and sometimes even across group borders, cf. the Czech-to-Lithuanian experiment described in [8]) makes the translation task easier, thus allowing for the application of methods which would not be good enough for the translation of unrelated language pairs. Using simpler methods does not mean a lower translation quality: many of the translation errors result from imperfect attempts to parse the source language fully, in some cases even to the deep syntactic level of representation. The accumulation of errors in parsing, transfer and generation in systems using the classical transfer-based architecture substantially decreases the translation quality.

II. TYPOLOGY OF THE LANGUAGE
Although both languages are spoken by millions of speakers, this language pair has received far less research attention than most European languages. This makes these two closely related languages very compelling to explore. Coming from the same language family, Austronesian, the languages behave similarly, which often leads non-natives to the mistaken belief that they are mutually intelligible. The languages are very dynamic, and their evolution makes them differ from one another.

Both of these agglutinative languages have similar morphological mechanisms and share some words, both the words with exact or similar meaning and also the words with
Orthography – The alphabet is the basic modern Latin alphabet, with the hyphen used to separate words in reduplication and in special clitic cases.
Word Order – The word order is fixed and the position in the sentence is essential to determine the role of the word in the sentence.
Tense – The languages have no inflectional tense marking. Tense is marked by an additional word or by temporal information in the sentence.
Voice – Sentence voice is marked by different prefixes on the inflected word.
Gender – Gender classification is not common, although it occurs in some irregular cases marked by several suffixes. This pattern is now rarely used and no longer productive.
Number – Plurality is found not only in nouns but also in other parts of speech (POS), where it marks the plurality of the action or refers to plural entities.

Morphological Analyser – The surface forms are segmented and each form is analyzed to obtain the lexical unit, i.e. the lemma, Part-of-Speech tags and morphological inflection information. Apertium offers various morphological analysis tools that can accommodate languages of different nature. For this particular language pair, the morphological analyser is developed with the Xerox finite-state tools (XFST) and a high-level declarative language for specifying the lexicon (LEXC), and is then compiled in Foma (http://foma.sourceforge.net/) [14], a finite-state toolkit that implements Xerox xfst and lexc. This module includes the source language monolingual dictionary as well.
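
As an illustration of what the analyser delivers to the later modules, the following minimal Python sketch reads one lexical unit in the Apertium stream notation used in this paper (`^surface/lemma<tag>...$`, with alternative analyses separated by `/`) and splits it into the surface form and its readings. The names `Reading` and `parse_lexical_unit` are our own illustrative choices and not part of Apertium or Foma, and the sketch deliberately ignores escaping and other details of the real stream format.

```python
import re
from typing import List, NamedTuple, Tuple


class Reading(NamedTuple):
    """One analysis of a surface form: a lemma plus its ordered tags."""
    lemma: str
    tags: Tuple[str, ...]


# A tag is anything enclosed in angle brackets, e.g. <n> or <sg>.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_lexical_unit(unit: str) -> Tuple[str, List[Reading]]:
    """Split '^surface/lemma<tag>.../lemma<tag>...$' into surface and readings.

    Simplifying assumption: no escaped '/', '^' or '$' inside the unit.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    surface, *analyses = unit[1:-1].split("/")
    readings = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]  # text before the first tag
        readings.append(Reading(lemma, tuple(TAG_RE.findall(analysis))))
    return surface, readings


if __name__ == "__main__":
    # An ambiguous form keeps one reading per analysis.
    surface, readings = parse_lexical_unit("^bisa/bisa<mod>/bisa<n><bare><sg>$")
    print(surface)
    for reading in readings:
        print("  ", reading.lemma, reading.tags)
```

Keeping every reading at this stage matches the design of the system, in which ambiguity is allowed in each step and resolved later.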
Starting from that, we take the part which handles morphophonemics and reduplication. We then build a morphological analyser with more extensive inflection coverage. We also introduce more fine-grained tags and change the tag format from +TAG to <TAG> to suit the Apertium platform.

TABLE II. MORPHOLOGICAL TAGSET
Tag        Description                 Tag Type
<adj>      adjective lemma             POS
<n>        noun lemma                  POS
<num>      number lemma                POS
<prn>      pronoun                     POS
<det>      determiner                  POS
<cnjcoo>   coordinating conjunction    POS
<cnjsub>   subordinating conjunction   POS
<vblex>    verb lemma                  POS
<part>     particle                    POS
<mod>      modal                       POS
<ij>       interjection                POS
<qst>      question word               POS
<pr>       preposition lemma           POS
<p1>       first person                PERSON
<p2>       second person               PERSON
<p3>       third person                PERSON
<sg>       singular                    NUM
<pl>       plural                      NUM
<card>     cardinal number             DERNUM
<ord>      ordinal number              DERNUM
<coll>     collective number           DERNUM
<ref>      referential number          DERNUM
<vbhaver>  verb “to have”              VERBVAR
<vbser>    verb “to be”                VERBVAR
<actv>     active voice                VOICE
<pasv>     passive voice               VOICE
<perf>     perfective aspect           ASPECT
<imp>      imperfective aspect         ASPECT
<bare>     bare noun                   DERNOUN

Generation                      Result
kirim<vblex><actor><sg>    >    pengirim
kirim<vblex><actio><sg>    >    pengiriman
                                *#pemberkiriman
                                *#perkiriman
                                *#kepengiriman
                                *#keberkiriman
                                *#kekiriman
kirim<vblex><ent><sg>      >    kiriman
*) marks the ungrammatical inflected words
#) marks the un-generated inflected words

Here is the analysis for the Indonesian sentence “apabila, sebelum mengunduh, menginstal, mengaktifkan atau menggunakan piranti lunak, anda memutuskan bahwa anda tidak bersedia untuk menyetujui ketentuan-ketentuan perjanjian ini, anda tidak bisa dan tidak berhak menggunakan piranti lunak ini” (“if, before downloading, installing, activating or using the software, you decide that you are unwilling to agree to the terms of this agreement, you cannot and do not have the right to use this software”).

^apabila/apabila<cnjsub>$
,
^sebelum/sebelum<cnjsub>$
^mengunduh/unduh<vblex><actv><imp><sg>$
,
^menginstal/instal<vblex><actv><imp><sg>$
,
^mengaktifkan/aktif<adj><actv><imp><caus><sg>$
^atau/atau<cnjcoo>$
^menggunakan/guna<n><actv><imp><caus><sg>$
^piranti~lunak/piranti~lunak<n><bare><sg>$
,
^anda/anda<prn><p2><sg>$
^memutuskan/putus<adj><actv><imp><caus><sg>$
^bahwa/bahwa<cnjsub>$
^anda/anda<prn><p2><sg>$
^tidak~bersedia/enggan<adj><positive>$
^untuk/untuk<pr>$
^menyetujui/setuju<vblex><actv><imp><appl><sg>$
^ketentuan-ketentuan/tentu<adj><abstract><pl>$
^perjanjian/janji<n><theme><sg>$
^ini/ini<det>$
,
^anda/anda<prn><p2><sg>$
^tidak/tidak<adv>$
^bisa/bisa<mod>/bisa<n><bare><sg>$
^dan/dan<cnjcoo>$
^tidak/tidak<adv>$
^berhak/hak<n><actv><perf><vbhaver><bare><sg>$
^menggunakan/guna<n><actv><imp><caus><sg>$
^piranti~lunak/piranti~lunak<n><bare><sg>$
^ini/ini<det>$

Figure 3. Analysis Example for Indonesian Sentence “apabila, sebelum mengunduh, menginstal, mengaktifkan atau menggunakan piranti lunak, anda memutuskan bahwa anda tidak bersedia untuk menyetujui ketentuan-ketentuan perjanjian ini, anda tidak bisa dan tidak berhak menggunakan piranti lunak ini”

The generation process simply runs in the opposite direction of the analysis: the surface forms are composed from the analyses.

V. DISAMBIGUATION
Although the morphological analysis has been expanded to prevent ambiguities, cases such as homonyms still remain. The word “bisa” in the previous analysis example (Figure 3) has two possible analyses, since the same form stands for “can/able to”, a modal verb, and “snake venom”, a noun. Such multiple analyses are disambiguated statistically.

The disambiguation of the analyses is done in the POS tagger. Apertium provides several ways to train the tagger. We chose the target-language tagger training provided by Apertium [5]. This training process is relatively fast and is suitable for our MT system, which has only a one-stage transfer. It trains the tagger using both the source and the target language. To do that, we need a text corpus in the source and target languages, a tag definition file, and a running MT system. In the tag definition file we specify the tag sequences that are enforced or forbidden in the analysis. The analysis of the word “bisa” in Figure 3 is disambiguated into

^bisa<mod>$

VI. TRANSFER
The translation to the target language takes place in the lexical and structural transfers. The analyses of the source language are transferred into the target language and then generated into the target surface forms.

The transfer between the two languages is done using transfer rules and a bilingual dictionary. The sentence structure of the two languages is similar, so reordering is not required. We use Lttoolbox to keep the bilingual dictionary.

<e><p><l>apabila<s n="cnjsub"/></l>
      <r>jika<s n="cnjsub"/></r></p></e>
<e><p><l>sebelum<s n="cnjsub"/></l>
      <r>sebelum<s n="cnjsub"/></r></p></e>
<e><p><l>unduh<s n="vblex"/></l>
      <r>muatturunkan<s n="vblex"/></r></p></e>
<e><p><l>instal<s n="vblex"/></l>
      <r>pasang<s n="vblex"/></r></p></e>
<e><p><l>aktif<s n="adj"/></l>
      <r>aktif<s n="adj"/></r></p></e>
<e><p><l>atau<s n="cnjcoo"/></l>
      <r>atau<s n="cnjcoo"/></r></p></e>
<e><p><l>guna<s n="n"/></l>
      <r>guna<s n="n"/></r></p></e>
<e><p><l>piranti~lunak<s n="n"/></l>
      <r>perisian<s n="n"/></r></p></e>
<e><p><l>anda<s n="prn"/></l>
      <r>anda<s n="prn"/></r></p></e>
<e><p><l>putus<s n="adj"/></l>
      <r>putus<s n="adj"/></r></p></e>
<e><p><l>bahwa<s n="cnjsub"/></l>
      <r>bahawa<s n="cnjsub"/></r></p></e>
<e><p><l>enggan<s n="adj"/></l>
      <r>enggan<s n="adj"/></r></p></e>
<e><p><l>untuk<s n="pr"/></l>
      <r>untuk<s n="pr"/></r></p></e>
<e><p><l>setuju<s n="vblex"/></l>
      <r>bersetuju<s n="vblex"/></r></p></e>
<e><p><l>tentu<s n="adj"/><s n="abstract"/><s n="pl"/></l>
      <r>terma<s n="n"/><s n="bare"/><s n="pl"/></r></p></e>
<e><p><l>janji<s n="n"/></l>
      <r>janji<s n="n"/></r></p></e>
<e><p><l>ini<s n="det"/></l>
      <r>ini<s n="det"/></r></p></e>
<e><p><l>tidak<s n="adv"/></l>
      <r>tidak<s n="adv"/></r></p></e>
<e><p><l>hak<s n="n"/></l>
      <r>hak<s n="n"/></r></p></e>

Figure 4. Bilingual Dictionary Entries

The bilingual dictionary records the lemma and the necessary tags, such as the POS tag. Compound words are recorded as one entry; for example, the word “ibu kota” (capital city) is mapped to “ibu negara” (which in Indonesian would be misinterpreted as “first lady”).

<e><p><l>ibu~kota<s n="n"/></l>
      <r>ibu~negara<s n="n"/></r></p></e>

Figure 5. Bilingual Dictionary Entries – Compound words

A preprocessing step adds a tilde “~” character to join the words of a compound so that Foma will handle it as a single word. This is because currently Foma does
not tokenize the sentence while doing the analysis, a functionality that other Apertium morphological tools, such as Lttoolbox and HFST, have.

VII. LANGUAGE RESOURCES
In the analysis and generation steps, monolingual dictionaries for both languages are needed. To build the Indonesian monolingual dictionary, we take the list of lemmata available from the previous Morphological Analyser [4] and adapt it to the current setting. We keep only the lemmata that are tagged as noun, verb, or adjective. Additionally, closed-class entries such as prepositions or conjunctions are added and tagged. The problem on the Malaysian side is that we do not have a list of Malaysian lemmata as we have on the Indonesian side. We simply take the Malaysian entries from the bilingual dictionary.

An Indonesian-Malaysian dictionary is not yet available. To build a fast and cheap bilingual dictionary, we collected entries from a publicly available online dictionary and also generated entries from a parallel corpus. The construction process is as follows:

1) Online dictionary. Several online dictionary websites are available. We query the site for each Indonesian lemma and take the translation if available. The source tag and the target tag are also recorded.
2) Statistical word pairing. Word pairs are also built using a statistical method. This is done by training on a small parallel corpus composed from several sources such as manuals, recipes, agreements, and holy books. The tool used is Moses (http://www.statmt.org/moses/) [18]. On the source language side, the words are analyzed to obtain the analysis forms (lemma and morphological tags), while the target side is composed of sentences with words in surface forms. After we obtain the word pairs, the morphemes of the words on the target side are stripped. This is done to get lemma-to-lemma pairs.

The results from both approaches are merged and hand-picked to retain the quality of the translation.

VIII. CONCLUSIONS AND FUTURE WORK
Although the experiment described in the paper is still work in progress and we are at the current stage unable to provide a standard quality evaluation, there are already some results which may turn out to be important for further research.

First of all, the work on the system has led us to investigate how certain phenomena in both languages may be handled from the point of view of machine translation, which phenomena may cause problems in a relatively straightforward system, etc.

Second, the relatively large number of resources needed for building the individual modules of the system made us think about methods for obtaining them in reasonable quantity and quality. This turned out to be a challenge, especially because for the European languages used in previous experiments many more resources are available and nothing usually has to be built from scratch. Building better resources will be the first task, which will probably help us improve the system in the future. Building the full pipeline of the system did not take most of the development time when compared to the effort spent on developing resources such as the morphological analyser and the dictionaries.

It would be interesting to build the MT system in the opposite direction, Malaysian to Indonesian, which appears to be largely symmetrical. Another challenging research direction would be to build an Indonesian/Malaysian-English MT system using this approach.

ACKNOWLEDGMENT
Thanks to Måns Huldén for his help in converting patent-encumbered and some other aspects of the Xerox syntax into Foma. Thanks to Francis Tyers for his support and his help in setting up the new Apertium language pair development environment.

REFERENCES
[1] A. M. Corbi-Bellot, M. L. Forcada, S. Ortiz-Rojas, J. A. Perez-Ortiz, G.
Ramirez-Sanchez, F. Sanchez-Martinez, I. Alegria, A. Mayor, and K.
Sarasola, “An open-source shallow-transfer machine translation engine
for the Romance languages of Spain,” Proceedings of the Tenth
Conference of the European Association for Machine Translation, pp.
79–86, May 2005.
[2] F. M. Tyers, F. Sánchez-Martínez, S. Ortiz-Rojas, and M. L. Forcada,
“Free/open-source resources in the Apertium platform for machine
translation research and development,” The Prague Bulletin of
Mathematical Linguistics No. 93, pp. 67-76, 2010.
[3] F. M. Tyers, L. Wiechetek, and T. Trosterud, “Developing prototypes
for machine translation between two Sámi languages,”
Proceedings of the 13th Annual Conference of the European Association
of Machine Translation, EAMT09, 2009.
[4] F. Pisceldo, R. Mahendra, R. Manurung, and I W. Arka, “A Two-Level
Morphological Analyser for Indonesian,” Abstract submitted to the
Australasian Language Technology (ALTA) Workshop 2008, Tasmania,
2008.
[5] F. Sánchez-Martínez, J. A. Pérez-Ortiz, and M. L. Forcada, “Using
target-language information to train part-of-speech taggers for machine
translation,” Machine Translation, volume 22, numbers 1-2, pp. 29-66.
[6] H. Dyvik, “Exploiting structural similarities in machine translation,”
Computers and Humanities 28, pp. 225–245, 1995.
[7] J. Hajič, J. Hric, and V. Kuboň, “Machine translation of very close
languages,” Proceedings of the 6th Applied Natural Language
Processing Conference, 2000.
[8] J. Hajič, P. Homola, and V. Kuboň, “A simple multilingual machine
translation system,” Proceedings of the MT Summit IX, New Orleans,
2003.
[9] J. Vičič, “Rapid development of data for shallow transfer rbmt
translation systems for highly inflective languages,” Jezikovne
tehnologije, language technologies : zbornik konference : proceedings of
the conference, pp. 98–103, 2008.
[10] K. Altintas and I. Cicekli, “A machine translation system between a pair
of closely related languages,” Proceedings of the 17th International
Symposium on Computer and Information Sciences (ISCIS 2002), 2002.
[11] K. Oliva, “A Parser for Czech Implemented in Systems Q,” Explizite
Beschreibung der Sprache und automatische Textbearbeitung XVI, MFF
UK Prague, 1989.
[12] K. P. Scanell, “Machine translation for closely related language pairs,”
Unknown, 2008.
[13] K. Unhammer and T. Trosterud, “Reuse of free resources in machine
translation between Nynorsk and Bokmål,” Proceedings of the First
International Workshop on Free/Open-Source Rule-Based Machine
Translation / Edited by J. A. Pérez-Ortiz, F. Sánchez-Martínez, F. M.
Tyers, pp. 35-42, Alicante : Universidad de Alicante, Departamento de
Lenguajes y Sistemas Informáticos, 2009.
[14] M. Hulden, “Foma: a finite-state compiler and library,” Proceedings of
the 12th Conference of the European Chapter of the Association for
Computational Linguistics: Demonstrations Session, pp. 29-32, Athens,
Greece, April 03-03, 2009.
[15] M. L. Forcada, F. M. Tyers, and G. Ramírez-Sánchez, “The
free/opensource machine translation platform Apertium: Five years on,”
Proceedings of the First International Workshop on Free/Open-Source
Rule-Based Machine Translation FreeRBMT'09, pp. 3-10, November
2009.
[16] P. Homola and V. Kuboň, “A translation model for languages of
acceding countries,” Proceedings of the IX EAMT Workshop, La
Valetta, University of Malta, 2004.
[17] P. Koehn and H. Hoang, “Factored translation models,” Proceedings of
the 2007 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning (EMNLP-
CoNLL), pp. 868–876, 2007.
[18] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N.
Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A.
Constantin, E. Herbst, “Moses: Open Source Toolkit for Statistical
Machine Translation,” Annual Meeting of the Association for
Computational Linguistics (ACL): Demonstration session, Prague,
Czech Republic, June 2007.
[19] S. Marinov, “Structural Similarities in MT: A Bulgarian-Polish case,”
unknown, 2003.