A Study of Indonesian-to-Malaysian MT System

Septina Dian Larasati, Vladislav Kuboň


Institute of Formal and Applied Linguistics
Charles University
Prague, Czech Republic

Abstract – This paper presents ongoing work on the implementation of an MT system between Indonesian and Malaysian. The system uses an almost direct translation method that exploits the similarity of the two languages. The method was previously applied to a number of European language pairs. The paper also gives an overview of linguistic phenomena which can negatively influence the translation quality and suggests solutions for some of them.

Keywords – machine translation; related languages; direct translation; morphology; hybrid method

I. INTRODUCTION

Probably no other area of linguistic applications has attracted as much research effort as the automatic translation of texts between natural languages (a field usually called Machine Translation, MT). After more than fifty years of research, during which periods of uncritical expectations were followed by long periods of bitter despair, the application of stochastic methods brought new hope to a field which had notoriously failed to provide acceptable results. The stochastic methods rejected traditional rule-based approaches and replaced them with the exploitation of ever larger amounts of data. The lack of large-coverage grammars was replaced by a lack of parallel data.

Although expectations are nowadays very high yet again, it is clear that not even the current breakthrough brought by stochastic or hybrid approaches, e.g., the factored translation model described in [17], will solve all the problems, especially those of less-represented languages.

One property which makes the translation task easier is the relatedness of the source and target languages. Relatedness usually means a great deal of similarity at all levels, but past experiments (cf. the references further in the text) have shown that the most important level is syntax, closely followed by morphology.

This article describes an experiment in applying an existing model for MT between related languages to a new language pair from a very different language group. The architecture of the system is primarily rule-based and allows for a great deal of ambiguity in all steps. This ambiguity is then resolved by a simple stochastic ranking of all translation hypotheses. The simple architecture was originally developed for European languages (Slavic and Romance languages), and one of the main goals of this paper is to describe the issues encountered when applying the method to a pair of Asian languages which are typologically different from them.

Numerous experiments with related languages have recently been performed for various language groups:
• for Slavic languages in [12] and [16],
• for Scandinavian languages in [3], [6], and [13],
• for Turkic languages in [10],
• and for languages of Spain in [1].

The close relatedness of natural languages from one typological group (and sometimes even across group borders, cf. the Czech-to-Lithuanian experiment described in [8]) makes the translation task easier and thus allows the application of methods which would not be good enough for the translation of unrelated language pairs. Using simpler methods does not mean a lower translation quality: many translation errors result from imperfect attempts to parse the source language fully, in some cases even to the deep syntactic level of representation. The accumulation of errors in parsing, transfer and generation in systems using the classical transfer-based architecture substantially decreases the translation quality.

II. TYPOLOGY OF THE LANGUAGES

Although both languages are spoken by millions of people, this language pair has received far less research attention than most European languages. This makes the two closely related languages compelling to explore. Both belong to the Austronesian language family and behave similarly, which often leads non-native speakers to the mistaken belief that the two are mutually intelligible. The languages are very dynamic, and their evolution makes them diverge from one another.

Both of these agglutinative languages have similar morphological mechanisms and share some words, both words with the exact or a similar meaning and also words with

This work was supported by the Erasmus Mundus Master Program Language and Communication Technologies and partially supported by the grant MSM 0021620838 of the MŠMT ČR.
different meanings that can be misinterpreted by native speakers of both languages. An example is the word ‘kereta’, which means ‘car’ in Malaysian and ‘train’ in Indonesian. The word is inflected in the same way in both languages, e.g., ‘berkereta’ means ‘having a car’ in Malaysian and ‘having a train’ in Indonesian. Given this background, the language pair is well suited to the shallow rule-based MT method.

Orthography – Both languages use the basic modern Latin alphabet, with a hyphen separating words in reduplication and in special clitic cases.

Word Order – The word order is fixed, and the position of a word in the sentence is essential for determining its role.

Tense – The languages have no inflectional tense marking. Tense is expressed by additional words or by temporal information in the sentence.

Voice – Voice is marked by different prefixes on the inflected word.

Gender – Gender classification is uncommon, although it occurs in some irregular cases marked by several suffixes. This usage is now rare and no longer productive.

Number – Plurality is found not only on nouns but also on other parts of speech (POS), where it marks plurality of the action or reference to plural entities.

III. ARCHITECTURE OF THE SYSTEM

Most of the systems mentioned in the introduction try to exploit the similarity of closely related languages. This can apparently be done only if the system architecture is reasonably simple. The more complicated the architecture, the more errors are introduced into the translation process by individual modules. These errors then negate the advantage of working with closely related languages.

The most successful architecture for simple MT systems was developed for the system Česílko [7] and is also used by the system Apertium [1]. The fact that Apertium is an open-source platform, and can thus easily be adopted for experiments with other language pairs, led us to use it for our experiments with two Southeast Asian languages, Malaysian and Indonesian.

As mentioned above, the architecture of Česílko and Apertium is relatively simple. Both are in fact transfer-based systems, with the transfer performed either at the morphological or at the shallow syntactic level (depending on the degree of syntactic similarity of the source and target languages). The role of morphology in such a system is crucial.

The Indonesian-to-Malaysian MT system is implemented on Apertium (http://www.apertium.org), a free/open-source platform for developing rule-based machine translation systems [15]. The platform is a shallow-transfer, word-to-word machine translation engine designed to produce fast, reasonably intelligible and easily correctable translations, not only between related languages but also, by extension, between language pairs which are not closely related.

Apertium has a modular architecture [2], and for each module it provides several tool options depending on the nature of the languages. In this MT system some modules of the original setting are skipped. The modules that are kept are the following:

Morphological Analyser – The surface forms are segmented and each form is analysed to obtain the lexical unit: the lemma, Part-of-Speech tags and morphological inflection information. Apertium offers various morphological analysis tools that accommodate languages of different natures. For this language pair, the morphological analyser is written in the Xerox finite-state formalisms xfst and lexc (the latter a high-level declarative language for specifying a lexicon) and compiled with Foma (http://foma.sourceforge.net/) [14], a finite-state toolkit that implements Xerox xfst and lexc. This module also includes the source-language monolingual dictionary.

Part-of-Speech Tagger – Trained on a text corpus and a tagger definition file to disambiguate the analyses.

Lexical Transfer and Structural Transfer – Reads the analysis of each source-language word and transfers it into the target language using the bilingual dictionary. Structural transfer between the source and target languages can be done in up to three stages (Chunker, Interchunk, and Postchunk), depending on need. This MT system uses only a one-stage transfer.

Morphological Generator – The reverse of the Morphological Analyser: it turns the analyses back into surface forms.

Ranker – Added to choose the best translation hypothesis statistically.

Figure 1. MT System Modular Architecture
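The chain of modules described in this section (analyser, tagger, transfer, generator, ranker) can be illustrated with a minimal Python sketch. It is not Apertium code and does not use any Apertium API. The names ANALYSES, BIDIX, TAG_PREFERENCE and TARGET_COUNTS are hypothetical, the tagger is replaced by a fixed preference list instead of a trained model, and the ranker is a toy unigram score. Only the bilingual pairs marked "Fig. 4" are taken from the paper.

```python
from itertools import product
from math import log

# Toy resources. Pairs marked "Fig. 4" mirror the paper's bilingual entries;
# everything else is a hypothetical placeholder for illustration only.
ANALYSES = {
    "apabila": [("apabila", "cnjsub")],
    "piranti~lunak": [("piranti~lunak", "n")],
    "ini": [("ini", "det")],
    "bisa": [("bisa", "n"), ("bisa", "mod")],  # ambiguous, as in Figure 3
}
BIDIX = {
    ("apabila", "cnjsub"): ["jika", "apabila"],  # "jika": Fig. 4; 2nd is hypothetical
    ("piranti~lunak", "n"): ["perisian"],        # Fig. 4
    ("ini", "det"): ["ini"],                     # Fig. 4
}
TAG_PREFERENCE = ["mod", "cnjsub", "det", "n"]   # stand-in for a trained tagger
TARGET_COUNTS = {"jika": 5, "apabila": 2, "perisian": 3, "ini": 9}  # made-up counts


def analyse(token):
    """Morphological analysis: surface form -> list of (lemma, POS) candidates."""
    return ANALYSES.get(token, [(token, "unk")])


def tag(candidates):
    """Disambiguation: keep the candidate whose POS comes first in TAG_PREFERENCE."""
    def rank(cand):
        pos = cand[1]
        return TAG_PREFERENCE.index(pos) if pos in TAG_PREFERENCE else len(TAG_PREFERENCE)
    return min(candidates, key=rank)


def transfer(lemma, pos):
    """Lexical transfer: bilingual lookup; unknown lemmas are copied through."""
    return [(target, pos) for target in BIDIX.get((lemma, pos), [lemma])]


def generate(lemma, pos):
    """Morphological generation: trivial here because the toy input is uninflected."""
    return lemma


def score(words):
    """Add-one style unigram log score over the toy target counts."""
    total = sum(TARGET_COUNTS.values()) + len(TARGET_COUNTS)
    return sum(log((TARGET_COUNTS.get(w, 0) + 1) / total) for w in words)


def translate(sentence):
    """Run analyser -> tagger -> transfer -> generator, then rank all hypotheses."""
    options = []
    for token in sentence.split():
        lemma, pos = tag(analyse(token))
        options.append([generate(t, p) for t, p in transfer(lemma, pos)])
    hypotheses = [list(h) for h in product(*options)]
    return " ".join(max(hypotheses, key=score))


print(translate("apabila piranti~lunak ini"))  # prints: jika perisian ini
```

Because the sentence structure of the two languages is similar, the sketch translates word by word without reordering. Ambiguity is carried forward as a set of alternatives and only resolved at the end by the ranker, which is the general idea behind the architecture, under the simplifications stated above.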
IV. MORPHOLOGICAL ANALYSIS AND GENERATION

Given the typology of the languages in question, most of the engineering effort falls on morphological analysis and generation rather than on the other parts. This section describes the morphological operations of the languages and then how analysis and generation are implemented.

A. Morphological Operations

The two languages have similar morphological mechanisms. We broke these down into four morphological operations that have to be handled:

1. Affixation. This operation includes prefixes, suffixes, and circumfixes. There are also a few infixes, which are now rarely used. These special cases are handled separately in the language resources (see Language Resources).

2. Reduplication. Reduplication can occur with any POS. It is divided into three types: full reduplication, partial reduplication and affixed reduplication. Partial reduplication is not handled in the morphological analyser but is treated as a dictionary entry.

3. Clitic. Enclitics and proclitics represent pronouns. A clitic can be kept as it is or restored to its corresponding independent pronoun; both ways are grammatically correct.

4. Particle. Particles mark stress and level of formality, and form question words.

Figure 2 shows the schema of the inflections around the lemma. Prefixes are divided into two kinds by position, called pre-prefix and prefix. Reduplication can occur almost anywhere in the affixed lemma.

Figure 2. Morphological Operations Schema

B. Morphological Tool

Since the morphological mechanisms are similar, we simply use the same morphological analyser for both languages. The most widely used tool for analysis and generation on the Apertium platform is Lttoolbox, a toolbox for lexical processing, morphological analysis and generation of words. It has been used for several language pairs, mostly for languages with suffixal inflection, for which Apertium was initially designed. It works by exhaustively defining the combinations of inflected forms that are possible in a language, called paradigms. We found that this tool cannot accommodate Indonesian and Malaysian morphology well because of several limitations:

• The treatment of morphemes that precede the base word is not straightforward. The analysis expected from this module has the form of a lemma followed by morphological tag(s), for example pesan<n><bare><sg>. The analysis, however, is produced at the position of the inflection, so the analysis of a prefix, i.e., its tag(s), ends up in front of the lemma. A separate reformatting step is therefore needed. Moreover, a circumfix is treated as an independent prefix and suffix.

• Morphophonemic alternations are handled by expanding a morpheme into all of its possible forms. For example, the pre-prefix ‘meN-’ is expanded into several different forms depending on the base word it attaches to: ‘menge-’ for one-syllable bases, ‘meng-’ for words starting with [a i u e o g h], ‘meny-’ for words starting with [s, y], and so on.

• The tool cannot handle reduplication.

To get around this, we decided not to use Lttoolbox and initially employed an existing Indonesian morphological analyser [4], which was developed in xfst/lexc. This tool already handles reduplication and Indonesian morphophonemics. To incorporate it into Apertium we compiled it in Foma, a finite-state toolkit.

This morphological analyser includes a large number of Indonesian lemmata, but unfortunately its coverage of inflection was not adequate for the task:

• It covers the morphological operations only partly. It handles reduplication and several affixations, but not clitics and particles. Inflected words with uncovered operations are left untranslated.

• The tagset is underspecified for generation. It consists of 17 general tags, which mostly mark the Part-of-Speech (POS) and the morphological operation that occurs. The POS tags distinguish only three POS types, namely Verb, Noun, and Adjective; all others are labelled Etc.

• Several inflected words receive the same analysis, which is unfavourable for translation since those different inflected words will be transferred to the same target analysis. For example, the noun derivations ‘kiriman’, ‘pengirim’ and ‘pengiriman’ from the verb ‘kirim’ all receive the analysis kirim+Noun.
• Also related to the tagset problem, the generation step generates a large number of inflected words, which in turn produce a larger number of translation hypotheses. For example, the analysis kirim+Noun generates the words shown in Table I.

TABLE I. PROBLEM IN THE ANALYSIS/GENERATION

Analysis Result
kiriman      >  kirim+Noun
pengirim     >  kirim+Noun
pengiriman   >  kirim+Noun

Generation Result
kirim+Noun   >  pengirim
kirim+Noun   >  pengiriman
kirim+Noun   >  *pemberkiriman
kirim+Noun   >  *perkiriman
kirim+Noun   >  *kepengiriman
kirim+Noun   >  *keberkiriman
kirim+Noun   >  *kekiriman
kirim+Noun   >  kiriman

*) marks the ungrammatical inflected words
#) marks the un-generated inflected words

Starting from this tool, we keep the part that handles morphophonemics and reduplication and build a morphological analyser with more extensive inflection coverage. We also introduce more fine-grained tags and change their form from +TAG to <TAG> to suit the Apertium platform.

TABLE II. MORPHOLOGICAL TAGSET

Tag          Description                                Tag Type
<adj>        adjective lemma                            POS
<n>          noun lemma                                 POS
<num>        number lemma                               POS
<prn>        pronoun                                    POS
<det>        determiner                                 POS
<cnjcoo>     coordinating conjunction                   POS
<cnjsub>     subordinating conjunction                  POS
<vblex>      verb lemma                                 POS
<part>       particle                                   POS
<mod>        modal                                      POS
<ij>         interjection                               POS
<qst>        question word                              POS
<pr>         preposition lemma                          POS
<p1>         first person                               PERSON
<p2>         second person                              PERSON
<p3>         third person                               PERSON
<sg>         singular                                   NUM
<pl>         plural                                     NUM
<card>       cardinal number                            DERNUM
<ord>        ordinal number                             DERNUM
<coll>       collective number                          DERNUM
<ref>        referential number                         DERNUM
<vbhaver>    verb ‘to have’                             VERBVAR
<vbser>      verb ‘to be’                               VERBVAR
<actv>       active voice                               VOICE
<pasv>       passive voice                              VOICE
<perf>       perfective aspect                          ASPECT
<imp>        imperfective aspect                        ASPECT
<bare>       bare noun                                  DERNOUN
<abstract>   derived abstract noun                      DERNOUN
<actio>      derived action noun                        DERNOUN
<actor>      derived actor noun                         DERNOUN
<ent>        derived entity noun                        DERNOUN
<theme>      derived theme noun                         DERNOUN
<positive>   bare adjective                             DERADJ
<sup>        superlative adjective                      DERADJ
<exceed>     adjective that shows something exceeding   DERADJ
<manner>     adjective that shows similar manner        DERADJ
<uni>        union adjective                            DERADJ
<possib>     adjectival phrase                          DERADJ
<enc>        enclitic                                   CLITIC
<pro>        proclitic                                  CLITIC
<appl>       applicative                                TRANSITIVITY
<caus>       causative                                  TRANSITIVITY
<cap>        capitalization mark                        MARK
<pos>        possessive mark                            MARK

Compared with the previous example, the analyses produced by the current morphological analyser are more precise.

TABLE III. CURRENT ANALYSIS/GENERATION

Analysis Result
kiriman      >  kirim<vblex><ent><sg>
pengirim     >  kirim<vblex><actor><sg>
pengiriman   >  kirim<vblex><actio><sg>

Generation Result
kirim<vblex><actor><sg>  >  pengirim
kirim<vblex><actio><sg>  >  pengiriman
                            *#pemberkiriman
                            *#perkiriman
                            *#kepengiriman
                            *#keberkiriman
                            *#kekiriman
kirim<vblex><ent><sg>    >  kiriman

*) marks the ungrammatical inflected words
#) marks the un-generated inflected words

Here is the analysis of the Indonesian sentence “apabila, sebelum mengunduh, menginstal, mengaktifkan atau menggunakan piranti lunak, anda memutuskan bahwa anda tidak bersedia untuk menyetujui ketentuan-ketentuan perjanjian ini, anda tidak bisa dan tidak berhak menggunakan piranti lunak ini” (“if, before downloading, installing, activating or using the software, you decide that you are unwilling to agree to the terms of this agreement, you cannot and do not have the right to use this software”).

^apabila/apabila<cnjsub>$
,
^sebelum/sebelum<cnjsub>$
^mengunduh/unduh<vblex><actv><imp><sg>$
,
^menginstal/instal<vblex><actv><imp><sg>$
,
^mengaktifkan/aktif<adj><actv><imp><caus><sg>$
^atau/atau<cnjcoo>$
^menggunakan/guna<n><actv><imp><caus><sg>$
^piranti~lunak/piranti~lunak<n><bare><sg>$
,
^anda/anda<prn><p2><sg>$
^memutuskan/putus<adj><actv><imp><caus><sg>$
^bahwa/bahwa<cnjsub>$
^anda/anda<prn><p2><sg>$
^tidak~bersedia/enggan<adj><positive>$
^untuk/untuk<pr>$
^menyetujui/setuju<vblex><actv><imp><appl><sg>$
^ketentuan-ketentuan/tentu<adj><abstract><pl>$
^perjanjian/janji<n><theme><sg>$
^ini/ini<det>$
,
^anda/anda<prn><p2><sg>$
^tidak/tidak<adv>$
^bisa/bisa<mod>/bisa<n><bare><sg>$
^dan/dan<cnjcoo>$
^tidak/tidak<adv>$
^berhak/hak<n><actv><perf><vbhaver><bare><sg>$
^menggunakan/guna<n><actv><imp><caus><sg>$
^piranti~lunak/piranti~lunak<n><bare><sg>$
^ini/ini<det>$

Figure 3. Analysis Example for Indonesian Sentence “apabila, sebelum mengunduh, menginstal, mengaktifkan atau menggunakan piranti lunak, anda memutuskan bahwa anda tidak bersedia untuk menyetujui ketentuan-ketentuan perjanjian ini, anda tidak bisa dan tidak berhak menggunakan piranti lunak ini”

The generation process is simply the reverse of the analysis: the surface forms are composed from the analyses.

V. DISAMBIGUATION

Although the morphological analysis has been extended to prevent ambiguities, cases such as homonyms still remain. The word ‘bisa’ in the previous analysis example (Figure 3) has two possible analyses, since it can be the modal verb ‘can/able to’ or the noun ‘snake venom’. Such multiple analyses are disambiguated statistically.

The disambiguation of the analyses is done in the POS tagger. Apertium provides several ways to train the tagger. We chose the target-language tagger training provided by Apertium [5]. This training process is relatively fast and well suited to our MT system, which has only a one-stage transfer. It trains the tagger using both the source and the target language. To do so we need a text corpus in the source and target languages, a tag definition file, and a running MT system. In the tag definition file we specify the tag sequences that are enforced or forbidden in the analysis. The analysis of the word ‘bisa’ in Figure 3 is disambiguated into

^bisa<mod>$

VI. TRANSFER

The translation into the target language takes place in the lexical and structural transfer. The analyses of the source language are transferred into the target language, from which the target surface forms are then generated.

The transfer between the two languages is done using transfer rules and the bilingual dictionary. The sentence structure of both languages is similar, so reordering is not required. We use Lttoolbox to maintain the bilingual dictionary.

<e><p><l>apabila<s n="cnjsub"/></l><r>jika<s n="cnjsub"/></r></p></e>
<e><p><l>sebelum<s n="cnjsub"/></l><r>sebelum<s n="cnjsub"/></r></p></e>
<e><p><l>unduh<s n="vblex"/></l><r>muatturunkan<s n="vblex"/></r></p></e>
<e><p><l>instal<s n="vblex"/></l><r>pasang<s n="vblex"/></r></p></e>
<e><p><l>aktif<s n="adj"/></l><r>aktif<s n="adj"/></r></p></e>
<e><p><l>atau<s n="cnjcoo"/></l><r>atau<s n="cnjcoo"/></r></p></e>
<e><p><l>guna<s n="n"/></l><r>guna<s n="n"/></r></p></e>
<e><p><l>piranti~lunak<s n="n"/></l><r>perisian<s n="n"/></r></p></e>
<e><p><l>anda<s n="prn"/></l><r>anda<s n="prn"/></r></p></e>
<e><p><l>putus<s n="adj"/></l><r>putus<s n="adj"/></r></p></e>
<e><p><l>bahwa<s n="cnjsub"/></l><r>bahawa<s n="cnjsub"/></r></p></e>
<e><p><l>enggan<s n="adj"/></l><r>enggan<s n="adj"/></r></p></e>
<e><p><l>untuk<s n="pr"/></l><r>untuk<s n="pr"/></r></p></e>
<e><p><l>setuju<s n="vblex"/></l><r>bersetuju<s n="vblex"/></r></p></e>
<e><p><l>tentu<s n="adj"/><s n="abstract"/><s n="pl"/></l><r>terma<s n="n"/><s n="bare"/><s n="pl"/></r></p></e>
<e><p><l>janji<s n="n"/></l><r>janji<s n="n"/></r></p></e>
<e><p><l>ini<s n="det"/></l><r>ini<s n="det"/></r></p></e>
<e><p><l>tidak<s n="adv"/></l><r>tidak<s n="adv"/></r></p></e>
<e><p><l>hak<s n="n"/></l><r>hak<s n="n"/></r></p></e>

Figure 4. Bilingual Dictionary Entries

The bilingual dictionary records the lemma and the necessary tags, such as the POS tag. Compound words are recorded as one entry; for example, “ibu kota”, which translates as ‘capital city’, is mapped to “ibu negara” (which in Indonesian would be misinterpreted as ‘first lady’).

<e><p><l>ibu~kota<s n="n"/></l><r>ibu~negara<s n="n"/></r></p></e>

Figure 5. Bilingual Dictionary Entries – Compound words

A preprocessing step adds a tilde ‘~’ character to join the parts of a compound word so that Foma handles it as a single word. This is because Foma currently does
not tokenize the sentence while doing the analysis, a functionality that other Apertium morphological tools, such as Lttoolbox and HFST, have.

VII. LANGUAGE RESOURCES

In the analysis and generation steps, monolingual dictionaries for both languages are needed. To build the Indonesian monolingual dictionary, we take the list of lemmata available in the earlier morphological analyser [4] and adapt it to the current setting. We keep only the lemmata tagged as Noun, Verb, or Adjective. Additionally, closed-class words such as prepositions and conjunctions are added and tagged. On the Malaysian side the problem is that we do not have a list of Malaysian lemmata like the one for Indonesian. We simply take the Malaysian entries from the bilingual dictionary.

An Indonesian-Malaysian bilingual dictionary is not yet available. To build a bilingual dictionary quickly and cheaply, we harvested a publicly available online dictionary and also generated entries from a parallel corpus. The construction process is as follows:

1) Online Dictionary. Several online dictionary websites are available. We query the site for each Indonesian lemma and take the translation if one is available. The source tag and the target tag are also recorded.

2) Statistical word pairing. Word pairs are also built using a statistical method. This is done by training on a small parallel corpus composed from several sources such as manuals, recipes, agreements, and holy books. The tool used is Moses (http://www.statmt.org/moses/) [18]. On the source-language side, the words are analysed to obtain their analysis forms (lemma and morphological tags), while the target side consists of sentences with words in their surface forms. After the word pairs are obtained, the morphemes of the words on the target side are stripped to obtain lemma-to-lemma pairs.

The results from both approaches are merged and hand-picked to maintain the quality of the translation.

VIII. CONCLUSIONS AND FUTURE WORK

Although the experiment described in this paper is still work in progress and we are at the current stage unable to provide a standard quality evaluation, there are already some results which may turn out to be important for further research.

First of all, the work on the system has led us to investigate both languages with regard to how certain phenomena may be handled from the point of view of machine translation, which phenomena may cause problems in a relatively straightforward system, etc.

Second, the relatively large amount of resources needed for building the individual modules of the system made us think about methods for obtaining them in reasonable quantity and quality. This turned out to be a challenge, especially because many more resources are available for the European languages used in previous experiments, where usually nothing has to be built from scratch. Building better resources will be the first task, which will probably help us improve the system in the future. Building the full pipeline of the system took much less of the development time than developing the resources, such as the morphological analyser and the dictionaries.

It would be interesting to build the MT system in the opposite direction, Malaysian to Indonesian, which appears to be largely symmetrical. Another challenging research direction would be to build an Indonesian/Malaysian-English MT system using this approach.

ACKNOWLEDGMENT

Thanks to Måns Huldén for his help in converting patent-encumbered and some other aspects of the Xerox syntax into Foma. Thanks to Francis Tyers for his support and his help in setting up the new Apertium language pair development environment.

REFERENCES

[1] A. M. Corbi-Bellot, M. L. Forcada, S. Ortiz-Rojas, J. A. Pérez-Ortiz, G. Ramirez-Sanchez, F. Sanchez-Martinez, I. Alegria, A. Mayor, and K. Sarasola, “An open-source shallow-transfer machine translation engine for the romance languages of Spain,” Proceedings of the Tenth Conference of the European Association for Machine Translation, pp. 79–86, May 2005.
[2] F. M. Tyers, F. Sánchez-Martínez, S. Ortiz-Rojas, and M. L. Forcada, “Free/open-source resources in the Apertium platform for machine translation research and development,” The Prague Bulletin of Mathematical Linguistics No. 93, pp. 67-76, 2010.
[3] F. M. Tyers, L. Wiechetek, and T. Trosterud, “Developing prototypes for machine translation between two Sámi languages,” Proceedings of the 13th Annual Conference of the European Association of Machine Translation, EAMT09, 2009.
[4] F. Pisceldo, R. Mahendra, R. Manurung, and I W. Arka, “A Two-Level Morphological Analyser for Indonesian,” Abstract submitted to the Australasian Language Technology (ALTA) Workshop 2008, Tasmania, 2008.
[5] F. Sánchez-Martínez, J. A. Pérez-Ortiz, and M. L. Forcada, “Using target-language information to train part-of-speech taggers for machine translation,” Machine Translation, volume 22, numbers 1-2, pp. 29-66.
[6] H. Dyvik, “Exploiting structural similarities in machine translation,” Computers and Humanities 28, pp. 225–245, 1995.
[7] J. Hajič, J. Hric, and V. Kuboň, “Machine translation of very close languages,” Proceedings of the 6th Applied Natural Language Processing Conference, 2000.
[8] J. Hajič, P. Homola, and V. Kuboň, “A simple multilingual machine translation system,” Proceedings of the MT Summit IX, New Orleans, 2003.
[9] J. Vičič, “Rapid development of data for shallow transfer RBMT translation systems for highly inflective languages,” Jezikovne tehnologije, language technologies: zbornik konference: proceedings of the conference, pp. 98–103, 2008.
[10] K. Altintas and I. Cicekli, “A machine translation system between a pair of closely related languages,” Proceedings of the 17th International Symposium on Computer and Information Sciences (ISCIS 2002), 2002.
[11] K. Oliva, “A Parser for Czech Implemented in Systems Q,” Explizite Beschreibung der Sprache und automatische Textbearbeitung XVI, MFF UK Prague, 1989.
[12] K. P. Scannell, “Machine translation for closely related language pairs,” Unknown, 2008.
[13] K. Unhammer and T. Trosterud, “Reuse of free resources in machine translation between Nynorsk and Bokmål,” Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, edited by J. A. Pérez-Ortiz, F. Sánchez-Martínez, and F. M. Tyers, pp. 35-42, Alicante: Universidad de Alicante, Departamento de Lenguajes y Sistemas Informáticos, 2009.
[14] M. Hulden, “Foma: a finite-state compiler and library,” Proceedings of
the 12th Conference of the European Chapter of the Association for
Computational Linguistics: Demonstrations Session, pp. 29-32, Athens,
Greece, April 03-03, 2009.
[15] M. L. Forcada, F. M. Tyers, and G. Ramírez-Sánchez, “The free/open-source machine translation platform Apertium: Five years on,” Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation FreeRBMT'09, pp. 3-10, November 2009.
[16] P. Homola and V. Kuboň, “A translation model for languages of
acceding countries,” Proceedings of the IX EAMT Workshop, La
Valetta, University of Malta, 2004.
[17] P. Koehn and H. Hoang, “Factored translation models,” Proceedings of
the 2007 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning (EMNLP-
CoNLL), pp. 868–876, 2007.
[18] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N.
Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A.
Constantin, E. Herbst, “Moses: Open Source Toolkit for Statistical
Machine Translation,” Annual Meeting of the Association for
Computational Linguistics (ACL): Demonstration session, Prague,
Czech Republic, June 2007.
[19] S. Marinov, “Structural Similarities in MT: A Bulgarian-Polish case,”
unknown, 2003.
