
A Rule-Based English To Arabic Machine Translation Approach: December 2015



Conference Paper · December 2015



The International Arab Conference on Information Technology (ACIT’2015)

A Rule-based English to Arabic Machine Translation Approach
Ahmad Farhat and Ahmad Al-Taani
Department of Computer Science, Yarmouk University, Irbid, Jordan

Abstract: In this study, we propose a rule-based English-to-Arabic machine translation (MT) system for translating simple English declarative sentences into well-structured Arabic sentences. The proposed system translates sentences containing gerunds, infinitives, prepositions, and direct and indirect objects. The system is implemented using a bilingual dictionary designed in SQL Server. A major goal is for the system to be usable as a stand-alone tool and to be integrable with general English-Arabic machine translation systems. The proposed system was evaluated on 70 varied simple English declarative sentences written by English language experts. Experimental results showed the effectiveness of the proposed MT system in translating simple English declarative sentences into Arabic. The results were compared with two well-known commercial systems, Google Translate and Systran. The proposed system reached an accuracy of 85.71%, while Google reached 31.42% and Systran 20% on the same test sample.

Keywords: Machine translation, Rule-based approach, Bilingual dictionary, Natural Language Processing.

1. Introduction

Since the middle of the last century, and particularly in the last ten years, research in Machine Translation (MT) has grown rapidly. Various MT systems have been developed in Europe, the USA and the Far East, but these systems principally involve European languages [5]. Comparatively little work has been done on MT systems involving Arabic as either the source or the target language. Incorporating Arabic into MT systems is clearly important, not only for economic and trade reasons, but also for social and cultural ones.

Arabic Natural Language Processing has long been a focus of research aimed at an automated understanding of the Arabic language [3]. Arabic is a highly inflectional language with a rich morphology, a relatively free word order, and two types of sentences: nominal and verbal [10]. English is a universal language that is widely used in the media, commerce, science, technology, and education. The amount of modern English content (e.g. literature and web content) is larger than the amount of Arabic content available. Consequently, English-to-Arabic MT is particularly important, and the existing systems are mainly transfer-based.

The related work shows that English-Arabic MT research has concentrated on rule-based and corpus-based approaches. It also shows that little work has been done on Arabic as a target language.

Rule-based MT (RBMT) has several advantages over the corpus-based approaches, which are among the most widely explored areas in MT [13]. These include:

1. RBMT systems tend to produce better translations from a syntactic point of view [9].
2. RBMT systems deal with long-distance dependencies, agreement and constituent reordering in a more principled way, since they perform the analysis, transfer and generation steps based on morphological knowledge [9].
3. RBMT systems need fewer resources than the corpus-based approaches, which need very large corpora [11].
4. RBMT systems are extensible and maintainable [7].

On the other hand, RBMT systems have problems with lexical selection because they model word-level translation preferences poorly [4]. Furthermore, if the input sentence cannot be parsed, either because of the limitations of the parser or because the sentence is ungrammatical, the translation may fail or produce very low-quality results [6]. The literature also shows that Example-based MT (EBMT), which is one of the corpus-based approaches, can close this gap.

According to Groves et al. [4], Example-based machine translation has one main advantage over rule-based approaches: it is usually better at lexical selection and fluency, since it models lexical choice with distributional principles and explicit probabilistic
language models trained on very large corpora. Furthermore, it can be implemented in systems which are not EBMT systems themselves.

Researchers have stated that agreement and word ordering are the main problems in MT and strongly affect the quality of sentences translated from English to Arabic. Agreement is a main property of language; it occurs when two words in the appropriate pattern exhibit morphology consistent with their co-occurrence. In English, the main case of this linguistic mechanism is number agreement between a subject and a verb [2]. In this study we attempt to handle several kinds of agreement, such as adjective-noun agreement, verb-subject agreement, and pronoun agreement.

More than two-thirds of the linguistic effort in analyzing English is spent on morphology [10]. In most existing systems, incorrect translations occur because agreement and word reordering problems still exist. In this research, we propose an MT system to deal with agreement and reordering problems. The proposed approach can be extended to include other types of English sentences.

2. Methodology

The proposed methodology is flexible and scalable. Its main advantages are, first, that it depends on morphological issues that are mainly based on English-Arabic translation rules and, second, that it can be applied to several different languages.

The proposed machine translation system uses the transfer-based method. We preferred this method to the alternatives for two reasons. With the direct method, translation is based on dictionaries and word-by-word translation with some grammatical adjustment; there is no parsing, so it is not sufficient for developing the desired MT system. The Interlingua method, in turn, goes beyond what the desired system needs, because it is mainly relevant to multilingual machine translation and emphasizes a single representation for different languages.

In general, the flow of the transfer-based approach is as follows. It begins with the analyzer, which takes the English sentence (source text) to be translated and produces a POS tag for every word in it. Next, it converts this POS tagging into an English sentence pattern and obtains the equivalent Arabic sentence pattern by using reordering rules. Finally, it gets the meaning of the words from bilingual dictionaries and, from the generated Arabic sentence pattern and the agreement and synthesis rules, generates the target text. The proposed methodology design is shown in Figure 1.

Figure 1: The overall methodology design
[Diagram not reproduced. Recoverable labels: SL (source language) input; Syntactic Analysis (Parser, POS); Syntactic Transfer (Find Reorder Rule: Verb, Noun, Adjective, Adverb); Lexical Transfer (Bilingual Dictionary); Syntactic Synthesis (Find Agreement Rule: Gender, Number, Humanity; Generation rules); TL (target language) output.]

2.1 The Analysis Phase

A sentence is a group of words which starts with a capital letter and ends with a full stop. A sentence contains or implies a predicate and a subject. Sentences contain clauses; simple sentences have one clause, and sentences can contain subjects and objects. The subject of a sentence is generally the person or thing carrying out an action, and it comes before the verb. The object of a sentence is involved in an action but does not carry out that action; it comes after the verb. For example: The boy climbed a tree.

If you want to say more about the subject (the boy) or the object (the tree), you can add an adjective, which comes before the noun (whether subject or object). For example: The boy climbed the tall tree.
In English there are many sentence patterns. According to [8], simple declarative sentences (SDS) follow patterns such as:

• Subject - Verb - Object pattern. For example: He likes coffee.
• Subject - Verb - Indirect Object - Direct Object pattern. For example: The teacher gave the student a book.
• Subject - Verb - Adverb pattern. For example: The boy came quickly.
• Adjective - Subject pattern. For example: The small house.
• Subject - Verb - Adjective pattern. For example: He is kind.
• Subject - Verb - Adverb - Adjective pattern. For example: The girl is very smart.

In general, English has eight parts of speech (also called lexical categories): verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions and interjections [8]. Computers need a finer set of POS tags to distinguish between words, to deal with the grammatical structure of a given sentence, and to help resolve some of the morphological ambiguities of words. The tool that assigns these tags is known as a part-of-speech (POS) tagger. Part-of-speech tagging is the process of marking the words of a sentence with their parts of speech. The tags are taken from a tag set, which is a predefined list of tags. Table 1 shows the well-known Penn TreeBank tags [12].

Table 1: The Penn TreeBank project tag set

In the proposed MT system we used the OpenNLP POS tagger, which relies on the Penn TreeBank tag set. POS taggers, like other natural-language tools, follow either a rule-based or a corpus-based paradigm. Rule-based taggers use a set of rules to compute the tags of a new sentence, while corpus-based taggers learn how to tag new inputs from a large tagged corpus. Hybrid taggers also exist. The OpenNLP POS tagger uses large corpus files to distinguish the parts of speech of words. The following are some of them: (gen.nbin, location.nbin, num.nbin, money.nbin, organization.nbin, person.nbin, time.nbin, etc.)

For example, given the sentence “They are two good boys”, the OpenNLP POS tagger produces the following Penn TreeBank tags for its words:

They/PRP are/VBP two/CD good/JJ boys/NNS

2.2 The Transfer Phase

The second phase is the transfer phase, in which a transformation is applied to the English sentence pattern to construct its Arabic equivalent. Once the POS tagging process is complete, we store the POS tags of all the words of the given sentence in an array, so that each word can be handled by its index. From this array we derive the English pattern of the sentence according to the English grammar mentioned earlier: the subject comes before the verb, the object comes after the verb, the adjective comes before the noun, and the adverb comes after the verb [2].

The second step is transferring the English sentence pattern obtained in the first step to its equivalent Arabic sentence pattern according to the English-Arabic comparison pattern table [2]. This step is done by swapping indexes in the array of POS tags and the array of words.
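
As a minimal sketch of this index-based reordering, the Python snippet below parses a tagged sentence in the word/TAG form shown in the tagger example into parallel arrays of words and POS tags. It then moves the verb in front of a noun subject (reorder Rule1 in Section 2.2.1) and leaves the order unchanged for a pronoun subject (Rule2 in Section 2.2.2). The function names and the way the verb is located are illustrative assumptions rather than the authors' implementation, and only these two reorder rules are covered.

```python
def parse_tagged(tagged):
    """Split 'word/TAG' tokens into parallel word and POS-tag arrays."""
    words, tags = [], []
    for token in tagged.split():
        word, _, tag = token.rpartition("/")
        words.append(word)
        tags.append(tag)
    return words, tags


def reorder_subject_verb(words, tags):
    """Turn S V (O) into V S (O) when the subject is a noun (hypothetical helper)."""
    # Locate the first verb; Penn TreeBank verb tags start with 'VB'.
    verb = next((i for i, t in enumerate(tags) if t.startswith("VB")), None)
    if verb is None or verb == 0:
        return words, tags
    subject_tags = tags[:verb]
    # Pronoun subject (PRP): the order stays as it is.
    if "PRP" in subject_tags:
        return words, tags
    # Noun subject (NN, NNS, NNP, NNPS): the verb moves before the subject.
    if any(t.startswith("NN") for t in subject_tags):
        order = [verb] + list(range(verb)) + list(range(verb + 1, len(tags)))
        return [words[i] for i in order], [tags[i] for i in order]
    return words, tags


words, tags = parse_tagged("The/DT boys/NNS ran/VBD")
print(reorder_subject_verb(words, tags)[0])  # ['ran', 'The', 'boys']
```

Applying one index permutation to both arrays keeps each word aligned with its tag, which is what later dictionary lookup and agreement steps would rely on.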

Table 2: English-Arabic comparison pattern table

English Sentence Pattern | Arabic Sentence Pattern
S V | V S
E.g. The boys ran | ركض الأولاد
S V O | V S O
E.g. The child drank the milk | شرب الطفل الحليب
S V Oi Od | V S Oi Od
E.g. The teacher gave the student a book | أعطى المعلم الطالب كتاب
S V Cs | S Cs, or S V Cs, or V S Cs
E.g. Ali is kind | علي لطيف
E.g. Ali was sick | كان علي مريض
E.g. Ali came quickly | جاء علي بسرعة
E.g. Ali is very smart | علي ذكي جدا
Cs O | O Cs
E.g. The small house | البيت الصغير
S V O Co | S V O Co
E.g. They elected him president | (هم) انتخبوه رئيس
S V Oc | V S Oc
E.g. Ali lives a good life | يعيش علي حياة جيدة

Where S: Subject, V: Verb, O: Object, Od: Direct Object, Oi: Indirect Object, Cs: Subject complement, which may be an adjective, an adverb or both, Co: Object complement, which may be a noun or an adjective, Oc: Cognate object, which is an adjective followed by a noun.

Based on this table, we wrote the reordering rules used in our MT system to obtain a correct Arabic sentence pattern from the English one. These rules include the following:

2.2.1 Rule1

When the English sentence contains a noun as subject followed by a verb, the verb must precede the subject in the corresponding Arabic sentence [1]. For example, “The boys ran” must be translated into Arabic as “ركض الأولاد”.

2.2.2 Rule2

When the English sentence contains a pronoun as subject followed by a verb, the order stays as it is in the corresponding Arabic sentence. For example, “He runs” must be translated into Arabic as “هو يركض”.

There are also 14 other rules.

2.3 The generation phase

The last phase is the generation phase. It combines extracting the Arabic meaning and the other features of each word from the English-Arabic bilingual dictionary with applying synthesis and agreement rules to the sentence, producing the Arabic sentence as the result of the translation process.

In Arabic, the verb and the adjective invariably change whenever the subject changes in gender and number. Gender in Arabic is basically masculine or feminine, and number is singular, dual, or plural [2]. We added a third feature, humanity, which is either true or false, to obtain more accurate translations. We therefore designed our own English-Arabic bilingual dictionary with the following fields: English word, Arabic word, POS tag, number, gender and humanity. The English word and POS tag fields are filled in automatically from the first phase by the POS tagger, and the other fields are filled in manually through a machine learning form.

Many cases arise during the generation process that must be taken into account and fixed before the resulting Arabic sentence is generated. These cases constitute a set of grammar rules as follows:

2.3.1 Rule1: Adjective-Noun definiteness Agreement

When a sentence contains the article “the” (DT) followed by a noun (NNX): Arabic has no separate word equivalent to “the”, so instead the prefix "ال" is added to the noun (NNX) that follows.
For example: The house → البيت

When a sentence contains the article “the” (DT) followed by an adjective (JJ) followed by a noun (NNX): after a suitable reorder rule is applied, no separate word is added in Arabic; instead the prefix "ال" is added to the noun (NNX) and then to the adjective (JJ).
For example: The small house → البيت الصغير

When a sentence contains the article “a” or “an” (DT) followed by an adjective (JJ) followed by a noun (NNX): these articles must not be translated into Arabic.
For example: A small house → بيت صغير

We also implemented many other rules covering subject-verb agreement, adjective-noun agreement (for number and gender), cardinal number-noun agreement, cardinal number-noun and adjective agreement, cardinal number-pronoun and noun agreement, and personal possessive pronoun-noun agreement.

3. Experiments and Evaluation

We used a test sample of 70 varied simple English declarative sentences selected by human experts in the English language.

3.1 Evaluation Method and Results


In order to evaluate the correctness of the proposed MT system, we developed a suitable evaluation methodology. The following steps describe it:

1. Run the system on the data set.
2. Have a human expert compare the output translations of the proposed MT system, Google MT and Systran MT.
3. Classify the problems that arise from the mismatches between the proposed MT system and the other MT systems.
4. Determine the percentage accuracy on the data set for each MT system by dividing the number of correct test cases by the total number of test cases and multiplying by 100%.
5. Suggest possible solutions for the identified problems and apply the necessary improvements to the MT system.

3.2 Analysis of Results

The results show that 60 sentences were translated correctly by the proposed MT system and 10 were translated incorrectly, so the system still needs some improvements. By comparison, 22 sentences were translated correctly by the Google translator and 14 by the Systran translator. The proposed MT system has the highest accuracy at 85.71%, followed by the Google translator at 31.42% and finally the Systran translator at 20%. Table 3 shows a sample of the tested sentences compared with Google Translate and the Systran system.

4. Conclusions

The outputs of the proposed MT system can be improved only by formalizing our linguistic knowledge and enriching the system with adequate rules to deal with the linguistic issues. Fully automated high quality machine translation (FAHQMT) has not been achieved yet, and much work remains to improve the quality of MT outputs and increase their usefulness. In this project we have presented the need to handle both the agreement and the word reordering problems in machine translation from English to Arabic. We proposed a system which uses the advantages of the rule-based machine translation (RBMT) approach to solve those problems. The project has dealt with two features that greatly affect the outputs of MT systems. They stem from the fact that different languages have different text orientations, some left-to-right and others right-to-left, and that the order of the words in a sentence also differs from one language to another.

The proposed MT system is restricted to simple English declarative sentences and handles no other sentence type. Extending it to cover compound sentences and possibly other types of sentences will be the next step in future work. That depends on more analysis and demands more grammar rules, so the complexity of the translation process will increase.

Finally, some sub-patterns of the main simple declarative sentence patterns are not yet included in the proposed MT system. This is not because they are hard to handle, but because covering them demands more time, and the project had to be completed and produce sensible results under a time constraint. For example, the sentence “The girls will eat the food” is a sub-pattern of Subject-Verb-Object in which the verb is in the simple future tense. The proposed MT system covers verbs in the present tense, the past tense and the gerund by using the corresponding reorder and agreement rules. For a verb in the simple future tense, however, the sub-pattern is covered when the subject is singular but not when it is plural, as in the sentence “The girl will eat the food”.

References
[1] Abdo, Dawod. Deep Structure of the Sentence in Arabic: Did Verb Subject Object or Subject Verb Object. Dar Al-Karmel, Amman, Jordan, pp. 103-105, 2008.
[2] Alkhuli, Muhammad Ali. Comparative Linguistics: English and Arabic. The National Library, ISBN 9957-401-05-9, Amman, Jordan, 1999.
[3] Al-Sughaiyer, Imad and Al-Kharashi, Ibrahim. Arabic Morphological Analysis Techniques: A Comprehensive Survey. Journal of the American Society for Information Science and Technology, 2004.
[4] Groves, Declan and Way. Hybrid data-driven models of machine translation. Volume 19, Issue 3-4, pp. 301-323, 2005.
[5] Hutchins, W. John. Machine Translation: A Brief History. In: Concise History of the Language Sciences: From the Sumerians to the Cognitivists, edited by E. F. K. Koerner and R. E. Asher. Oxford: Pergamon Press, pp. 431-445, 1995.
[6] Hutchins, W. John. Example-based machine translation: a review and commentary. Published online, Springer Science and Business Media, pp. 6-7, 2006.
[7] Kaji, Hiroyuki. An Efficient Execution Method for Rule-Based Machine Translation. Systems Development Laboratory, Hitachi Ltd., 1099 Ohzenji, Asao, Kawasaki, 215 Japan, 1988.
[8] Khalil, Aziz M. A Contrastive Grammar of English and Arabic. Jordan Book Center, ISBN 978-9957-604-13-4, Amman, Jordan, 2010.
[9] Labaka, Gorka; Stroppa, Nicolas; Way, Andy and Sarasola, Kepa. Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation. Copenhagen, Denmark, 2007.
[10] Ryding, Karin. A Reference Grammar of Modern Standard Arabic. Cambridge University Press, The Edinburgh Building, Cambridge, CB2 2RU, UK, 2005.
[11] Shaalan, Khaled. Rule-based Approach in Arabic Natural Language Processing. International Journal on Information and Communication Technologies, Vol. 3, No. 3, 2010.
[12] The Penn Treebank Project. Computer and Information Science, Penn University. URL http://www.cis.upenn.edu/~treebank/ (viewed on 27/11/06), 2006.
[13] Tripathi, Sneha and Sarkhel, Krishna. Approaches to machine translation. Annals of Library and Information Studies, Vol. 57, pp. 388-393, 2010.

Table 3: Evaluation

Sentence: Sarah writes a letter
Proposed MT system result: تكتب سارة رسالة
Google MT result: سارة يكتب بريد إلكتروني
Systran MT result: سارة يكتب حرف
Human judgment: The researcher's translation is more correct.

Sentence: The boys write a book
Proposed MT system result: يكتب الأولاد كتاب
Google MT result: الأولاد إرسال كتاب
Systran MT result: يكتب الفتى كتاب
Human judgment: The researcher's translation is more correct, but it lacks the tanween on the object.

Sentence: They ate the meat
Proposed MT system result: هم أكلوا اللحم
Google MT result: أكلوا لحوم
Systran MT result: هم أكلوا اللحم
Human judgment: The researcher's and Systran's translations are more correct.

Sentence: The girls were good
Proposed MT system result: كانت البنات جيدات
Google MT result: وكانت الفتيات جيدة
Systran MT result: كان البنت جيّد
Human judgment: The researcher's translation is more correct.

Sentence: The lions eat the meat
Proposed MT system result: تأكل الأسود اللحم
Google MT result: الأسود تأكل اللحم
Systran MT result: يأكل الأسد اللحم
Human judgment: The researcher's and Google's translations are more correct.

Sentence: She needs help
Proposed MT system result: هي تحتاج مساعدة
Google MT result: وهي في حاجة إلى مساعدة
Systran MT result: هو يحتاج مساعدة
Human judgment: The researcher's translation is more correct.

Sentence: It eats the meat
Proposed MT system result: إنها تأكل اللحم
Google MT result: وهو يأكل اللحوم
Systran MT result: هو يأكل اللحم
Human judgment: The researcher's translation is more correct.

Sentence: They need help
Proposed MT system result: هم يحتاجون مساعدة
Google MT result: إنهم بحاجة إلى مساعدة
Systran MT result: هم يحتاجون مساعدة
Human judgment: All three translations are correct.

Sentence: Sarah ate the apple
Proposed MT system result: أكلت سارة التفاحة
Google MT result: سارة أكل التفاح
Systran MT result: سارة أكل التفاح
Human judgment: The researcher's translation is more correct.

Sentence: They elected him president
Proposed MT system result: هم انتخبوه رئيس
Google MT result: إنهم انتخبوه رئيسا
Systran MT result: هم انتخبواه رئيس
Human judgment: Google's translation is more correct; the researcher's translation lacks the tanween on the object.

Sentence: Sarah lives a good life
Proposed MT system result: تعيش سارة حياة جيدة
Google MT result: سارة تعيش حياة جيدة
Systran MT result: سارة يعيش حياة جيد
Human judgment: The researcher's and Google's translations are more correct.

Sentence: Ahmad lives a good life
Proposed MT system result: يعيش احمد حياة جيدة
Google MT result: أحمد يعيش حياة جيدة
Systran MT result: أحمد يعيش حياة جيد
Human judgment: The researcher's and Google's translations are more correct.
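
The accuracy figures reported in Section 3.2 follow from the formula in step 4 of the evaluation methodology: correct test cases over total test cases, multiplied by 100. The short Python check below recomputes them from the reported counts; the function name and the dictionary of counts are assumptions made for illustration. Note that 22/70 rounds to 31.43%, so the reported 31.42% appears to be truncated rather than rounded.

```python
def accuracy_percent(correct, total):
    """Percentage of test sentences that were translated correctly."""
    return 100 * correct / total


# Correct translations out of 70 test sentences, as reported in Section 3.2.
reported_correct = {"Proposed MT": 60, "Google": 22, "Systran": 14}

for system, correct in reported_correct.items():
    print(f"{system}: {accuracy_percent(correct, 70):.2f}%")
# Proposed MT: 85.71%
# Google: 31.43%
# Systran: 20.00%
```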