A Rule-Based English To Arabic Machine Translation Approach: December 2015
All content following this page was uploaded by Dr. Ahmad T. Al-Taani on 23 March 2016.
Abstract: In this study, we propose a rule-based English-to-Arabic machine translation (MT) system for translating simple English declarative sentences into well-structured Arabic sentences. The proposed system translates sentences containing gerunds, infinitives, prepositions, and direct and indirect objects. The system is implemented using a bilingual dictionary designed in SQL Server. A major goal is for the system to be usable as a stand-alone tool and to be integrable with general English-Arabic machine translation systems. The proposed system was evaluated on 70 simple English declarative sentences written by English language experts. Experimental results showed the effectiveness of the proposed MT system in translating simple English declarative sentences into Arabic. The results were compared with two well-known commercial systems, Google Translate and Systran. The proposed system reached an accuracy of 85.71%, while Google reached 31.42% and Systran 20% on the same test sample.
Keywords: Machine translation, Rule-based approach, Bilingual dictionary, Natural Language Processing.
[...] language models trained on very large corpora. Furthermore, it can be implemented in systems which are not EBMT systems themselves.

Researchers have stated that agreement and word ordering are the main problems in MT and play a major role in the quality of sentences translated from English to Arabic. Agreement is a main property of language; it occurs when two words in the appropriate pattern exhibit morphology consistent with their co-occurrence. In the English language, the main case of this linguistic mechanism is number agreement between a subject and a verb [2]. There are several kinds of agreement that we attempt to handle in this study, such as adjective-noun agreement, verb-subject agreement, and pronoun agreement.

More than two-thirds of the linguistic effort in analyzing English is spent on morphology [10]. In most existing systems, incorrect translations occur because agreement and word reordering problems still exist. In this research, we propose an MT system to deal with agreement and reordering problems. The proposed approach can be extended to include other types of English sentences.

2. Methodology

The proposed methodology is flexible and scalable. Its main advantages are, first, that it depends on morphological issues, which are mainly based on the translation rules of the English-Arabic language pair, and second, that it can be applied to several different languages.

The proposed machine translation system uses the transfer-based method. I prefer this method to the others because, with the direct method, the translation is based on dictionaries and word-by-word translation with the same grammatical adjustment. There is no parsing in the direct method, so it is not sufficient for developing the desired machine translation system. Regards to [...] Arabic sentence pattern that has been generated, and, depending on agreement and synthesis rules, it generated the target text. The proposed methodology design is shown in Figure 1.

[Figure 1: The proposed methodology design. Labels recoverable from the figure: Parser, POS; Syntactic Analysis; Syntactic Transfer; Find Reorder Rule (Verb, Noun, Adjective, Adverb); Lexical Transfer; Bilingual Dictionary; Find Agreement Rule (Gender, Number, Humanity); Syntactic Synthesis; Generation rules.]

2.1 The Analysis Phase

A sentence is a group of words which starts with a capital letter and ends with a full stop. A sentence contains or implies a predicate and a subject. Sentences contain clauses; a simple sentence has one clause, and sentences can contain subjects and objects. The subject of a sentence is generally the person or thing carrying out an action, and it comes before the verb. The object of a sentence is involved in an action but does not carry out that action. The object comes after the verb. For example: The boy climbed a tree.

To say more about the subject (the boy) or the object (the tree), an adjective can be added; the adjective comes before the noun (whether subject or object). For example: The boy climbed the tall tree.
In the English language there are many sentence patterns. According to [8], simple declarative sentences (SDS) follow patterns such as the following:

Subject - Verb - Object pattern. For example: He likes coffee.
Subject - Verb - Indirect Object - Direct Object pattern. For example: The teacher gave the student a new book.
Subject - Verb - Adverb pattern. For example: The boy came quickly.
Adjective - Subject pattern. For example: The small house.
Subject - Verb - Adjective pattern. For example: He is kind.
Subject - Verb - Adverb - Adjective pattern. For example: The girl is very smart.

In general, the English language contains eight parts of speech (also called lexical categories): verbs, nouns, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections [8]. Computers need many POS to distinguish between words, to deal with the grammatical structure of a given sentence, and to help resolve some of the morphological ambiguities of words. A tool is therefore needed to handle these POS; such a tool is known as a part-of-speech (POS) tagger. Part-of-speech tagging is the process of marking the words of a sentence with their parts of speech. The tags are taken from a tag set, which is a predefined list of tags. Table 1 shows the well-known Penn TreeBank tags [12].

Table 1: The Penn TreeBank project tag set

In the proposed MT system I used the OpenNLP POS tagger, which depends on the Penn TreeBank tag set. Like other natural-language tools, POS taggers follow either a rule-based paradigm or a corpus-based one. Rule-based taggers use a set of rules to compute the tags of a new sentence, while corpus-based taggers learn how to tag inputs from a large tagged corpus. Hybrid taggers also exist. The OpenNLP POS tagger uses large corpus files to distinguish the parts of speech of words. The following are some of them: gen.nbin, location.nbin, num.nbin, money.nbin, organization.nbin, person.nbin, time.nbin, etc.

For example, given the sentence “They are two good boys”, the OpenNLP POS tagger, which is based on the Penn TreeBank tag set, produces the following tags:

They/PRP are/VBP two/CD good/JJ boys/NNS

2.2 The Transfer Phase

The second phase is the transfer phase, in which a transformation is applied to the English sentence pattern to construct its Arabic equivalent. Once the POS tagging process is complete, I store the POS of all the words of the given sentence in an array, so that each word can be handled by its index. From this array I obtain the English pattern of the sentence according to the rules of English grammar mentioned earlier: the subject comes before the verb, the object comes after the verb, the adjective comes before the noun, and the adverb comes after the verb [2].

The second step is transferring the English sentence pattern obtained in the first step to its equivalent Arabic sentence pattern, according to the English-Arabic comparison pattern table [2]. This step is done by swapping indexes in the array of POS and the array of words.

English pattern | English example | Arabic pattern | Arabic example
S V | The boys ran | V S | ركض الأولاد
S V O | The child drank the milk | V S O | شرب الطفل الحليب
S V Oi Od | The teacher gave the student a book | V S Oi Od | أعطى المعلم الطالب كتاب
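The index-swapping step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `parse_tagged` and `move_verb_first` are hypothetical, the input tags are hand-written Penn TreeBank tags rather than actual OpenNLP output, and the sketch covers only the three patterns in the table, all of which reduce to moving the verb in front of the subject. Auxiliaries such as "will" are not handled.

```python
def parse_tagged(tagged_sentence):
    """Split 'word/TAG' tokens into parallel word and tag arrays."""
    words, tags = [], []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")
        words.append(word)
        tags.append(tag)
    return words, tags


def move_verb_first(words, tags):
    """Reorder S V ... into V S ... by applying one index permutation to both arrays."""
    verb_positions = [i for i, tag in enumerate(tags) if tag.startswith("VB")]
    if not verb_positions:
        raise ValueError("no verb tag found in the sentence")
    v = verb_positions[0]
    # New order: the verb index first, then every other index in its original order.
    order = [v] + [i for i in range(len(tags)) if i != v]
    return [words[i] for i in order], [tags[i] for i in order]


# Hand-tagged input (assumed tags, not produced by OpenNLP here).
words, tags = parse_tagged("The/DT child/NN drank/VBD the/DT milk/NN")
new_words, new_tags = move_verb_first(words, tags)
print(new_words)  # ['drank', 'The', 'child', 'the', 'milk']
print(new_tags)   # ['VBD', 'DT', 'NN', 'DT', 'NN']
```

Keeping words and tags in parallel arrays, as the paper describes, means the same permutation can be applied to both, so each word keeps its tag for the later lexical transfer and agreement steps.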
The International Arab Conference on Information Technology (ACIT’2015)
When the English sentence contains a noun as the subject followed by a verb, the verb must precede the subject in the corresponding Arabic sentence [1]. For example, “The boys ran” must be translated into Arabic as “ركض الأولاد”.

Consider a sentence containing the article “the” (DT), followed by an adjective (JJ), followed by a noun (NNX). After a suitable reorder rule is applied, instead of a separate word being added in Arabic, the prefix "ال" is added to the noun (NNX) and then to the adjective (JJ).
For example: The small house: البيت الصغير
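As a rough illustration of this rule, the following Python sketch reorders a DT JJ NN phrase into noun-adjective order and attaches the prefix "ال" to both words. The two-entry `LEXICON` and the function name `translate_definite_np` are assumptions made for the example; the paper's actual bilingual dictionary is stored in SQL Server and its rule engine is not shown in the text.

```python
AL = "\u0627\u0644"  # the Arabic definite prefix "ال"

# Hypothetical two-entry lexicon standing in for the paper's bilingual dictionary.
LEXICON = {"small": "صغير", "house": "بيت"}


def translate_definite_np(tagged):
    """Translate a ("the"/DT, adjective/JJ, noun/NN*) phrase.

    The noun is placed before the adjective, and the definite prefix
    is attached to both instead of emitting a separate word for "the".
    """
    if len(tagged) != 3:
        raise ValueError("expected exactly three (word, tag) pairs")
    (det, det_tag), (adj, adj_tag), (noun, noun_tag) = tagged
    matches = (
        det_tag == "DT"
        and det.lower() == "the"
        and adj_tag == "JJ"
        and noun_tag.startswith("NN")
    )
    if not matches:
        raise ValueError("expected the pattern the/DT JJ NN*")
    return f"{AL}{LEXICON[noun.lower()]} {AL}{LEXICON[adj.lower()]}"


print(translate_definite_np([("The", "DT"), ("small", "JJ"), ("house", "NN")]))
```

For the phrase "The small house" this prints the noun with the prefix followed by the adjective with the prefix, matching the example in the text.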
In order to evaluate the correctness of the proposed MT system, we developed a suitable evaluation methodology. The following steps describe it:

1. Run the system on the data set.
2. Have a human expert compare the output translations of the proposed MT system, Google MT, and Systran MT.
3. Classify the problems that arise from the mismatches between the proposed MT system and the other MT systems.
4. Determine the percentage accuracy on the data set for each MT system by dividing the number of correct test cases by the total number of test cases and multiplying by 100%.
5. Suggest possible solutions for the identified problems and apply the necessary improvements to the MT system.

3.2 Analysis of Results

The results show that the proposed MT system translated 60 sentences correctly and 10 incorrectly, so it still needs some improvement. By comparison, Google Translate translated 22 sentences correctly and Systran translated 14. The proposed MT system has the highest accuracy, 85.71%, followed by Google Translate with 31.42% and Systran with 20%. Table 3 shows a sample of the tested sentences compared with Google Translate and Systran.

4. Conclusions

The outputs of the proposed MT system can be enhanced only by formalizing our linguistic knowledge and enriching the system with adequate rules to deal with the linguistic issues. Fully automated high quality machine translation (FAHQMT) has not been achieved yet. There is a lot of work that we can do to improve the quality of MT outputs and increase their usefulness. In this project I have presented the need to handle both the agreement and the word reordering problems in machine translation from English to Arabic. I proposed a system which uses the advantages of the rule-based machine translation (RBMT) approach to solve those problems. The project has dealt with two features that greatly affect the outputs of MT systems. The first comes from the fact that different languages have different text orientations: some are left-to-right and others are right-to-left. The second is that the order of the words in a sentence differs from one language to another.

The proposed MT system is restricted to simple English declarative sentences and covers no other sentence type. Extending the current MT system to cover compound sentences, and possibly other types of sentences, will be the next step in future work. That depends on more analysis and demands more grammar rules, so the complexity of the translation process will increase.

Finally, some sub-patterns of the main simple declarative sentence patterns are not yet included in the proposed MT system. This is not because they are hard to handle, but because covering them demands more time, and there was a time constraint to complete the project and get sensible results. For example, the sentence “The girls will eat the food” is a sub-pattern of Subject-Verb-Object in which the verb is in the simple future tense. The proposed MT system covers verbs in the present tense, the past tense, and the gerund by using the corresponding reorder and agreement rules. For verbs in the simple future tense, however, the sub-pattern is covered when the subject is singular but not when it is plural. For example, the sentence “The girl will eat the food” is covered.

References

[1] Abdo, Dawod. Deep Structure of the Sentence in Arabic: Did Verb Subject Object or Subject Verb Object. Dar Al-Karmel, Amman, Jordan, pp. 103-105, 2008.
[2] Alkhuli, Muhammad Ali. Comparative Linguistics: English and Arabic. The National Library, ISBN 9957-401-05-9, Amman, Jordan, 1999.
[3] Al-Sughaiyer, Imad and Al-Kharashi, Ibrahim. Arabic Morphological Analysis Techniques: A Comprehensive Survey. Journal of the American Society for Information Science and Technology, 2004.
[4] Groves, Declan and Way. Hybrid data-driven models of machine translation. Volume 19, Issue 3-4, pp. 301-323, 2005.
[5] Hutchins, W. John. Machine Translation: A Brief History. In: Concise History of the Language Sciences: From the Sumerians to the Cognitivists. Edited by E.F.K. Koerner and R.E. Asher. Oxford: Pergamon Press, pp. 431-445, 1995.
[6] Hutchins, W. John. Example-based machine translation: a review and commentary. Published online, Springer Science and Business Media, pp. 6-7, 2006.
[7] Kaji, Hiroyuki. An Efficient Execution Method for Rule-Based Machine Translation. Systems Development Laboratory, Hitachi Ltd., 1099 Ohzenji, Asao, Kawasaki, 215 Japan, 1988.
[8] Khalil, Aziz M. A Contrastive Grammar of English and Arabic. Jordan Book Center, ISBN 978-9957-604-13-4, Amman, Jordan, 2010.
[9] Labaka, Gorka, Stroppa, Nicolas, Way, Andy, and Sarasola, Kepa. Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation. Copenhagen, Denmark, 2007.
[10] Ryding, Karin. A Reference Grammar of Modern Standard Arabic. Cambridge University Press, The Edinburgh Building, Cambridge, CB2 2RU, UK, 2005.
[11] Shaalan, Khaled. Rule-based Approach in Arabic Natural Language Processing. International Journal on Information and Communication Technologies, Vol. 3, No. 3, 2010.
[12] The Penn Treebank Project. Computer and Information Science, Penn University. URL http://www.cis.upenn.edu/~treebank/ (viewed on 27/11/06), 2006.
[13] Tripathi, Sneha and Sarkhel, Krishna. Approaches to machine translation. Annals of Library and Information Studies, Vol. 57, pp. 388-393, 2010.
Table 3: Evaluation
Sentence | My MT system results | Google MT results | Systran MT results | Human judgment
Sarah writes a letter | تكتب سارة رسالة | سارة يكتب بريد إلكتروني | سارة يكتب حرف | The researcher's translation is more correct
The boys write a book | يكتب الأولاد كتاب | الأولاد إرسال كتاب | يكتب الفتى كتاب | The researcher's translation is more correct; it lacks tanween on the object
They ate the meat | هم أكلوا اللحم | أكلوا لحوم | اللحم هم أكلوا | The researcher's and Systran's translations are more correct
The girls were good | كانت البنات جيدات | وكانت الفتيات جيدة | جيّد البنت كان | The researcher's translation is more correct
The lions eat the meat | تأكل الأسود اللحم | الأسود تأكل اللحم | اللحم الأسد يأكل | The researcher's and Google's translations are more correct
She needs help | هي تحتاج مساعدة | وهي في حاجة إلى مساعدة | مساعدة يحتاج هو | The researcher's translation is more correct
It eats the meat | إنها تأكل اللحم | وهو يأكل اللحوم | اللحم يأكل هو | The researcher's translation is more correct
They need help | هم يحتاجون مساعدة | إنهم بحاجة إلى مساعدة | مساعدة يحتاجون هم | All three translations are correct
Sarah ate the apple | أكلت سارة التفاحة | سارة أكل التفاح | سارة أكل التفاح | The researcher's translation is more correct
They elected him president | هم انتخبوه رئيس | إنهم انتخبوه رئيسا | هم انتخبواه رئيس | Google's translation is more correct; the researcher's translation lacks tanween on the object
Sarah lives a good life | تعيش سارة حياة جيدة | سارة تعيش حياة جيدة | سارة يعيش حياة جيد | The researcher's and Google's translations are more correct
Ahmad lives a good life | يعيش احمد حياة جيدة | أحمد يعيش حياة جيدة | أحمد يعيش حياة جيد | The researcher's and Google's translations are more correct
They elected him president | هم انتخبوه رئيس | إنهم انتخبوه رئيسا | هم انتخبواه رئيس | Google's translation is more correct; the researcher's translation lacks tanween on the object
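The accuracy figures reported in Section 3.2 follow from step 4 of the evaluation methodology (correct test cases divided by total test cases, multiplied by 100). The short Python sketch below reproduces that arithmetic from the counts given in the paper (60, 22, and 14 correct out of 70); the function name `accuracy_percent` is an illustrative choice, not something from the paper. Note that 22/70 is 31.4286%, which rounds to 31.43%; the paper's figure of 31.42% corresponds to truncating rather than rounding.

```python
def accuracy_percent(correct, total):
    """Percentage accuracy: correct test cases over total test cases, times 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts of correctly translated sentences reported in Section 3.2.
TOTAL_SENTENCES = 70
correct_counts = {
    "Proposed MT system": 60,
    "Google Translate": 22,
    "Systran": 14,
}

for system, correct in correct_counts.items():
    print(f"{system}: {accuracy_percent(correct, TOTAL_SENTENCES):.2f}%")
# Proposed MT system: 85.71%
# Google Translate: 31.43%
# Systran: 20.00%
```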