
International Journal of Computer Applications (0975 – 8887)

Volume 63– No.22, February 2013

Rule based Approach for Prepositional Phrase Attachment in English-Tamil Translation

S. Suganthi
Dept of CSE-PG, National Engineering College, K.R. Nagar, Kovilpatti, Tamil Nadu, India.

P. Bama Ruckmani
Asst. Professor/CSE-PG, National Engineering College, K.R. Nagar, Kovilpatti, Tamil Nadu, India.

K.G. Srinivasagan, PhD.
Prof. & Head/CSE-PG, National Engineering College, K.R. Nagar, Kovilpatti, Tamil Nadu, India.

M. Saravanan
Dept of CSE/PG, National Engineering College, K.R. Nagar, Kovilpatti, Tamil Nadu, India.

ABSTRACT

This paper falls within the field of Natural Language Processing (NLP), of which Machine Translation is a major application area. The main aim of this work is to improve translation quality, especially for English-Tamil translation. Many researchers have already worked in this field, but human expectations have not yet been met, so translation research continues. In English-Tamil translation, prepositional phrase attachment and Tamil orthographic errors are the major issues. Different kinds of prepositions are commonly used in English; in the context of Tamil translation, this work focuses on English prepositions alone. English prepositions are always treated as postpositions in Tamil, and the placement of the postposition depends on time, place, direction and context. In some contexts a preposition may take on a distinct meaning, and in such scenarios the problem of word sense ambiguity arises. To resolve word sense ambiguity and word reordering, an algorithm called "Prepositional Phrase Attachment" is proposed. The system handles the frequently used prepositions "of", "in", "to", "on", "by" and "from". The correct meaning of a preposition is obtained through this work.

Keywords

Machine Translation, Prepositional Phrase Attachment, Orthographical Rules, POS tagging, Words Reordering.

1. INTRODUCTION

Language plays a major role in all sectors. Knowledge of languages is needed to communicate with people worldwide and to access scientific resources in the major fields. Literacy in the mother tongue is no longer enough to follow information supplied in other languages, so it is necessary to bridge the gap with the help of modern technologies as early as possible. In this context machine translation is essential. The main issue in translation is the structural order of the sentence. Sentences are built from parts such as subject, verb and object, and their structural order varies from language to language. Word reordering is a difficult and important task because the order of words may affect the original meaning of a sentence. English and Tamil have different structural orders. Fig. 1 illustrates the structural order of English and Tamil.

English: Police caught the thief
Tamil: போலீஸ் திருடனை பிடித்தார்

Figure 1. The structural order of English and Tamil

The existing system (Google Translate) already addresses machine translation for almost 40 languages. In English-Tamil translation, however, its accuracy is not satisfactory in many cases, particularly when dealing with prepositions and spelling errors. To obtain a correct English to Tamil translation, the correct preposition, correct spelling and correct word structure are essential. The rest of the paper is organized as follows: related work is discussed in Section 2, the framework and proposed algorithm are presented in Sections 3 and 4, experimental results and discussion are reported in Section 5, followed by concluding remarks in Section 6.

2. RELATED WORK

Earlier, many researchers have concentrated on language translation with respect to prepositions, idioms, phrasal verbs, and the conversion of complex sentences into simple sentences. They obtained and reported different levels of accuracy. These results may not fulfil users' needs.

Michael Collins et al [1] proposed a backed-off method for handling prepositional phrase attachment, which performs appreciably better than other methods that have been tested on the Wall Street Journal corpus. Their algorithm has the additional advantages of being conceptually simple and computationally inexpensive to implement. They reported an accuracy of 84.5%, which is close to the human performance figure of 88% using the four head words alone. Sanda M. Harabagiu [2] presented a method for word sense disambiguation and coherence understanding of prepositional relations. The method is to classify prepositional attachments according to semantic equivalence of phrase heads and then apply


inferential heuristics for understanding the validity of prepositional structures. The paper proposes a method of extracting and validating semantic relations for prepositional attachment.

Sudip Kumar Naskar et al [4] studied the handling of English prepositions in Bengali with reference to a machine translation system from English to Bengali. In machine translation, sense disambiguation of a preposition is necessary when the target language has different representations for the same preposition. In Bengali, the choice of the appropriate inflection depends on the spelling of the reference object. The choice of the postpositional word depends on the semantic information about the reference object obtained from WordNet. V. Dhanalakshmi et al [5] presented grammar teaching tools for analyzing and learning the characters, words and sentences of the Tamil language. Tools such as a character analyzer for character-level analysis, a morphological analyzer and generator and a verb conjugator for word-level analysis, and a parts of speech tagger, chunker and dependency parser for sentence-level analysis were developed using machine learning based technology.

Poornima C et al [6] proposed a rule based technique for simplifying complex sentences into simple sentences based on connectives such as pronouns and coordinating and subordinating conjunctions, without changing the meaning of the sentence. This method is useful as a preprocessing tool for machine translation. It has been shown that the splitting technique can lead to remarkable improvements in a machine translation system. Dr. S. Saraswathi et al [7] developed a bilingual translation system for English and Tamil using a hybrid approach. They use Rule based machine translation (RBMT) and Knowledge based machine translation (KBMT) techniques. New rules were added to the proposed system in order to make it more efficient.

Matt Post et al [8] described the collection of six parallel corpora containing four-way redundant translations of the source-language text. The Indian languages of these corpora are low-resource and understudied, and exhibit markedly different linguistic properties compared to English. They performed baseline experiments and suggested a number of approaches that could improve the quality of models constructed from the datasets. S. Lakshmana Pandian et al [9] present an effective methodology for English to Tamil translation. They implemented a rule based approach which involves segmentation and tagging, rule based reordering, morphological analysis and dictionary based translation to the target language. The errors in the translated sentences are then corrected by applying a statistical technique. Since a word in English has multiple meanings in Tamil, an effective word dictionary file is needed in order to achieve better results in translation.

P G Thiruumeni et al [10] provide a technique for handling idioms and phrasal verbs during the translation process, which increases the accuracy of the translation. The BLEU and NIST scores calculated before and after handling the phrasal verbs and idioms show a significant increase in the accuracy of the translation. The proposed technique, used in an English to Tamil machine translation system, can be incorporated into a machine translation system from English to any language. This approach can be used in both rule based and factored statistical machine translation with some modifications.

3. PREPOSITIONAL PHRASE ATTACHMENT

Syntactically, prepositions can be arranged into three classes: simple prepositions, compound prepositions and phrase prepositions. Different kinds of prepositions are commonly used in English; in the context of Tamil translation, this work focuses on English prepositions alone. English prepositions are always treated as postpositions in Tamil. The placement of the postposition is based on 'time', 'place or direction' and 'context'. In some contexts a preposition may take on a distinct meaning, and in such scenarios the problem of word sense ambiguity arises. To resolve the word sense ambiguity problem and word reordering, an algorithm called "prepositional attachment algorithm" is proposed.

A rule based approach is used in the proposed algorithm to solve the above mentioned problem. The correct meaning of a preposition is chosen using a rule based approach and is placed in the correct position using the semantic structure of the target language. Our system handles the frequently used prepositions 'of', 'in', 'to', 'on', 'by' and 'from'. The semantic rules are based entirely on the parts of speech in a given sentence. The Penn Treebank tag set, which contains approximately 36 tags, is used for assigning the POS tags. In order to overcome word sense ambiguity, the proposed algorithm extracts a triplet of words (the word preceding the preposition, the preposition, and the word succeeding the preposition).

The proposed algorithm consists of the following steps:

1. Segment the sentences from the larger paragraph based on delimiters such as "." and "?".
2. Assign a POS tag to every word in a given sentence using the Penn Treebank tag set.
3. Check whether a preposition is present in the current sentence.
4. If a preposition is found, extract the triplet of tags.
5. Make a decision based on the middle word of the triplet.
6. Reorder the words based on the semantic structure of the target language.

The text can be in any form, either individual sentences or paragraphs. A paragraph is split into simple sentences using delimiters such as "." and "?". Each word of the resulting sentence is assigned a part of speech using the Penn Treebank tag set. The rules are generated based on the POS tags.

POS (Parts of Speech) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. POS tagging is used to assign the grammatical category, that is, to perform word category disambiguation, for every English word in English to Tamil translation. In this paper the Penn Treebank tag set, which has nearly 36 tags, is used. Table I illustrates some examples from the Penn Treebank POS (Parts Of Speech) tag set.

Rules of the prepositional phrase "of"

1. If the order of triplets present in a given sentence is <NN><IN><DT> or <NN><IN><NN> then the meaning


of a prepositional phrase should be "udaiya/in" [உடைய/இன்].

2. If the order of triplets present in a given sentence is <NN><IN><JJ> then the meaning of a prepositional phrase is given as "kkaana" [க்கான].

3. If the order of triplets present in a given sentence is <RB><IN><NNP> then the meaning of a prepositional phrase is given as "il" [இல்].

4. If the order of triplets present in a given sentence is <VBN><IN><IN> then the meaning of a prepositional phrase is given as "aal" [ஆல்].

Rules of the prepositional phrase "from"

If the order of triplets present in a given sentence is <POSP1><IN><POSP2> then the meaning of a prepositional phrase is "irunthu".

Rules of the prepositional phrase "by"

If the order of triplets present in a given sentence is <POSP1><IN><POSP2> then the meaning of a prepositional phrase is "aal".

Rules of the prepositional phrase "on"

If the order of triplets present in a given sentence is <POSP1><IN><POSP2> then the meaning of a prepositional phrase is "mele/il".

Rules of the prepositional phrase "in"

If the order of triplets present in a given sentence is <POSP1><IN><POSP2> then the meaning of a prepositional phrase is "il".

Rules of the prepositional phrase "to"

If the order of triplets present in a given sentence is <POSP1><IN><POSP2> then the meaning of a prepositional phrase is "kku".

The proposed system is designed to handle the prepositions 'of', 'in', 'to', 'on', 'by' and 'from'. Table 1 shows the Tamil meaning of these prepositions.

S.No   Preposition   Meaning in Tamil
1      of            1. udaiya [உடைய]
                     2. in [இன்]
                     3. kkaana [க்கான]
                     4. aal [ஆல்]
                     5. il [இல்]
2      in            il [இல்]
3      to            kku [க்கு]
4      on            mele [மேலே]
5      by            aal [ஆல்]
6      from          irunthu [இருந்து]

Table 1. Tamil meaning of prepositions 'of', 'in', 'to', 'on', 'by', 'from'

The detailed algorithm for prepositional phrase attachment is as follows:

Algorithm for attaching the Prepositional phrase "of"

Let the paragraph be "P", split into "S1", "S2", "S3" … "Sn"
Split S1 into the segments "W1", "W2", "W3" … "Wn"
For i = 1 to n
    W1 … Wn <- <POS tag>
    From (1 to n)
        Check whether a "Preposition" is present or not
        If so
            Flag = 1
        Else
            Flag = 0
        End if
    If (Flag == 1)
        For i = 1 to n
            Extract the triplet term <POS(P1)><IN><POS(P2)> and store it in "T"
            If ((<IN> == "of") && (<POS(P1)> == <NN>) && (<POS(P2)> == <NN>)) ||
               ((<IN> == "of") && (<POS(P1)> == <NN>) && (<POS(P2)> == <DT>))
            Then
                <IN> <- 'udaiya/in'
                T <- <POS(P2)> || <IN> || <POS(P1)>
                Main Phrase <- T
            Else
                If ((<IN> == "of") && (<POS(P1)> == <NN>) && (<POS(P2)> == <JJ>))
                Then
                    <IN> <- 'kkaana'
                    T <- <POS(P2)> || <IN> || <POS(P1)>
                    Main Phrase <- T
                Else
                    If ((<IN> == "of") && (<POS(P1)> == <RB>) && (<POS(P2)> == <NNP>))
                    Then
                        <IN> <- 'il'


                        T <- <POS(P2)> || <IN> || <POS(P1)>
                        Main Phrase <- T
                    Else
                        If ((<IN> == "of") && (<POS(P1)> == <VBN>) && (<POS(P2)> == <NN>))
                        Then
                            <IN> <- 'aal'
                            T <- <POS(P2)> || <IN> || <POS(P1)>
                            Main Phrase <- T
                        End if
                    End if
                End if
            End if
    If (Flag == 0)
        Main Phrase <- (W1 || W2 … || Wn)
    End if;
End for.

4. ORTHOGRAPHICAL RULES

An orthography is a standard system for using a particular writing system for a language. Orthographic rules are the standard spelling rules that specify the changes that occur when two morphemes are combined. An example: singular English words ending in -y end in -ies when pluralized. Orthographic rules are language dependent, so they have to be framed for each language with its linguistics in mind. Orthography includes rules of spelling, and may also concern other elements of the written language such as punctuation and capitalization. If a language uses multiple writing systems, it may have distinct orthographies, as is the case with Kurdish, Uyghur, Serbian, Inuktitut and Turkish. In some cases orthography is regulated by bodies such as language academies, although for many languages (including English) there are no such authorities, and orthography develops through less formal processes. The existing system does not address orthography during translation. In order to improve the translation quality, orthographic rules are essential, and better accuracy can be achieved through these rules. Generally, Tamil nouns end with

Vowels: ஆ, இ, ஈ, உ, ஊ, ஐ and Consonants: ண், ம், ய், ர், ல், ழ், ள், ன்

Rule 1:
If the suffix begins with a consonant and the word ends in a consonant (ண், ம், ய், ர், ல், ள், ழ், ன்) then insert 'உ' in between.
நண்பர் + க்கு => நண்பர் + உ + க்கு => நண்பருக்கு
c + c

Rule 2:
If the suffix begins with a vowel and the word ends in a consonant then the consonant and the vowel are combined to form a new letter.
நண்பர் + ஆல் => நண்பரால்
c + v
லண்டன் + இல் => லண்டனில்
c + v

Rule 3:
If the suffix begins with a consonant and the word ends in a vowel then join both words without changing any letter.
சென்னை + க்கு => சென்னைக்கு
v + c

Rule 4:
If the suffix begins with a vowel sound and the word ends in இ, ஈ, ஏ or ஐ, insert 'ய்' in between.
கவிதை + இல் => கவிதை + ய் + இல் => கவிதையில்
c + v

5. EXPERIMENTAL RESULTS

The proposed framework and algorithm were tested with 250 sentences of text from newspapers and articles. All the sentences are used for training. In order to evaluate the system, we applied 250 test sentences, of which 220 were correctly translated. Precision and recall of words are widely used metrics for evaluating the efficiency of machine translation systems. Precision is the percentage of generated words that are actually correct. Recall is the percentage of words that are generated and that are actually found in the reference translation. F-Measure is the harmonic mean of recall and precision.

\[ \text{Precision} = \frac{\text{No. of correctly generated words}}{\text{Total no. of words}} = 88\% \]

\[ \text{Recall} = \frac{\text{No. of correctly generated words}}{\text{Total no. of translated words}} = 93\% \]

\[ \text{F-Measure} = \frac{\text{Precision} \times \text{Recall}}{(\text{Precision} + \text{Recall}) / 2} = 90\% \]

A comparative study was made with Google Translate. The same set of data was used with Google Translate, out of which 130 sentences are correct; the main reasons are errors in the semantic analysis of prepositions and in reordering. We obtained precision, recall and F-measure of 88%, 93% and 90% respectively, as shown in Table 2. Fig. 2 shows clearly that the proposed system performs better with respect to all the metrics.
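As a quick check of the formulas in the experimental results, the following minimal Python sketch recomputes the three metrics from word counts. The function name `translation_metrics` and its parameter names are illustrative assumptions and are not taken from the original Java system; the formulas are the ones written in this section, and the counts in the usage example are those reported for the proposed system in Table 2.

```python
def translation_metrics(correct_words, total_words, translated_words):
    """Return (precision, recall, f_measure) as percentages.

    Uses the formulas exactly as stated in this section:
      precision = correctly generated words / total no. of words
      recall    = correctly generated words / total no. of translated words
      f_measure = P * R / ((P + R) / 2), i.e. the harmonic mean of P and R
    """
    if total_words <= 0 or translated_words <= 0:
        raise ValueError("word counts must be positive")
    precision = 100.0 * correct_words / total_words
    recall = 100.0 * correct_words / translated_words
    if precision + recall == 0:
        # No correct words at all: define the harmonic mean as zero.
        return precision, recall, 0.0
    f_measure = precision * recall / ((precision + recall) / 2)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Counts reported for the proposed system in Table 2.
    p, r, f = translation_metrics(correct_words=1120,
                                  total_words=1270,
                                  translated_words=1200)
    print(f"Precision={p:.1f}%  Recall={r:.1f}%  F-Measure={f:.1f}%")
```

With the Table 2 counts this gives roughly 88.2%, 93.3% and 90.7%, in line with the reported 88%, 93% and 90% once the figures are rounded.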

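To make the triplet rules of Section 3 concrete, here is a minimal Python sketch of the rule lookup and the reordering step T <- P2 || IN || P1. It is an illustration under stated assumptions, not the authors' Java implementation: the input is assumed to be already POS-tagged, the names `OF_RULES`, `SINGLE_SENSE`, `choose_postposition` and `attach_prepositions` are hypothetical, the output uses the romanized postpositions rather than translated Tamil words, and for the fourth "of" rule it follows the pseudocode (<VBN><IN><NN>) rather than the rule list (<VBN><IN><IN>).

```python
# Romanized postpositions for "of", keyed by (tag before, tag after).
OF_RULES = {
    ("NN", "DT"): "udaiya/in",
    ("NN", "NN"): "udaiya/in",
    ("NN", "JJ"): "kkaana",
    ("RB", "NNP"): "il",
    ("VBN", "NN"): "aal",  # assumption: pseudocode variant of rule 4
}

# Prepositions whose postposition does not depend on the neighbouring tags.
SINGLE_SENSE = {
    "in": "il",
    "to": "kku",
    "on": "mele/il",
    "by": "aal",
    "from": "irunthu",
}

# The paper writes <IN> for every preposition; Penn Treebank taggers
# label "to" as TO, so both tags are accepted here.
PREP_TAGS = {"IN", "TO"}


def choose_postposition(prev_tag, preposition, next_tag):
    """Return the romanized postposition for a triplet, or None if no rule fires."""
    prep = preposition.lower()
    if prep == "of":
        return OF_RULES.get((prev_tag, next_tag))
    return SINGLE_SENSE.get(prep)


def attach_prepositions(tagged):
    """Rewrite each matched (P1, IN, P2) triplet as P2, postposition, P1.

    `tagged` is a list of (word, tag) pairs. Words that are not part of a
    matched triplet are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(tagged):
        has_triplet = i + 2 < len(tagged) and tagged[i + 1][1] in PREP_TAGS
        if has_triplet:
            (w1, t1), (prep, _), (w2, t2) = tagged[i:i + 3]
            post = choose_postposition(t1, prep, t2)
            if post is not None:
                out.extend([w2, post, w1])  # T <- P2 || IN || P1
                i += 3
                continue
        out.append(tagged[i][0])
        i += 1
    return out


if __name__ == "__main__":
    print(attach_prepositions([("price", "NN"), ("of", "IN"), ("rice", "NN")]))
    print(attach_prepositions(
        [("He", "PRP"), ("came", "VBD"), ("from", "IN"), ("Chennai", "NNP")]))
```

Keeping the rules in dictionaries rather than nested if/else blocks makes it easy to add further triplets; a triplet with no matching rule is left in its original order.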
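The four orthographical rules of Section 4 can likewise be sketched in a few lines of Python operating on Unicode Tamil strings. This is a simplified illustration, not the authors' implementation: the function name `join_suffix` is hypothetical, a word is treated as consonant-final whenever it ends in the pulli sign (the paper lists specific final consonants), and vowel-final words not covered by Rule 4 are simply concatenated.

```python
PULLI = "\u0bcd"  # the dot that marks a bare consonant, as in ர்

# Independent vowel letter -> dependent vowel sign written after a consonant.
VOWEL_SIGN = {
    "\u0b85": "",        # அ (inherent vowel, no sign)
    "\u0b86": "\u0bbe",  # ஆ
    "\u0b87": "\u0bbf",  # இ
    "\u0b88": "\u0bc0",  # ஈ
    "\u0b89": "\u0bc1",  # உ
    "\u0b8a": "\u0bc2",  # ஊ
    "\u0b8e": "\u0bc6",  # எ
    "\u0b8f": "\u0bc7",  # ஏ
    "\u0b90": "\u0bc8",  # ஐ
    "\u0b92": "\u0bca",  # ஒ
    "\u0b93": "\u0bcb",  # ஓ
    "\u0b94": "\u0bcc",  # ஔ
}

# Word endings that count as இ, ஈ, ஏ, ஐ for Rule 4 (vowel signs or bare vowels).
FRONT_ENDINGS = {"\u0bbf", "\u0bc0", "\u0bc7", "\u0bc8",
                 "\u0b87", "\u0b88", "\u0b8f", "\u0b90"}

GLIDE_YA = "\u0baf"  # ய


def join_suffix(word, suffix):
    """Attach a postposition to a Tamil word using Rules 1-4."""
    if not word or not suffix:
        return word + suffix
    ends_in_consonant = word.endswith(PULLI)
    starts_with_vowel = suffix[0] in VOWEL_SIGN
    if ends_in_consonant and not starts_with_vowel:
        # Rule 1 (c + c): insert உ, written as a vowel sign on the final consonant.
        return word[:-1] + VOWEL_SIGN["\u0b89"] + suffix
    if ends_in_consonant and starts_with_vowel:
        # Rule 2 (c + v): fuse the final consonant with the suffix vowel.
        return word[:-1] + VOWEL_SIGN[suffix[0]] + suffix[1:]
    if not starts_with_vowel:
        # Rule 3 (v + c): plain concatenation.
        return word + suffix
    if word[-1] in FRONT_ENDINGS:
        # Rule 4: insert ய் and fuse it with the suffix vowel.
        return word + GLIDE_YA + VOWEL_SIGN[suffix[0]] + suffix[1:]
    # Not covered by the four rules: fall back to concatenation.
    return word + suffix


if __name__ == "__main__":
    for w, s in [("நண்பர்", "க்கு"), ("நண்பர்", "ஆல்"),
                 ("சென்னை", "க்கு"), ("கவிதை", "இல்")]:
        print(w, "+", s, "=>", join_suffix(w, s))
```

The sketch works on code points: removing the pulli and appending a vowel sign is what "combining" a consonant with a vowel amounts to in Unicode Tamil text.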

Table 2. Experimental analysis of various metrics

System             Total no. of   Total no.   Total no. of       No. of correct   No. of correct   P (%)   R (%)   F (%)
                   sentences      of words    translated words   sentences        words
Proposed System    250            1270        1200               220              1120             88      93      90
Google Translate   250            1270        1200               180              960              75      80      77

P = Precision, R = Recall, F = F-Measure

[Figure 2: bar chart of accuracy (%) by metric. Proposed System: Precision 88, Recall 93, F-Measure 90. Google Translate: Precision 75, Recall 80, F-Measure 77.]

Figure 2. Comparative study of various metrics

6. CONCLUSION

English-Tamil translation using a semantic approach for prepositional phrase attachment is implemented in a Java environment. In this work, we identified the exact meaning of the preposition with respect to the content, and its placement, for English-Tamil translation. The main issue of word sense ambiguity is addressed using a rule based approach. We tested the system and found that its reliability and performance are good. In total 250 sentences were considered for translation, of which 220 were translated correctly. We also calculated Precision, Recall and F-Measure, which are 88%, 93% and 90% respectively. The proposed system was compared with Google Translate and the performances were reported. In future, we plan to explore additional idioms and phrases and tense marker approaches, to determine whether a preposition is used in a spatial or temporal sense, and to make the system helpful for the task of predicting determiners, prepositions and other function words.

7. REFERENCES

[1] Michael Collins and James Brooks. (1995). "Prepositional phrase attachment through a Back-off model". In Proceedings of the Third Workshop on Very Large Corpora, pages 27-38.

[2] Sanda M. Harabagiu. (1996). "An application of WordNet to prepositional attachment". The Association of Computational Linguistics Anthology Network.

[3] I. Dan Melamed, Ryan Green and Joseph P. Turian. (2003). "Precision and Recall of Machine Translation". Proceedings of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 2, pages 61-63.

[4] Sudip Kumar Naskar and Sivaji Bandyopadhyay. (2006). "Handling of Prepositions in English to Bengali Machine Translation". Proceedings of the Third ACL-SIGSEM Workshop on Prepositions, Trento, Italy, Association for Computational Linguistics, pages 89–94.

[5] Dhanalakshmi V and Rajendran S. (2010). "Natural Language Processing Tools for Tamil Grammar Learning and Teaching". International Journal of Computer Applications (0975 – 8887), Volume 8, No. 14.

[6] Poornima C, Dhanalakshmi V, Anand Kumar M and Soman K P. (2011). "Rule based sentence simplification for English to Tamil Machine Translation System". International Journal of Computer Applications (0975 – 8887), Volume 25, No. 8.

[7] Dr. S. Saraswathi, P. Kanivadhana, M. Anusiya and S. Sathiya. (2011). "Bilingual Translation System". International Journal on Computer Science and Engineering, Volume 3, No. 3.

[8] Matt Post, Chris Callison-Burch and Miles Osborne. (2012). "Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing". Proceedings of the 7th Workshop on Statistical Machine Translation, Association for Computational Linguistics, pages 401-409.

[9] Lakshmana Pandian S and Kumanan Kadhirvelu. (2012). "Machine Translation from English to Tamil using Hybrid Technique". International Journal of Computer Applications (0975 – 8887), Volume 46, No. 16.

[10] Thiruumeni P G, Anand Kumar M, Dhanalakshmi V and Soman K P. (2012). "An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System". International Journal of Computer Applications (0975 – 8887), Volume 26, No. 10.

[11] Boxing Chen, Roland Kuhn and Samuel Larkin. (2012). "PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning". Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pages 930–939.
