Interactive English To Urdu Machine Translation Using Example-Based Approach
Abstract-This work is a first attempt at English-to-Urdu Machine Translation (MT) using an example-based approach. We have developed an interactive MT system that lets the user customize the translation to his or her needs, thereby improving the performance of the translation. Our MT system supports idioms, homographs, and some other features, in addition to the ability of the bilingual corpus to evolve. Finally, we compare the features of the MT system developed by the Center for Research in Urdu Language Processing (CRULP) with those of our MT system.

Keywords-Machine Translation; Bilingual Corpus

I. INTRODUCTION
Knowledge is the key to progress, and English is one language that preserves a tremendous amount of knowledge. A vast body of science and engineering literature is available in English, and English is rightly regarded as an international language. Being able to understand English is therefore increasingly important.

Urdu is spoken all over Pakistan, in many parts of India, and in some other South Asian countries [1]. This makes Urdu a very important language.

Only 5% to 10% of the people of Pakistan are familiar with English [2]. To keep pace with the world, the people of Pakistan must have access to the latest knowledge, but the current state of literacy and of English comprehension leaves a gulf that needs to be bridged. This suggests the need for an interface between the two languages, English and Urdu.

A solution comes in the form of Machine Translation (MT). MT refers to the use of computing resources to facilitate the translation of content available in one language into equivalent content in another language [3].

There are three main techniques for MT: Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Example-Based Machine Translation (EBMT). RBMT depends on linguistic rules to carry out translation from one language to another. It performs morphological, syntactic, and semantic analysis of the source language and then transfers the result into the target language [4]. SMT is a corpus-based approach that uses statistical models to carry out translation [5].

The third approach is EBMT. It was introduced by Nagao [6] in 1984. Using EBMT, translation is carried out by decomposing the English sentence into fragments, finding the corresponding translations of those fragments in Urdu, and then recombining the translated fragments into an Urdu sentence.

English and Urdu are structurally different languages [1]. For structurally different languages, EBMT is a better choice than the other approaches [7]. This motivates the use of the example-based approach for our MT system.

A quality translation cannot currently be produced unless the user gives feedback to the MT system, as discussed in [8] and [9]. Our MT system is therefore interactive: it takes the user's feedback to improve translation quality.

The paper is organized as follows. We present related work in Section 2 and our proposed method in Section 3, discuss the bilingual corpus in Section 4, and compare our work with that of CRULP in Section 5. Conclusions and future work are given in Section 6.

II. RELATED WORK
EBMT has been attempted for many languages, including English, French, Spanish, German, Japanese, Chinese, Turkish, Arabic, Indian languages, and even sign language; see for instance [10]–[16].

An English-to-Urdu MT system was recently developed by CRULP and is available online [17]. This MT system uses a rule-based technique, i.e. it relies on syntactic parse trees, a part-of-speech tagger, and grammatical rules for translation.

Work has also been done on a web-based interactive MT system [18]. The system is a chat-style MT. The idea, inspired by [8] and [9], is to provide broad-coverage machine translation that uses the user’s responses to improve the results.

To the best of our knowledge, no work is available in the literature on English-to-Urdu MT using the example-based approach. Moreover, interactive features are not available in any English-to-Urdu MT system. This has motivated us to build an interactive EBMT system for translation from English to Urdu. We also aim to overcome the discrepancies that are present in the CRULP MT system. The proposed method is discussed in the next section.
Was 1 1 2
Eating 2 2 1

The 1 in the bottom-right corner indicates that one operation is needed to convert the source text ‘is eating’ into the target text ‘was eating’.

A Levenshtein distance of 0 means an exact match. A positive finite Levenshtein distance means that a finite number of operations is required to get the target text from the input text. A threshold 0<T<n/2 means a closer match between two strings. We provide interactivity for strings having a closer match.

Semantic distance measures how close two strings are semantically. The semantic distance algorithm is an open source project [23]. This algorithm returns a score, a coefficient measuring the semantic closeness between two strings. A score of 1 means an exact match; a score 0.9<s<1 means a synonymous word, where closeness is towards 1. We provide interactivity for 0.8<s<0.98. These thresholds have been set after testing on some examples. They can be adjusted in some cases as suitable.

TABLE II. TRADE-OFF BETWEEN USING LEVENSHTEIN ALGORITHM AND SEMANTIC DISTANCE ALGORITHM

              Processing   Coverage
Levenshtein   Fast         Low
Semantic      Slow         High

SEARCHING ALGORITHM
Input: English Phrase
Output: OK or SENTENCE IS NOT SUPPORTED
Algorithm body:
For each phrase
    Compare the phrase with the set of phrases in the corpus
    If exact match found
        Return OK
    Else
        Show options similar to the phrase taken from Corpus
        If user selects an option
            Return OK
        Else
            Return NOT FOUND
        If ends
    If ends
For loop ends
If total number of OK responses divided by total phrases >= 0.75
    For each word of the phrase
        Find meaning of the word from the dictionary
        If all found
            Add the phrase to the corpus
            Return OK
        Else
            Return the found words’ meanings or words
        If ends
    For loop ends
Else
    Return SENTENCE IS NOT SUPPORTED
If ends
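The per-phrase part of the searching algorithm, together with the Levenshtein check it relies on, can be sketched as follows. This is a minimal illustration and not the system's actual code: the distance is computed over words (as the rows ‘Was’ and ‘Eating’ of the distance table suggest), n in the threshold 0<T<n/2 is assumed to be the number of words in the input phrase, and the names `levenshtein`, `search_phrase`, and `choose` are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def levenshtein(source: Sequence[str], target: Sequence[str]) -> int:
    """Edit distance between two word sequences, computed row by row."""
    previous = list(range(len(source) + 1))
    for i, t in enumerate(target, start=1):
        current = [i]
        for j, s in enumerate(source, start=1):
            cost = 0 if s == t else 1
            current.append(min(previous[j] + 1,          # insert target word
                               current[j - 1] + 1,       # delete source word
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]


def search_phrase(phrase: str,
                  corpus_phrases: Iterable[str],
                  choose: Callable[[List[str]], Optional[str]]) -> str:
    """Exact match first; otherwise offer close matches to the user."""
    phrases = list(corpus_phrases)
    if phrase in phrases:
        return "OK"
    tokens = phrase.split()
    n = len(tokens)  # assumption: n counts the words of the input phrase
    options = [p for p in phrases
               if 0 < levenshtein(tokens, p.split()) < n / 2]
    # `choose` stands in for the interactive step; None means no selection.
    if options and choose(options) is not None:
        return "OK"
    return "NOT FOUND"


corpus_phrases = ["he was eating", "for two hours"]
table_value = levenshtein("is eating".split(), "was eating".split())
exact = search_phrase("for two hours", corpus_phrases, lambda opts: None)
accepted = search_phrase("he is eating", corpus_phrases, lambda opts: opts[0])
rejected = search_phrase("he is eating", corpus_phrases, lambda opts: None)
```

Here `table_value` reproduces the bottom-right entry of the distance table for ‘is eating’ and ‘was eating’. The semantic distance check could be plugged in at the same point as the Levenshtein filter, at the cost of slower processing noted in Table II.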
a set of its Urdu equivalents is considered as follows.

E = Set of English phrases = $\{e_1, e_2, \ldots, e_n\}$

U1 = Set of Urdu translations of e1 = $\{u_{11}, u_{12}, \ldots, u_{1n_1}\}$

U = n-ary product of all the sets of Urdu translations of all English phrases = $U_1 \times U_2 \times \cdots \times U_n = \{(u_{11}, u_{21}, \ldots, u_{n1}), \ldots, (u_{1n_1}, u_{2n_2}, \ldots, u_{nn_n})\}$

We have two triplets (3-tuples) in this case, i.e. we have $n_1 \cdot n_2 \cdot n_3 = 1 \cdot 2 \cdot 1 = 2$ triplets (3-tuples).

For details of the n-ary product, consult any standard text in Discrete Mathematics; see for instance [24].

1) Ordering Rule 1
If $e_1, e_2, \ldots, e_n$ are the phrases of English whose Urdu equivalents are $u_1, u_2, \ldots, u_n$, then the translation of the sentence $[e_1\ e_2\ \ldots\ e_n]$ is given by $[u_1\ u_n\ u_{n-1}\ \ldots\ u_3\ u_2]$.
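The n-ary product and Ordering Rule 1 can be illustrated with a short sketch. The function names are hypothetical, and the subject entry ‘woh’ is an assumed single-element set added only so that the set sizes are 1, 2, and 1; the other two sets use Urdu phrases quoted in this paper.

```python
from itertools import product
from typing import List, Sequence, Tuple


def candidate_tuples(urdu_sets: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """All n-tuples in U1 x U2 x ... x Un, one Urdu phrase per English phrase."""
    return list(product(*urdu_sets))


def ordering_rule_1(urdu_phrases: Sequence[str]) -> List[str]:
    """Reorder [u1 u2 ... un] into [u1 un un-1 ... u3 u2]."""
    u = list(urdu_phrases)
    if not u:
        return []
    return [u[0]] + u[:0:-1]  # first phrase, then the rest in reverse


urdu_sets = [
    ["woh"],                               # assumed U1 (one translation)
    ["parrh raha hai", "parrh rahi hai"],  # U2: 'has been studying'
    ["do ghante se"],                      # U3: 'for two hours'
]

candidates = candidate_tuples(urdu_sets)   # 1 * 2 * 1 = 2 triplets
sentences = [" ".join(ordering_rule_1(c)) for c in candidates]
```

Each element of `sentences` is one candidate translation that can be offered to the user, for example ‘woh do ghante se parrh raha hai’ for the first triplet.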
Has been studying, ‘parrh raha hai’, ‘parrh rahi hai’,
For two hours, ‘do ghante se’.

• To pass the exam
‘Woh imtihan mein kamiabi ke liye mehnat kare gi’
EXAMPLE 5:
My aunt is a doctor and my uncle works in a factory that
makes television sets.
Figure 3. Screenshot for interactive ordering of translated text

IV. BILINGUAL CORPUS
The bilingual corpus is one of the most important parts of this tool, since the tool relies mainly on the corpus and a set of rules to combine the phrases in the corpus.

EXAMPLE 8:
You are eating; ‘aap kha rahe hain’, ‘tum kha rahe ho’ etc.
In this example, ‘you’ is the neutral singular word.

Since homographs are included in the corpus and gender differences are recognized, the corpus is not homogeneous in size, i.e. the number of columns is variable, because one English phrase may have many Urdu equivalents, as in Example 9.

EXAMPLE 9:
Enjoys cricket; ‘cricket pasand karta hai’, ‘cricket pasand karti hai’.
In this example, ‘enjoys cricket’ is stored in one row, and the corresponding equivalents are placed in the same row in different columns.

The tool has another interesting feature associated with corpus construction. A separate module ensures that no phrase is duplicated in the corpus, although one phrase can have more than one translation equivalent, as discussed above. However, if a phrase occurs more than once in the construction phase, its frequency is incremented, thereby improving its probability in the sense of Bayesian probability theory [28]. This idea is useful when scarce computing resources are at our disposal, as in the case of embedded devices.
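A corpus with one row per English phrase, a variable number of Urdu equivalents per row, and a frequency count for repeated phrases can be sketched as below. The class and method names are hypothetical, and the relative frequency used here is only one simple assumed way of turning the stored counts into a probability estimate; the paper does not specify the exact computation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CorpusEntry:
    translations: List[str] = field(default_factory=list)  # variable "columns"
    frequency: int = 0                                      # times the phrase was seen


class BilingualCorpus:
    def __init__(self) -> None:
        self.entries: Dict[str, CorpusEntry] = {}

    def add(self, english: str, urdu: str) -> None:
        """Store the phrase once; repeats only raise its frequency."""
        key = english.strip().lower()
        entry = self.entries.setdefault(key, CorpusEntry())
        entry.frequency += 1
        if urdu not in entry.translations:
            entry.translations.append(urdu)

    def relative_frequency(self, english: str) -> float:
        """Assumed estimate: the phrase's count over the sum of all counts."""
        total = sum(e.frequency for e in self.entries.values())
        entry = self.entries.get(english.strip().lower())
        return entry.frequency / total if entry and total else 0.0


corpus = BilingualCorpus()
corpus.add("Enjoys cricket", "cricket pasand karta hai")
corpus.add("Enjoys cricket", "cricket pasand karti hai")
corpus.add("You are eating", "aap kha rahe hain")

row = corpus.entries["enjoys cricket"]
share = corpus.relative_frequency("Enjoys cricket")
```

Keying the dictionary on the English phrase keeps each phrase in a single row, while the list of translations grows as new equivalents (for example gender variants) are added.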
The conspiracy was brought to light by policeman
‘saazish police ke afsar ke paas roshni ki taraf layi gayi’
‘saazish police afsar se manzar e aam par aayi’

EXAMPLE 11:
Input:
He has come of age today
CRULP MT system response:
‘woh aaj umer ka aaya hai’
Our MT system response:
‘woh aaj baaligh hua hai’

In these examples, the idioms ‘brought to light’ and ‘come of age’ have been considered. The CRULP MT system attempts to translate them word for word, whereas our MT system translates the idiomatic phrase correctly.

Examples illustrating homographs:

EXAMPLE 12:
Input:
• He gets an apple
• He gets an idea
CRULP MT system response:
‘use saib milta hai’
‘use khayal milta hai’
Our MT system response (as options):
Translation options for first sentence:
‘use mila aik saib’
‘use soojha aik saib’
Translation options for second sentence:
‘use soojha aik khayal’

Translation options for first sentence:
‘Woh kaam karta hai kinare mein’
‘Woh dariya ke kinare ke qareeb intizaar kar raha hai’
‘Woh daria ke bank ke qareeb intizaar kar raha hai’

In Example 12, the word ‘get’ is taken in two senses, i.e. to come into possession of, and to perceive. In Example 13, the word ‘bank’ has two meanings, i.e. bank as a financial institution, and bank as the slope beside a body of water. The CRULP MT system doesn’t support the multiple uses of the words ‘bank’ and ‘get’ in these examples, but our MT system provides such support.

Examples illustrating gender and words taken in the singular and plural senses together:

EXAMPLE 14:
Input:
They are playing in the garden
CRULP MT system response:
‘woh baagh mein khel rahe hain’
Our MT system response:
‘woh baagh mein khel rahe hain’
‘woh baagh mein khel rahi hain’

EXAMPLE 15:
Input:
It is his work
CRULP MT system response:
‘yeh uss ka kaam hai’
Our MT system response:
‘yeh uss ka kaam hai’
‘yeh unn ka kaam hai’
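The difference between word-for-word and phrase-level handling of idioms, as seen in Example 11, can be illustrated with a greedy longest-match segmentation. This is only an assumed mechanism for the sake of illustration; the paper does not state how the system locates idioms. The function name `segment` and the corpus entries below are hypothetical, with the Urdu strings taken from the responses quoted in Example 11.

```python
from typing import Dict, List


def segment(sentence: str, corpus: Dict[str, List[str]]) -> List[str]:
    """Split a sentence into the longest phrases found in the corpus."""
    words = sentence.lower().split()
    segments: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest span starting at word i first.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in corpus:
                segments.append(candidate)
                i = j
                break
        else:
            segments.append(words[i])  # unknown word is kept as is
            i += 1
    return segments


# Hypothetical corpus rows; the idiom is stored as one phrase.
corpus = {
    "he": ["woh"],
    "has come of age": ["baaligh hua hai"],
    "today": ["aaj"],
    "come": ["aaya"],
    "age": ["umer"],
}

segments = segment("He has come of age today", corpus)
options = [corpus.get(s, [s]) for s in segments]
```

Because the longer entry ‘has come of age’ is tried before the single words ‘come’ and ‘age’, the idiom is translated as a unit. Each element of `options` is the list of Urdu equivalents that could be shown to the user for that segment.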