Anusaaraka:A Better Approach To Machine Translation (A Case Study For English-Hindi/Telugu)
Akshar Bharati
Amba P Kulkarni
Dipti Misra Sharma
([email protected],[email protected])
1. Introduction:
Conventional Machine Translation systems are necessarily fragile. The main reason
behind this is the 'incommensurability' among natural languages. The incommensurability
arises because languages encode information partially and follow different coding
schemes.
Since languages in general do not code information completely, we can say that the text
in a language is like a 'picture' with some explicit 'key strokes'. It is the reader who fills
in the gaps by supplying the missing information and interprets it. The missing
information may involve various combinations of common sense, world knowledge,
language conventions, cultural background, and domain-specific knowledge.
In the above sentence, the kartaa (or agent) as well as the karma (or patient) is not
marked explicitly.
However, a native Hindi speaker does not have any difficulty in 'understanding' the
sentence.[1] Based on the 'yogyataa', he interprets it correctly: a person (Ram, in this
sentence) would be the agent of eating, and the fruit is the thing which is eaten.
[1] Here we give Hindi examples. However, the complete discussion also holds good for
other Indian languages, and in particular, Telugu.
When something against 'yogyataa' is to be stated, then language marks it through the
vibhakti explicitly as in the following example.
Thus the information coding is not simple. The process of interpretation, in general,
involves a complex interaction between various sources of information. Quite frequently
there is also ellipsis in sentences. This makes the interpretation more difficult.
1.1 Problems in translation:
2. Solution:
It should be noted that the reader has complete access to the world knowledge, common
sense, etc. that are necessary to interpret the text. Hence the system should be designed
keeping the reader at the centre.
One of the solutions adopted by MT developers is to provide a 'rough' translation in
cases of failure. However, the 'roughness' is not well defined, and hence this solution
does not serve the purpose.
The basic design features for such a system may be stated as:
-- make sure that complete information is available to the user at every stage.
-- separate the resources that can, in principle, be made reliable from those that are
inherently unreliable.
-- enable the human being to take charge of the situation, wherever important.
The conventional MT systems (fig 1) do the dictionary lookup only after complete
analysis of the source language text. Different linguistic tools used for the source
language analysis, however, are not 'perfect'. For example, POS taggers have only
95% to 97% accuracy. In other words, on an average 3 to 5 words out of every 100 are
marked with an incorrect tag. Once the decision is made at the POS level, the other
possibilities are filtered out. Therefore, after the dictionary look-up what the reader gets
is the filtered sense. Since the filtering is done by the machine, the user does not have
any 'control' over the filtering process.
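To see how per-word tagger errors compound over a whole sentence, a rough back-of-the-envelope calculation helps (this simplifying sketch assumes tagging errors are independent across words, which is not strictly true in practice):

```python
# Rough estimate: probability that EVERY word in a sentence receives
# the correct POS tag, assuming independent per-word errors.
def sentence_accuracy(per_word_accuracy, sentence_length):
    return per_word_accuracy ** sentence_length

# At 95% per-word accuracy, a 20-word sentence is tagged fully
# correctly only about 36% of the time; at 97%, about 54%.
print(round(sentence_accuracy(0.95, 20), 2))
print(round(sentence_accuracy(0.97, 20), 2))
```

This makes concrete why filtering on tagger output alone, with no user control, is risky: a seemingly high per-word accuracy still leaves most long sentences with at least one wrong tag.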
(fig 1)
Secondly, even if the developer would like to present all the possibilities to the reader, it
is difficult to present the complete information from the analysis to the user in a compact form.
We describe below the architecture of anusaaraka engine (fig 2) which is based on the
above guidelines.
3. Core anusaaraka engine:
(fig 2)
3.1. Word Level Substitution
At this level we provide the gloss for each source language word in the target
language. The polysemous words, however, are a major source of problems.
When there is no one-to-one mapping, it is not practical to list all the meanings.
Recall that, on the other hand, anusaaraka aims at providing complete
information to the reader. The question now is: how can this be guaranteed at
word-level substitution?
To seek the solution, first let us see why a native speaker does not find it odd to have so
many seemingly different meanings of a word. If we look at the various usages of any
polysemous word, we observe that these polysemous words often have a core meaning
and other meanings are natural extensions of this meaning. In anusaaraka we try to relate
all these meanings and show their relationship by means of a formula. We call this
formula a padasutra[2]. (The concept of padasutra is based on the concept of pravrutti-
nimitta from Indian traditional grammars.)
The word padasutra itself has two meanings:
The English word 'leave' as a noun means 'chutti' and as a verb 'chodanaa' in Hindi, and it
is obvious that chodanaa and chutti are related.
leave: chutti[>chodanaa]
Here 'a>b' stands for 'b is derived from a', and 'a[b]' roughly stands for 'a or b'.
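The notation above can be sketched programmatically. The following is a minimal, hypothetical representation of a padasutra entry (the function name and data layout are illustrative assumptions, not the actual anusaaraka implementation):

```python
# Hypothetical sketch of a padasutra entry: a core meaning plus an
# optionally derived meaning, rendered in the 'core[>derived]'
# notation described in the text ('>' = derivation, '[...]' = "or").
def format_padasutra(core, derived=None):
    """Render e.g. ('chutti', 'chodanaa') as 'chutti[>chodanaa]'."""
    if derived is None:
        return core
    return f"{core}[>{derived}]"

print("leave:", format_padasutra("chutti", "chodanaa"))
```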
At this level some of the English words, such as function words and articles, are not
substituted. The reason is that they are either highly ambiguous, or there is a lexical gap in
Hindi corresponding to the English word (e.g. articles), or substituting them may lead to
catastrophe.
To understand the output thus produced, a human being needs some training. Thus if a
user is willing to put in some effort, he has complete access to the original text. The effort
required here is that of making correct choices based on the common sense, world
knowledge, etc.
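The word-level substitution layer described above can be sketched as a simple gloss lookup. The tiny dictionary and function-word list below are illustrative assumptions, not the actual anusaaraka lexicon:

```python
# Minimal sketch of word-level substitution: content words receive
# their target-language padasutra gloss; function words and articles
# are passed through unchanged, as described in the text.
GLOSS = {
    "leave": "chutti[>chodanaa]",   # illustrative entries only
    "eats": "khaataa",
    "fruit": "phala",
}
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "is", "are"}

def substitute(sentence):
    out = []
    for word in sentence.lower().split():
        if word in FUNCTION_WORDS:
            out.append(word)                 # not substituted
        else:
            out.append(GLOSS.get(word, word))
    return " ".join(out)

print(substitute("Ram eats the fruit"))
```

Since every content word is carried over with its full padasutra, no information is filtered out at this layer, which is what makes the output (in principle) reversible.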
This layer produces an output which is a rough translation that systematically
differs from Hindi. Since the output is generated following certain principles, the chances
of being misled are less. Theoretically, the output at this layer is reversible.
Thus, by dividing the workload between man and machine and adopting the concept of
padasutra (word formula), we guarantee that the first-level output is faithful to the original
and also acts as a safety net when later modules fail.
The POS taggers can help in WSD when the ambiguity is across POSs.
In the first sentence, 'chairs' is used as a verb, and in the second, as a noun. Therefore,
to decide the meaning of 'chairs' in the respective contexts, one can rely on the POS tags
alone.
The POS taggers mark the words with appropriate POS tags. These taggers use certain
heuristic rules, and hence may sometimes go wrong. The reported performances of these
POS taggers vary between 95% and 97%. However, they are useful, since they reduce the
search space for meanings substantially.
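The 'chairs' example above amounts to a (word, POS tag)-keyed lookup. The following is a toy sketch of that idea; the Hindi glosses and the Penn-Treebank-style tags (VBZ for a present-tense verb, NNS for a plural noun) are illustrative assumptions:

```python
# Sketch of POS-based sense selection: when the ambiguity is across
# parts of speech, the tag alone picks the gloss. Glosses here are
# illustrative, not from the actual anusaaraka dictionaries.
SENSES = {
    ("chairs", "VBZ"): "adhyakshataa karataa hai",  # verb: presides over
    ("chairs", "NNS"): "kursiyaan",                 # noun: chairs
}

def disambiguate(word, pos_tag):
    # Fall back to the word itself when no sense rule applies.
    return SENSES.get((word, pos_tag), word)

print(disambiguate("chairs", "VBZ"))
print(disambiguate("chairs", "NNS"))
```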
Anusaaraka uses this tool for developing word sense disambiguation rules semi-
automatically.
The output produced at this stage is irreversible, since the machine makes choices based
on heuristics.
Since Indian languages allow a certain amount of freedom in the order of words, the
anusaaraka output at the previous layer makes sense to the Indian-language reader. However,
since this output is not natural to the Indian languages, one may not enjoy it as much as one
would with the natural order. Also, it would not be treated as a translation. Therefore, in this
module our attempt is to generate the correct word order of the target language.
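The core of this reordering is the shift from English SVO (subject-verb-object) order to the SOV order of Hindi and Telugu. The following toy rule illustrates only that single shift; real reordering needs full syntactic analysis, and the example tokens are assumptions:

```python
# Toy sketch of the word-order module: English is SVO while Hindi
# and Telugu are SOV, so the verb group is moved after the object.
# This handles only a flat (subject, verb, object) triple.
def svo_to_sov(subject, verb, obj):
    return [subject, obj, verb]

print(" ".join(svo_to_sov("raama", "khaataa hai", "phala")))
```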
Since anusaaraka presents the complete information that is available in the source
language text, using target language vocabulary, one may list the following advantages:
-- It points out the ways that different languages encode information and the
amount of information coded.
-- It helps in identifying sources of information, and thereby helps to figure out, in a
scientific way, what is in principle possible and what is not.
-- It brings into focus those phenomena that are of prime importance from the
translation point of view.
5. Implementing English-Telugu anusaaraka:
1. Padasutras for ambiguous words (can be prioritized in terms of highly ambiguous and
most frequent ones to begin with),
2. English-Telugu bilingual dictionary for content words as well as function words,
3. Word Sense disambiguation rules,
4. Preposition-vibhakti mapping rules, and
5. Telugu generator
Since Indian languages share certain commonalities in vocabulary, it is possible that a
number of the WSD rules developed for Hindi will also work for Telugu.
In that case, the English words just need to be substituted by appropriate Telugu words.
7. Conclusion:
References:
[1] Akshar Bharati, Vineet Chaitanya, Dipti Misra, Amba Kulkarni. Modern
Technologies for Language Access: An Aid to Read English in the Indian Context. Osmania
Papers in Linguistics, Ed. V. Swarajya Lakshmi, pp. 111-126.
Appendix-I
Sample English-Telugu anusaaraka output Layer1
Row 1: Original English sentence
Row 2: Word level substitution
Least fragile layer.
Contains the Telugu padasutra (word formula) for each English word.
E.g. Small -> cinna^takkuva
rats -> eluku/svaamii~drohamu~ceyyi
Row 3: Word Grouping
A group of words with a new meaning (e.g. compounding). For example, in the above
sentence, are + ing = tunnaa
Row 4: Word Sense Disambiguation
Attempts to select the appropriate sense according to the context.
For example, the big cats -> vyaaghramu
Row 5: Preposition Movement
The prepositions are moved to their correct Telugu positions E.g. '->lona-- adivi' is
changed to '--- --- adivi+lona'.
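The preposition movement of Row 5 can be sketched as turning a preposition, which lands before the noun in word-for-word substitution, into a postpositional suffix on that noun. The tokens follow the appendix example; the function itself is an illustrative assumption:

```python
# Sketch of Row 5 (preposition movement): the vibhakti marker that
# word-for-word substitution leaves BEFORE the noun is moved AFTER
# the noun and attached with '+', e.g. 'lona adivi' -> 'adivi+lona'.
def move_preposition(vibhakti, noun):
    return f"{noun}+{vibhakti}"

print(move_preposition("lona", "adivi"))
```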