CHAPTER 1
INTRODUCTION
Nepali is the official and most widely spoken language of Nepal. It belongs to the Indo-Aryan
branch of the Indo-European language family and is written in the Devanagari script. There are
11 vowels and 33 consonants in the Nepali language [3]. Machine translation (MT) refers to the
use of computer algorithms and artificial intelligence to automatically translate content from
one language to another [2]. Today's MT systems already achieve more than 90% accuracy,
which opens the door for the adoption and use of highly reliable translation systems [5].
Machine translation has many potential applications, particularly in fields such as business,
government, and academia. It can help to facilitate communication between individuals who
speak different languages, enabling them to exchange information and ideas more effectively.
It can also assist with the translation of documents, web pages, and other written materials,
allowing Nepali-speaking individuals to access information in English and vice versa.
An encoder-decoder recurrent neural network with long short-term memory (LSTM) can be
used to achieve close to state-of-the-art output in machine translation [2]. Plain RNNs were
initially used to model source-to-target sentence sequences, but their gradients tend to
explode or vanish once sequence length grows beyond a certain threshold, so they were
eventually replaced by LSTMs. Networks with LSTM cells handle longer sequences well, but
they are harder to train because they must ingest their input serially, and they therefore
underutilize the parallelization ability of GPUs [1]. EN-NP translation poses further
challenges, for instance (i) the limited availability of parallel corpora and (ii) morphological
richness and word-order variation due to syntactic divergence: English follows a
Subject-Verb-Object (SVO) order, whereas Nepali follows Subject-Object-Verb (SOV) [13].
1.1 BACKGROUND OF THE STUDY
Due to globalization, people from different parts of the world interact with each other,
making communication between speakers of different languages more common. As a result,
there is a growing need for automatic translation systems that can help bridge the language
barrier between different languages. Machine translation (MT) systems have been developed
to address this need.
A large amount of work has been reported on machine translation. Researchers have
proposed various approaches, such as rule-based, corpus-based, and hybrid approaches [13].
A rule-based approach in MT is a traditional approach that uses a set of predefined linguistic
rules and algorithms to translate text from one language to another. These rules are based on
linguistic knowledge and expertise, and they are usually developed by linguists and language
experts. The corpus-based approach is used in statistical machine translation (SMT) systems
and neural machine translation (NMT) systems. In SMT systems, the approach involves the
use of statistical models to learn the relationship between the source and target languages
based on the analysis of parallel corpora. These models are then used to predict the best
translation for a given input sentence based on the probability of each possible translation. In
NMT systems, the corpus-based approach involves the use of deep neural networks to learn
the relationship between the source and target languages based on the analysis of parallel
corpora. The neural network is trained on a large amount of parallel corpora to learn how to
predict the target language sentence for a given source language sentence. The hybrid
approach can combine rule-based and neural-based methods to create a system that uses the
best features of each approach. For example, a hybrid system might use a neural-based
approach to handle overall sentence structure and word choice, and a rule-based approach to
handle specific grammar rules.
Unsupervised machine translation is a newer approach that works without a parallel corpus,
but its results are still not remarkable. NMT, on the other hand, is an emerging technique
that has shown significant improvement in translation results.
1.2 STATEMENT OF PROBLEM
Despite significant progress in machine translation over the years, there are still major
challenges in developing accurate and efficient translation systems for low-resource
languages such as Nepali. Nepali is a morphologically rich language with complex sentence
structures, and translating it to and from English requires a deep understanding of its
grammar and syntax. While existing machine translation systems have shown promising
results, they still struggle to produce translations that are natural-sounding and contextually
appropriate.
Although RNNs with LSTM are a popular approach to machine translation, they have limited
capacity to capture long-range dependencies in the input sequence. They can suffer from the
vanishing gradient problem, where gradients become very small during backpropagation,
leading to slow convergence and difficulty in learning long-term dependencies.
1.3 RESEARCH OBJECTIVES
NMT requires a large parallel corpus with diversified context to reach state-of-the-art
performance, and such a corpus is not available for the low-resource EN-NP language pair.
The solution is either to collect more data manually or to perform unsupervised data
augmentation. Manual preparation of data is time-consuming and requires the involvement of
bilingual experts, so data augmentation is the more viable approach.
The main objective of the thesis work is:
To improve the translation quality of the EN-NP language pair using the Transformer model.
CHAPTER 2
LITERATURE REVIEW
In [1], Akshara Kandimalla et al. investigate the utility of the back-translation method and
its effect on NMT system performance. Their experimental evaluation reveals that
back-translation helps to improve BLEU scores for both English-to-Hindi and
English-to-Bengali NMT systems: from 33.45 to 33.72 for English-Hindi and from 11.58 to
11.99 for English-Bengali. Their experiments also show that back-translation benefits weaker
MT models more than already strong ones.
In [2], Kriti Nemkul et al. developed an RNN-LSTM model with attention. An encoder-decoder
with LSTM cells, attention, two neural network layers, and 256 hidden units was found to
perform best at translating English to Nepali sentences, with a highest BLEU score of 8.9.
This is lower than the results of Guzmán et al., who achieved 15.1 and 7.6 in the NP-EN and
EN-NP directions respectively.
In [3], Acharya and Bal present a comparative analysis of SMT and NMT for the English-Nepali
language pair. They achieved BLEU scores of 5.27 for SMT and 3.28 for NMT, using a parallel
corpus of only 7K sentences. The low BLEU scores were due to the insufficient size of the
parallel corpus.
In [5], Chaudhari et al. developed a Tamang-Nepali MT system using the Transformer model,
a general sequence-to-sequence model with self-attention. Using a parallel corpus of 15K
sentences, they achieved BLEU scores of 27.74 for the Nepali-Tamang and 23.74 for the
Tamang-Nepali direction.
In [14], S.K. Bista et al. developed a rule-based MT system called Dobhase. The system takes
a sentence in the source language, analyzes and parses it to form a parse tree, generates the
syntax of the target language, and finally applies morphology generation rules to form the
sentence in the target language. Later work was undertaken to enhance the existing Dobhase
system, but even after the improvements it was still unable to resolve ambiguous words,
handle multiple conjunctions, or handle single-word joiners.
In [15], S.K. Bista et al. developed a Nepali deconverter for the Universal Networking
Language (UNL). UNL is an interlingua proposed by the United Nations University/Institute of
Advanced Studies, Tokyo, Japan, to remove the language barrier and digital divide on the
World Wide Web. UNL is an artificial digital language that represents meaning sentence by
sentence, in logical form. Such logically formed expressions can be viewed as a semantic net
or a directed acyclic hypergraph.
In [16], GNMT, Google's Neural Machine Translation system, consists of a deep LSTM
network with 8 encoder and 8 decoder layers, using residual connections as well as attention
connections from the decoder network to the encoder. To improve parallelism and therefore
decrease training time, the attention mechanism connects the bottom layer of the decoder to
the top layer of the encoder.
CHAPTER 3
RESEARCH METHODOLOGY
3.1 DATASET
Parallel corpora are considered the most valuable training resource for natural language
translation. A parallel corpus can be aligned in various ways: at the word/token level,
sentence level, paragraph level, or document level. Word/token-level alignment carries no
context, sentence-level alignment carries minimal context, paragraph-level alignment carries
more context, and document-level alignment carries full context. This thesis work will use
the dataset from [17], which contains 200K English-Nepali parallel sentence pairs aligned at
the sentence level. Besides the already available dataset, this thesis work will also
contribute to the parallel corpus, collected from various sources such as news articles,
books, general articles, laws, plans, and government reports [5].
Fig. 3.1: Steps for collecting the parallel corpus.
3.1.1 Document Collection
The first step is to identify sources that contain a large number of English and Nepali
documents. These sources can include government websites, news outlets, social media
platforms, and online books and journals [19, 20, 21] etc.
3.1.2 Splitting
Use delimiters: the full stop (.) for English and the purnabiram (।) for Nepali sentences.
This delimiter splits Nepali sentences cleanly, since in Nepali the purnabiram is only used
to end a sentence, whereas the period symbol has several other uses in English, for example
in salutations, acronyms, Latin borrowings, and numbering: Mr., Mrs., Dr., Rs., 1.1.2, a.,
and so on. Regular expressions will be used to handle those cases [17].
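The splitting rule described above can be sketched in Python. The abbreviation list and the masking trick are illustrative assumptions for this sketch, not the thesis's actual regex rules:

```python
# Common English abbreviations whose trailing period does not end a
# sentence (an illustrative list, not the thesis's actual rule set).
ABBREVS = ["Mr.", "Mrs.", "Dr.", "Rs.", "e.g.", "i.e."]

def split_english(text):
    """Split English text into sentences at full stops."""
    # Temporarily mask abbreviation periods so the split ignores them.
    for abbr in ABBREVS:
        text = text.replace(abbr, abbr.replace(".", "<DOT>"))
    parts = [p.strip() for p in text.split(".") if p.strip()]
    return [p.replace("<DOT>", ".") for p in parts]

def split_nepali(text):
    """Split Nepali text at the purnabiram (U+0964), which only ends sentences."""
    return [s.strip() for s in text.split("\u0964") if s.strip()]
```

Nepali splitting needs no exception handling precisely because the purnabiram is unambiguous; all the complexity sits on the English side.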
3.1.3 Cleaning
Repeated sentence pairs will be removed from both files. Lines that do not match their
translation in the parallel file, even when the lines above and below them are properly
aligned, should also be removed; a length-similarity check will be applied for that purpose.
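A minimal sketch of the cleaning step, assuming a simple word-count ratio as the length-similarity test (the 2.5 threshold is an illustrative choice, not a value from the thesis):

```python
def clean_parallel(src_lines, tgt_lines, max_ratio=2.5):
    """Drop duplicate pairs and pairs whose lengths diverge too much.

    Pairs whose word-count ratio exceeds max_ratio are assumed to be
    misaligned and are removed (the length-similarity check).
    """
    seen = set()
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        pair = (src.strip(), tgt.strip())
        if not pair[0] or not pair[1] or pair in seen:
            continue                      # empty line or repeated pair
        seen.add(pair)
        ls, lt = len(pair[0].split()), len(pair[1].split())
        if max(ls, lt) / min(ls, lt) > max_ratio:
            continue                      # length-similarity check failed
        kept_src.append(pair[0])
        kept_tgt.append(pair[1])
    return kept_src, kept_tgt
```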
3.2 TRAINING
3.2.1 Normalization
Text normalization is the process of transforming text into a simple, standard form.
Bringing the text into a standard form reduces the unnecessary variation the algorithm has
to deal with and, as a result, increases the efficiency of the translation. Stemming and
lemmatization are examples of normalization.
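As an illustration, a minimal normalization pass might look like the following; the exact steps are a design choice, and Unicode NFC normalization is particularly relevant for Devanagari, where the same character can be encoded in more than one way:

```python
import unicodedata

def normalize(text):
    """Bring raw text into a simple, standard form before training."""
    # Canonical Unicode form: unifies composed/decomposed Devanagari codepoints.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace into single spaces.
    text = " ".join(text.split())
    # Lowercasing affects only the English side (Devanagari has no case).
    return text.lower()
```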
3.2.2 Tokenization
Tokenization splits paragraphs and sentences into sequences of smaller units (words) or
tokens. Transformer-based NMT models process the raw text at the token level.
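A minimal word-level tokenizer can be sketched with a regular expression; note that production Transformer NMT systems usually rely on subword tokenizers (BPE or SentencePiece) instead, so this is only an illustration of the token-level view:

```python
import re

def tokenize(sentence):
    """Minimal word-level tokenizer: words and punctuation become tokens.

    \\w+ matches runs of word characters (this covers Devanagari letters
    under Python's default Unicode matching); any remaining non-space
    symbol becomes its own punctuation token.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)
```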
3.2.3 NMT Setup
For training we will use the Transformer model. It still maintains the encoder-decoder
architecture, but everything else about it differs from an RNN. A major difference is how
the models consume their input-output pairs: an RNN is a recurrent mechanism that takes a
whole sequence, develops hidden states, and uses a decoder in tandem with those hidden
states to generate a sentence, all in a single iteration, whereas the Transformer is not a
recurrent technique and generates the translation one token per iteration. In Figure 3.2.3
the left part is the encoder and the right part is the decoder. After the input and output
are fed into the model, embeddings are generated, and positional encoding uses trigonometric
functions to encode the position of each token [18]. Both blocks perform multi-head
attention on their respective inputs. Three important tensors come out of them: values and
keys from the encoder, and queries from the masked multi-head attention in the decoder. The
multi-head attention on the encoder side returns a tensor called values (V) containing the
most interesting features of the source sentence. Keys (K) is also a tensor; it contains
information about the positions of the entries in V. (Again, token position matters in the
Transformer precisely because it is not recurrent.) Queries (Q) is the output of the masked
multi-head attention.
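The sinusoidal positional encoding mentioned above can be reproduced directly from its defining formula in [18]; this is a pure-Python sketch for illustration, not the thesis implementation:

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from [18].

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):   # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

Because each dimension oscillates at a different frequency, every position receives a distinct pattern, which is how a non-recurrent model recovers word order.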
Fig. 3.2.3: The Transformer model architecture [18].
The dot product of K and Q returns positional information for the target. After scaling and
a softmax over the dot product, the result is multiplied with V. These operations happen
concurrently in many layers, and finally the word probabilities are output. The matrix of
outputs is given by:
    Attention(Q, K, V) = softmax( QK^T / √d_k ) V
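The attention formula can be checked with a small pure-Python sketch; list-of-lists matrices are used only to keep the example dependency-free:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]                     # transpose of K
    scores = matmul(Q, K_T)                                  # Q K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]               # attention weights
    return matmul(weights, V)                                # weighted sum of V
```

With a query strongly aligned to one key, the softmax weights concentrate on that key's value row, which is the "soft lookup" behavior the formula describes.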
The complexity per layer of the Transformer model is O(n²·d), where n is the length of the
sequence and d is the representation dimension. For RNNs, the complexity is O(n·d²). A
restricted Transformer model has a complexity of O(r·n·d), where r is the size of the
neighborhood [18]. The Transformer model thus has lower computational complexity than an
LSTM, and it works well with longer sentences.
CHAPTER 4
TENTATIVE TIME SCHEDULE
The work is planned between 01-Jan and 30-Jul, with milestones at 31-Jan, 02-Mar, 01-Apr,
01-May, 31-May, and 30-Jun:
Identify Research Area
Proposal Defense
Data Collection
System Design and Coding
Mid-Term Defense
Data Analysis and Interpretation
Documentation
Final Defense and Submission
CHAPTER 5
RESULTS AND DISCUSSION
5.1 EXPECTED OUTCOME
The expected outcome of the Transformer model would be the generation of grammatically
correct and semantically meaningful Nepali sentences from English text. The Transformer
model is a state-of-the-art approach to machine translation that uses self-attention
mechanisms to capture long-range dependencies and overcome some of the limitations of
rule-based, statistical, and LSTM methods. The quality of the output from English to Nepali
machine translation using the Transformer model would depend on various factors, including
the size and quality of the training data, the complexity of the English text, and the accuracy
of the model architecture and training process. However, in general, the Transformer model is
expected to produce higher quality translations than previous machine translation methods,
due to its ability to handle variations in word order and sentence structure, and its capacity to
learn and generalize from large amounts of data.
5.2 PROPOSED VALIDATION CRITERIA
5.2.1 RESULT ANALYSIS USING BLEU
BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating the quality
of machine translation outputs. It measures the similarity between a machine-generated
translation and one or more human-generated reference translations.
The BLEU evaluation method requires two ingredients: a numerical "translation closeness"
metric and a corpus of good-quality human reference translations [4].
The BLEU metric calculates a score between 0 and 1, with 1 indicating a perfect match
between the machine translation and the human reference translation. It works by comparing
the n-gram overlap between the machine translation and the reference translation, where n is
typically 1, 2, 3, or 4.
We first compute the geometric average of the modified n-gram precisions p_n, using n-grams
up to length N and positive weights w_n summing to one. We then compute the brevity
penalty BP:

    BP = 1             if c > r
    BP = e^(1 - r/c)   if c ≤ r        (1)

Then,

    BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )        (2)

The ranking behavior is more immediately apparent in the log domain:

    log BLEU = min(1 - r/c, 0) + Σ_{n=1}^{N} w_n log p_n        (3)

where
r : length of the effective reference corpus,
c : length of the candidate translation [4].
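Equations (1) and (2) can be implemented directly. This sketch assumes uniform weights w_n = 1/N and a single reference for a single sentence pair, whereas BLEU proper is defined over a whole corpus with possibly multiple references:

```python
import math
from collections import Counter

def bleu(candidate, reference, N=4):
    """Sentence-level BLEU sketch implementing equations (1) and (2).

    candidate and reference are token lists; weights are uniform (1/N).
    """
    precisions = []
    for n in range(1, N + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Modified (clipped) n-gram precision p_n.
        clipped = sum(min(count, ref_ngrams[g])
                      for g, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0                  # log p_n undefined; the score is zero
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty, eq. (1)
    return bp * math.exp(sum(math.log(p) / N for p in precisions))  # eq. (2)
```

A perfect match scores 1.0, and any sentence missing all n-grams of some order scores 0.0, matching the 0-to-1 range described above.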
REFERENCES
[1] Akshara Kandimalla, Pintu Lohar, Souvik Kumar Maji, Improving English-to-Indian
Language Neural Machine Translation Systems.
[2] Kriti Nemkul, Subarna Shakya, Low Resource English to Nepali Sentence Translation using
RNN-Long Short-Term Memory with Attention.
[3] Praveen Acharya, Bal Krishna Bal, Comparative Study of SMT and NMT: Case Study of
English-Nepali Language Pair.
[4] Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, BLEU: a Method for Automatic
Evaluation of Machine Translation.
[5] Binay K. Chaudhari, Bal Krishna Bal, Rasil Baidar, Efforts towards Developing a Tamang
Nepali Machine Translation System.
[6] Michael Przystupa, Muhammad Abdul-Mageed, Neural Machine Translation of Low-Resource
and Similar Languages with Backtranslation.
[7] Sainik Kumar Mahata, Dipankar Das, Sivaji Bandyopadhyay, Machine Translation using
Recurrent Neural Network on Statistical Machine Translation.
[8] Jordi Armengol-Estapé, Marta R. Costa-jussà, Carlos Escolano, Enriching the Transformer
with Linguistic Factors for Low-Resource Machine Translation.
[9] Abhijit Paul, Bipul Syam Purkayastha, English to Nepali Statistical Machine Translation
System.
[10] Francisco Guzmán et al., The FLORES Evaluation Datasets for Low-Resource Machine
Translation: Nepali-English and Sinhala-English.
[11] Birendra Keshari, Sanat Kumar Bista, UNL Nepali Deconverter.
[12] Rico Sennrich, Barry Haddow, Alexandra Birch, Improving Neural Machine Translation
Models with Monolingual Data.
[13] Himanshu Chaudhary, Shivansh Rao, Rajesh Rohilla, Neural Machine Translation for
Low-Resourced Indian Languages.
[14] S.K. Bista, B. Keshari, J. Bhatta, K. Parajuli, Dobhase: Online English to Nepali
Machine Translation System. In Proceedings of the 26th Annual Conference of the Linguistic
Society of Nepal, December 2005.
[15] S.K. Bista, Birendra Keshari, UNL Nepali Deconverter.
[16] Yonghui Wu et al., Google's Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation.
[17] Sharad Duwal, Bal Krishna Bal, Efforts in the Development of an Augmented
English–Nepali Parallel Corpus.
[18] Ashish Vaswani et al., Attention Is All You Need.
[19] https://npc.gov.np/
[20] https://lawcommission.gov.np/en/
[21] https://www.olenepal.org/e-pustakalaya/