Fine-Tuning a Neural Machine Translation Model from English to
Hindi Language
submitted in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

May 2025
DECLARATION
I, N. Sri Abhinav, bearing hall ticket number 23P65A0517, hereby declare that the
industrial oriented mini project report entitled “Fine-Tuning a Neural Machine
Translation Model from English to Hindi Language”, carried out under the guidance of
Mr. Syed Noor Mohammed, Assistant Professor, Department of Computer Science and
Engineering, Vignana Bharathi Institute of Technology, Hyderabad, has been submitted
to Jawaharlal Nehru Technological University Hyderabad, Kukatpally, in partial
fulfilment of the requirements for the award of the degree of Bachelor of Technology in
Computer Science and Engineering.
This is a record of bonafide work carried out by me, and the results embodied in this
project have not been reproduced or copied from any source. The results embodied in this
project report have not been submitted to any other university or institute for the award
of any other degree or diploma.
Aushapur (V), Ghatkesar (M), Hyderabad, Medchal – Dist, Telangana – 501 301.
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
This is to certify that the industrial oriented mini project titled “Fine-Tuning a Neural
Machine Translation Model from English to Hindi Language”, submitted by
N. Sri Abhinav (23P65A0517), B.Tech III-II semester, Department of Computer
Science & Engineering, is a record of the bonafide work carried out by the candidate.
The design embodied in this report has not been submitted to any other university for
the award of any degree.
EXTERNAL EXAMINER
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
Self-confidence, hard work, commitment, and planning are essential to carry out any
task. Possessing these qualities is a sheer waste if an opportunity does not exist. So, I
wholeheartedly thank Dr. P.V.S. Srinivas, Principal, and Dr. Dara Raju, Head of the
Department, Computer Science and Engineering, for their encouragement, support, and
guidance in carrying out the project.
I would like to express my indebtedness to the Overall Project Coordinator, Dr. M.
Venkateswara Rao, Professor, and the Section Coordinators, Ms. P. Suvarna Puspha,
Associate Professor, and Ms. A. Manasa, Associate Professor, Department of CSE, for their
valuable guidance during the course of the project work.
I thank our Project Guide, Mr. Syed Noor Mohammed, Assistant Professor,
Department of Computer Science and Engineering, for providing me with an excellent
project and guiding me in completing my mini project successfully.
I would like to express my sincere thanks to all the staff of Computer Science and
Engineering, VBIT, for their kind cooperation and timely help during the course of my
project. Finally, I would like to thank my parents and friends who have always stood by
me whenever I was in need of them.
ABSTRACT
VISION
MISSION
DM-1: Provide a rigorous theoretical and practical framework across state-of-the-art
PEO-05: Lifelong Learning: Recognize the significance of independent
learning to become experts in chosen fields and broaden professional
knowledge.
PO-04: Conduct investigations of complex problems: Use research-based knowledge
and research methods including design of experiments, analysis and interpretation of
data, and synthesis of the information to provide valid conclusions.
PO-05: Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and
modelling to complex engineering activities with an understanding of the
limitations.
PO-12: Life-long learning: Recognize the need for, and have the preparation
and ability to engage in independent and life-long learning in the broadest
context of technological change.
Project Mapping Table:
a) PO Mapping:
Project: Fine-Tuning a Neural Machine Translation Model from English to Hindi Language
PO1: 3, PO2: 3, PO3: 3, PO4: 2, PO5: 3, PO6: 2, PO7: 1, PO8: 2, PO9: 2, PO10: 3, PO11: 2, PO12: 3
b) PSO Mapping:
List of Figures
S.No. Title Page No.
3 Activity Diagram 17
4 Class Diagram 17
5 Architecture Diagram 18
6 Component Diagram 18
7 Output of finetuned model 30
8 Import and use the fine-tuned model 30
List of Tables
2 Hardware Requirements 10
3 Software Requirements 10
4 Test case 28
Nomenclature
TABLE OF CONTENTS
CONTENTS PAGE NO
Declaration ii
Certificate iii
Acknowledgements iv
Abstract v
List of Figures x
List of Tables xi
Nomenclature xii
CHAPTER 1:
INTRODUCTION 1-4
1.1 Introduction 2
1.2 Motivation 2
1.6 Objective 4
1.7 Scope 4
CHAPTER 2:
LITERATURE SURVEY 5-9
CHAPTER 3:
REQUIREMENT ANALYSIS 10-13
3.1. Operating Environment 11
CHAPTER 4:
SYSTEM ANALYSIS & DESIGN 14-20
CHAPTER 5:
IMPLEMENTATION 21-27
5.3 Modules 23
CHAPTER 6:
TESTING & VALIDATION 28-30
6.1 Testing process 29
6.2. Test planning 29
6.3. Test design 29
6.4. Test execution 30
6.5. Test reporting 30
6.6. Test cases 30
CHAPTER 7:
OUTPUT SCREENS 31-33
CHAPTER 8:
CONCLUSION AND FUTURE SCOPE 34-36
8.1 Conclusion 35
8.2 Future Enhancement 35
REFERENCES 37
CHAPTER – 1
INTRODUCTION
1.1. INTRODUCTION
This project proposes the fine-tuning of a pre-trained NMT model, specifically designed to
enhance the translation of informal English into culturally and linguistically accurate Hindi.
Leveraging the transfer learning capabilities of transformer-based models like mBART or
MarianMT, and using tools like Hugging Face Transformers, this system aims to deliver
translations that sound natural to native speakers and maintain the original tone and intent of
the source sentence.
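As an illustration of this transfer-learning setup, the following minimal sketch loads a pre-trained mBART-50 checkpoint with Hugging Face Transformers and produces a baseline (not yet fine-tuned) English-to-Hindi translation. The checkpoint name is the publicly released many-to-many mBART-50 model and is assumed here as the base model; the fine-tuning itself is covered in Chapter 5.

# Minimal sketch: load the pre-trained mBART-50 base model (assumed checkpoint)
# and run a baseline English-to-Hindi translation before any fine-tuning.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

base_model = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(base_model)
model = MBartForConditionalGeneration.from_pretrained(base_model)

tokenizer.src_lang = "en_XX"  # source: English
encoded = tokenizer("What's up?", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hi_IN"),  # target: Hindi
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])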
1.2. MOTIVATION
While widely-used translation tools have reached acceptable levels of fluency for
formal documents, they fall short in non-standard linguistic domains:
• English slang like "What's up?", "I'm beat", or "throw shade" is often translated
literally, resulting in incomprehensible or awkward output.
• Cultural nuances and emotional tone are frequently lost.
• Generic NMT models do not adapt well to emerging linguistic trends or internet
language.
This inadequacy limits the applicability of machine translation in casual communication,
entertainment media, education, and social media monitoring.
Fine-tuning enables the reuse of powerful pre-trained language models by adapting them to
domain-specific tasks, significantly reducing the cost and time needed for training from
scratch. This project is motivated by the need to build a practical, adaptable, and culturally
aware machine translation system that performs well in informal contexts and can easily be
extended to other languages or dialects.
Problem Statement:
To develop and fine-tune a neural translation model that effectively translates
informal and slang-based English expressions into fluent, contextually appropriate colloquial
Hindi, improving the linguistic and cultural relevance of automatic translation systems.
1.6. OBJECTIVE
The primary objectives of the project are:
• To identify gaps in existing translation systems for informal English-to-Hindi
translation.
• To curate or compile a dataset of English slang/informal expressions paired with
colloquial Hindi equivalents.
• To fine-tune a pre-trained NMT model using Hugging Face's tools.
• To evaluate the model using both automatic metrics and human judgment.
• To provide a deployable translation interface for testing and real-world use.
1.7. SCOPE
The scope of this project includes:
• Languages: English (source) and Hindi (target)
• Domain: Deep Learning-based Natural Language Processing (NLP) using Transfer
Learning for Neural Machine Translation (NMT)
• Technology: Hugging Face Transformers, PyTorch/TensorFlow, tokenizers, BLEU
metrics
• Deployment: CLI, Web App, or API integration for demo/testing purposes
CHAPTER – 2
LITERATURE SURVEY
This paper emphasizes the importance of contextual awareness in neural machine translation
(NMT), particularly for low-resource languages. It introduces a specialized corpus enriched
with slang and informal expressions to improve translation accuracy in real-life settings. The
proposed model successfully incorporates contextual and colloquial variations,
demonstrating improved performance over traditional models in informal language
scenarios.
[2] Multilingual Pre-training Model-Assisted Contrastive Learning Neural Machine
Translation
This research explores the use of multilingual pre-trained models combined with contrastive
learning to enhance translation quality. By leveraging shared representations across
languages and optimizing through contrastive objectives, the model learns finer distinctions
between semantically similar but contextually different phrases. This approach is shown to
be highly effective in improving cross-lingual transfer and robustness in low-data
environments.
[4] Research on Machine Translation (MT) System Based on Deep Learning
This foundational paper surveys the evolution of machine translation systems powered by
deep learning. It discusses different architectures like RNNs, CNNs, and Transformers, and
how they contribute to the current state-of-the-art in NMT. The authors provide insights into
the challenges of training such systems and highlight the role of large-scale data and GPU
acceleration in improving accuracy and speed.
[5] Improving Indonesian Informal to Formal Style Transfer via Pre-trained Language
Models
This paper addresses the style transfer challenge, specifically converting informal
Indonesian text into formal language. By using pre-trained models like BERT, the
researchers achieve significant gains in grammaticality and fluency. The techniques outlined
are highly relevant for tasks involving register conversion, showing how informal speech
patterns can be systematically restructured into a more formal and acceptable form.
[6] An Efficient Way to Incorporate BERT Into Neural Machine Translation
This research presents a novel approach to integrating BERT into traditional
encoder-decoder NMT frameworks. The authors propose a dual-encoder strategy that combines
BERT’s contextual embeddings with translation-specific representations, resulting in higher
translation fidelity. The paper demonstrates how BERT’s semantic understanding enhances
phrase alignment and lexical choices, especially in ambiguous or polysemous cases.
Sl.No. | Title | Author | Year | Advantages | Limitations
2 | Multilingual Pre-training Model-Assisted Contrastive Learning Neural Machine Translation | Chen, J., Liu, X., & Zhang, Y. | 2024 | Boosts performance via contrastive learning; supports multiple languages | Complex architecture, hard to train on low data
3 | Fine-Tuning Self-Supervised Multilingual Sequence-to-Sequence Pretraining for Neural Machine Translation | Liu, Y., Ji, H., & Huang, M. | 2023 | Improves translation quality with self-supervised learning | May not capture informal language well
6 | An Efficient Way to Incorporate BERT Into Neural Machine Translation | Yang, Z., Dai, Y., & Meng, Z. | 2023 | Uses BERT for better semantic encoding, boosts accuracy | Computationally expensive, not slang-specific
CHAPTER – 3
REQUIREMENT ANALYSIS
3.1 OPERATING ENVIRONMENT
The operating environment defines the platform and tools required for the
development, training, and deployment of the translation model.
• Development Environment: Local system or cloud (e.g., Google Colab, AWS EC2
with GPU)
• Programming Language: Python 3.7+
• Frameworks/Libraries:
o Hugging Face Transformers
o Hugging Face Datasets
o PyTorch or TensorFlow
o Tokenizers
o NLTK for text processing
• Execution Environment: Command-line interface, Jupyter Notebooks, optional Web UI
(Flask or Streamlit)
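For reference, the libraries listed above can typically be installed in a Colab or local Python 3.7+ environment with a single cell; the package names below are the standard PyPI releases, and versions are deliberately left unpinned as an assumption.

# Illustrative dependency installation (Colab/Jupyter cell). Package names are
# the standard PyPI releases; the project's exact versions are not fixed here.
!pip install transformers datasets tokenizers nltk torch flask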
3.3 SOFTWARE REQUIREMENTS
Component             Version / Details
Operating System      Windows 10+, Ubuntu 20.04+, macOS
Python Version        3.7 or above
Libraries             Transformers, Datasets, Tokenizers, Torch/TensorFlow, Flask (optional)
IDE / Editor          VS Code, Jupyter Notebook, PyCharm
Browser (for Web UI)  Chrome, Firefox
3.6 SYSTEM ANALYSIS
The system is designed using a modular approach to support clear separation between
preprocessing, training, evaluation, and inference. The architecture follows these layers:
System Architecture Overview
1. Data Layer: Manages dataset loading, cleaning, and tokenization.
2. Model Layer: Loads a pre-trained NMT model and fine-tunes it on the custom
dataset.
3. Training Layer: Handles training loops, optimizer, scheduler, and checkpointing.
4. Evaluation Layer: Computes evaluation metrics (BLEU, loss curves).
5. User Interface (Optional): Provides CLI or Web UI for users to input text and get
translated output.
6. Deployment Layer (Optional): Deploys the model via API or embedded tool for use
in applications.
This modular design ensures flexibility for future extensions like:
• Adding support for more languages
• Expanding the dataset
• Deploying on mobile or edge devices
CHAPTER – 4
SYSTEM ANALYSIS & DESIGN
1. Architecture Overview
• Data Pipeline
• Model Architecture
• Evaluation Engine
Each layer is responsible for a specific function and interacts with other components
through well-defined interfaces.
2. Data Pipeline
This stage is responsible for preparing the dataset used for fine-tuning:
• The cleaned text is converted into token sequences (input IDs and attention masks)
for input into the transformer model.
3. Model Architecture
Fine-Tuning Objective:
Training Configuration:
o Batch size, learning rate, and number of epochs are set via a configuration file
4. Evaluation Engine
Quantitative Metrics:
Qualitative Metrics:
Inference Module:
Deployment Options:
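The report leaves the deployment options open; as one hedged possibility, a minimal Flask wrapper around the fine-tuned model could look like the sketch below. The /translate route, the port, and the generation parameters are illustrative assumptions rather than the project's confirmed interface; the model directory matches the one used in Chapter 5.

# Minimal Flask deployment sketch (illustrative only; route and port are
# assumptions, not the project's confirmed design).
from flask import Flask, request, jsonify
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

app = Flask(__name__)
model = MBartForConditionalGeneration.from_pretrained("/content/mbart-finetuned-hi")
tokenizer = MBart50TokenizerFast.from_pretrained("/content/mbart-finetuned-hi")
tokenizer.src_lang = "en_XX"

@app.route("/translate", methods=["POST"])
def translate():
    text = request.json["text"]
    encoded = tokenizer(text, return_tensors="pt")
    output = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("hi_IN"),
    )
    return jsonify({"hindi": tokenizer.batch_decode(output, skip_special_tokens=True)[0]})

if __name__ == "__main__":
    app.run(port=5000)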
The diagram illustrates the workflow of a slang-to-Hindi translation system, from user input
to translated output.
The diagram illustrates the end-to-end workflow of a neural machine translation
system, from input processing and model training to evaluation, optimization, and
final translation output.
4.4 CLASS DIAGRAM

4.5 COMPONENT DIAGRAM
The component diagram for this project represents how the data preprocessing module,
tokenizer, fine-tuned mBART model, and translation interface work together to translate
context-aware English inputs into Hindi outputs.
CHAPTER – 5
IMPLEMENTATION
5.1 EXPLANATION OF KEY FUNCTIONS
1. Data Preparation
The dataset is a JSON file containing English slang sentences, context, and their
Hindi translations.
Each English sentence is enriched with contextual prompts to help the model understand
informal meanings better.
The data is split into training and test sets for model evaluation.
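A minimal sketch of this preparation step is given below. The JSON field names ("english", "context", "hindi"), the file name, the prompt template, and the 90/10 split ratio are illustrative assumptions; the report does not reproduce the exact schema.

# Sketch of data preparation: load the JSON dataset, add contextual prompts,
# and split into train/test sets. Field names and split ratio are assumptions.
import json
from datasets import Dataset

with open("slang_dataset.json") as f:  # hypothetical file name
    records = json.load(f)

# Enrich each English sentence with a contextual prompt, as described above.
for r in records:
    r["source"] = f"Translate the {r['context']} meaning of: {r['english']}"

split = Dataset.from_list(records).train_test_split(test_size=0.1)
train_data, test_data = split["train"], split["test"]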
2. Tokenization
The MBart50TokenizerFast tokenizer is used to convert text into token IDs, which
are numerical representations the model understands.
The tokenizer is configured with the source (en_XX) and target (hi_IN) language codes to
ensure proper multilingual processing.
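In code, this language configuration is a one-time setup on the tokenizer; the checkpoint name below is assumed, as before.

# Configure the tokenizer's language pair; "en_XX" and "hi_IN" are mBART-50's
# standard identifiers for English and Hindi.
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "en_XX"  # source language: English
tokenizer.tgt_lang = "hi_IN"  # target language: Hindi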
3. Preprocessing Function
Prepares both source (English) and target (Hindi) sentences using truncation and
padding to maintain consistent input length.
Appends tokenized Hindi as labels for supervised training, allowing the model to learn
how to generate correct translations.
4. Training Setup
Training arguments are set using TrainingArguments to define the number of
epochs, batch sizes, logging strategies, and output directories.
Weights & Biases logging is disabled for simplicity.
5. Trainer Class
The Hugging Face Trainer class wraps the model, training data, and training
arguments.
Handles training, validation, saving checkpoints, and evaluation seamlessly.
6. Model Saving
The final model and tokenizer are saved in a directory for later inference or
deployment.
5.1.1 Operational Workflow
Input: Load raw JSON data with slang sentences and their Hindi translations.
Processing: Contextual prompts are added to English input, and data is tokenized.
Training:
The model is initialized with pre-trained weights from Facebook's mBART-50.
It is fine-tuned using the Trainer on the prepared dataset.
Evaluation:
The model is evaluated after each epoch using the validation set. Metrics
like loss and generation accuracy can be logged.
Output:
A fine-tuned translation model capable of handling informal English-to-Hindi
translation is produced.
This model can be saved and deployed via an API or app.
5.3 MODULES
Module-Level Breakdown
1. DatasetProcessor Module
Reads and structures raw data.
Adds prompt-style instructions to the inputs.
Tokenizes and maps data for training.
2. ModelTrainer Module
Loads pre-trained mBART model.
Applies training parameters and executes fine-tuning.
Saves the trained model and tokenizer.
3. EvaluationStrategy Module
Could include BLEU score or accuracy evaluation using compute_metrics() inside Trainer.
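A hedged sketch of such a hook is shown below. It assumes a Seq2SeqTrainer run with predict_with_generate=True (so predictions arrive as token IDs) and uses the evaluate library's sacreBLEU wrapper; the report itself only notes that such a metric could be plugged in.

# Sketch of a BLEU-based compute_metrics hook. Assumes Seq2SeqTrainer with
# predict_with_generate=True, so `preds` are generated token IDs; the choice
# of the `evaluate` sacreBLEU wrapper is also an assumption.
import numpy as np
import evaluate

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Restore pad tokens where -100 was used to mask the loss.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bleu.compute(predictions=decoded_preds,
                          references=[[ref] for ref in decoded_labels])
    return {"bleu": result["score"]}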
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained(model_name)  # model_name defined earlier
model = MBartForConditionalGeneration.from_pretrained(model_name)

def preprocess(examples):  # completed from the truncated fragment; field names assumed
    inputs = tokenizer(examples["source"], truncation=True, padding="max_length", max_length=128)
    targets = tokenizer(text_target=examples["hindi"], truncation=True, padding="max_length", max_length=128)
    inputs["labels"] = targets["input_ids"]
    return inputs
import os
os.environ["WANDB_DISABLED"] = "true" # Disable Weights & Biases logging
# Step 8: Define the training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
output_dir="/content/mbart-finetuned-hi",
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
num_train_epochs=10,
save_strategy="epoch",
logging_strategy="epoch",
save_total_limit=2,
fp16=True,
logging_dir="/content/logs"
)
# Step 9: Trainer setup
from transformers import Trainer

trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train,
eval_dataset=tokenized_test,
tokenizer=tokenizer
)
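The training and saving calls themselves are not reproduced in the excerpt above; under this setup they would presumably be the standard pair below, with the output directory mirroring the one used in TrainingArguments and loaded again in the inference code that follows.

# Step 10 (assumed): run fine-tuning and save the result for later inference.
trainer.train()
model.save_pretrained("/content/mbart-finetuned-hi")
tokenizer.save_pretrained("/content/mbart-finetuned-hi")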
model = MBartForConditionalGeneration.from_pretrained("/content/mbart-finetuned-hi")
tokenizer = MBart50TokenizerFast.from_pretrained("/content/mbart-finetuned-hi")
tokenizer.src_lang = "en_XX"
forced_bos_token_id = tokenizer.convert_tokens_to_ids("hi_IN")
test_texts = [
"Translate the slang meaning of: That's fire!",
"Translate the literal meaning of: He lit the fire.",
"Translate the slang meaning of: That party was sick!",
"Translate the literal meaning of: He is sick.",
"Translate the slang meaning of: The party was lit.",
"Translate the literal meaning of: She lit the lamp.",
"Translate the slang meaning of: That deal was a steal!",
"Translate the literal meaning of: He tried to steal the wallet.",
"Translate the slang meaning of: That song is fire!",
"Translate the literal meaning of: The fire spread quickly."
]
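The loop that actually generates translations for these test sentences is not included in the excerpt; a minimal sketch consistent with the inference setup above would be the following, where max_length is an illustrative assumption.

# Sketch of the inference loop over test_texts, reusing the tokenizer, model,
# and forced_bos_token_id configured above. max_length is an assumption.
for text in test_texts:
    encoded = tokenizer(text, return_tensors="pt")
    output = model.generate(**encoded,
                            forced_bos_token_id=forced_bos_token_id,
                            max_length=64)
    print(text, "->", tokenizer.batch_decode(output, skip_special_tokens=True)[0])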
from huggingface_hub import login  # import required for the login step
login()
!huggingface-cli whoami
# Push to Hugging Face hub
model.push_to_hub("ChandrikaManchikanti/finetuned-mbart-slang-literal-en-hi")
tokenizer.push_to_hub("ChandrikaManchikanti/finetuned-mbart-slang-literal-en-hi")
CHAPTER – 6
TESTING & VALIDATION
6.3 TEST DESIGN
Goal: Design test scenarios, test cases, and expected outcomes.
Test Scenarios:
• A slang sentence is correctly translated into informal Hindi.
• Invalid inputs are handled gracefully.
• Tokenization and model loading do not throw errors.
Sample Test Inputs:
Input: “He ghosted me after the party.”
Context: Relationship slang
Expected Output (Colloquial Hindi): “पार्टी के बाद वो गायब हो गया।”
6.4 TEST EXECUTION
Process:
• Execute model training with the selected dataset.
• Provide input sentences through a notebook/script.
• Record the actual translation outputs.
• Evaluate against expected translations manually or using the BLEU score (a
stand-alone check is sketched below).
Tools Used:
• Jupyter Notebook
• Hugging Face Trainer module
• Manual output validation
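For the BLEU comparison mentioned above, a small stand-alone check can score recorded outputs against expected translations; the direct use of the sacrebleu package and the placeholder strings below are assumptions.

# Stand-alone BLEU check on recorded outputs (sacrebleu usage and the
# placeholder strings are illustrative assumptions).
import sacrebleu

hypotheses = ["पार्टी के बाद वो गायब हो गया।"]    # actual model outputs
references = [["पार्टी के बाद वो गायब हो गया।"]]  # expected translations
score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {score.score:.2f}")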
6.5 TEST REPORTING
Key Findings:
• Most outputs were contextually accurate and grammatically fluent.
• Occasional misinterpretation of slang occurred due to a lack of sufficient samples.
• Training was stable and performance was consistent.
Issues Identified:
• Slight drop in translation quality for ambiguous phrases.
• Need for more data variety to handle edge cases better.
Final Verdict: The model passed functional, performance, and usability tests for the
prototype phase.
CHAPTER – 7
OUTPUT SCREENS
Fig: 7.1 Output of the fine-tuned model

Fig: 7.2 Import and use the fine-tuned model
CHAPTER – 8
CONCLUSION AND FUTURE SCOPE
8.1 CONCLUSION
This project demonstrates the effectiveness of fine-tuning a pre-trained Neural Machine
Translation (NMT) model, specifically Facebook's mBART-50, for the task of translating
English slang and informal expressions into colloquial Hindi. By leveraging Hugging Face's
Transformers and Datasets libraries, the model was successfully adapted to understand
domain-specific language and context. The approach of adding prompt-based contextual cues
significantly enhanced the model's ability to produce more accurate and culturally relevant
translations.
The use of a multilingual, many-to-many architecture allowed seamless integration between
English and Hindi, while fine-tuning ensured the model became domain-aware without
requiring training from scratch. Overall, this project highlights the power of transfer
learning and demonstrates a practical, scalable method for improving machine translation
systems for informal and low-resource language tasks.
8.2 FUTURE ENHANCEMENT
• Support for More Local Languages: Extend the model to include other Indian languages
like Tamil, Bengali, or Marathi by using the same multilingual backbone.
• Human-in-the-Loop Feedback: Integrate human reviewers to provide corrections during
inference, enabling iterative model refinement and improving trustworthiness.
• Evaluation with Human Metrics: Go beyond BLEU scores and integrate human evaluation
metrics such as fluency, cultural appropriateness, and comprehension.
• Code-Mixed and Hinglish Handling: Train the model to handle code-mixed (English-Hindi)
sentences, which are very common in everyday communication in India.
REFERENCES