0% found this document useful (0 votes)
25 views52 pages

517 Modified

This mini project report details the fine-tuning of a Neural Machine Translation (NMT) model to enhance the translation of English slang into colloquial Hindi. The project utilizes deep learning techniques and tools from Hugging Face to preprocess data, optimize translation accuracy, and deploy a model capable of generating natural-sounding translations. The work aims to improve the adaptability and cultural relevance of AI-driven translation systems for informal contexts.

Uploaded by

22p61a05i0
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views52 pages

517 Modified

This mini project report details the fine-tuning of a Neural Machine Translation (NMT) model to enhance the translation of English slang into colloquial Hindi. The project utilizes deep learning techniques and tools from Hugging Face to preprocess data, optimize translation accuracy, and deploy a model capable of generating natural-sounding translations. The work aims to improve the adaptability and cultural relevance of AI-driven translation systems for informal contexts.

Uploaded by

22p61a05i0
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 52

AN INDUSTRIAL ORIENTED MINI PROJECT REPORT

ON
Fine-Tuning a Neural Machine Translation Model from English to
Hindi Language

submitted in partial fulfillment of the requirement for the award of the degree of

BACHELOR OF TECHNOLOGY IN

COMPUTER SCIENCE AND ENGINEERING


By

N. Sri Abhinav 23P65A0517

Under the esteemed guidance of

Mr. Syed Noor Mohammed


Assistant Professor, Dept. of CSE

Department of Computer Science and Engineering

Aushapur Village, Ghatkesar Mandal,Medchal Malkajigiri (District) Telangana-501301

May-2025

i
DECLARATION

I, N. Sri Abhinav, bearing hall ticket numbers 23P65A0517 hereby declare that the
industrial oriented mini project report entitled “Fine-Tuning a Neural Machine
Translation Model From English to Hindi Language” under the guidance of Mr. Syed
Noor Mohammed, Assistant Professor , Department of Computer Science and
Engineering, Vignana Bharathi Institute of Technology, Hyderabad, have submitted
to Jawaharlal Nehru Technological University Hyderabad, Kukatpally, in partial
fulfilment of the requirements for the award of the degree of Bachelor of Technology in
Computer Science and Engineering.

This is a record of Bonafide work carried out by me and the results embodied in this
project have not been reproduced or copied from any source. The results embodied in this
project report have not been submitted to any other university or institute for the award
of any other degree or diploma.

N. Sri Abhinav 23P65A0517

ii
Aushapur (V), Ghatkesar (M), Hyderabad, Medchal – Dist, Telangana – 501 301.

DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the industrial oriented mini project titled “Fine-Tuning a Neural
Machine Translation Model From English to Hindi Language” Submitted by
N. Sri Abhinav (23P65A0517) B. Tech, III- II semester, Department of Computer
Science & Engineering is a record of the bonafide work carried out by me.

The Design embodied in this report have not been submitted to any other University for
the award of any degree.

INTERNAL GUIDE HEAD OF THE DEPARTMENT

Mr. Syed Noor Mohammed Dr. Raju Dara


Assistant professor, Professor, CSE Dept.
CSE Dept.

EXTERNAL EXAMINER

iii
ACKNOWLEDGEMENT

I am extremely thankful to our beloved Chairman, Dr. N. Goutham Rao and


Secretary, Dr. G. Manohar Reddy who took keen interest to provide me the
infrastructural facilities for carrying out the project work.

Self-confidence, hard work, commitment and planning are essential to carry out any
task. Possessing these qualities is sheer waste, if an opportunity does not exist. So, i
whole- heartedly thank Dr. P.V.S. Srinivas, Principal, and Dr. Dara Raju, Head of the
Department, Computer Science and Engineering for their encouragement and support and
guidance in carrying out the project.

I would like to express our indebtedness to the Overall Project Coordinator, Dr. M.
Venkateswara Rao, Professor, and Section Coordinators, Ms. P. Suvarna Puspha,
Associate Professor, Ms. A. Manasa, Associate Professor, Department of CSE, for their
valuable guidance during the course of project work.

I thank our Project Guide, Mr. Syed Noor Mohammed, Assistant Professor,
Department of Computer Science and Engineering for providing me with an excellent
project and guiding me in completing my Mini Project successfully.

I would like to express my sincere thanks to all the staff of Computer Science and
Engineering, VBIT, for their kind cooperation and timely help during the course of my
project. Finally, I would like to thank our parents and friends who have always stood by
me whenever I am in need of them.

IV
ABSTRACT

Neural Machine Translation (NMT) models have significantly improved language


translation by leveraging deep learning techniques. This project focuses on fine-tuning a pre-
trained model to enhance its ability to translate English slang expressions into colloquial
language. The dataset is preprocessed and tokenized using Hugging Face's Transformers and
Datasets library. The model undergoes fine-tuning using a supervised learning approach,
optimizing translation accuracy. Training is conducted with custom batch sizes, learning
rates, and evaluation strategies to ensure optimal performance. Finally, the fine-tuned model
is deployed to generate high-quality, natural-sounding translations of everyday informal
conversations. This work demonstrates the effectiveness of transfer learning in improving
language models for domain-specific translation tasks, making AI- driven translation
systems more adaptable and culturally relevant.
Key words : Neural Machine Translation (NMT), Deep learning, English slang translation,
Colloquial Hindi, Fine-tuning, Hugging Face Transformers, Supervised learning,
Tokenization, Transfer learning etc.

v
VISION

To become, a Center for Excellence in Computer Science and


Engineering with a focused Research, Innovation through Skill Development
and Social Responsibility.

MISSION
DM-1: Provide a rigorous theoretical and practical framework across
State-of-the- art

infrastructure with an emphasis on software development.

DM-2: Impact the skills necessary to amplify the pedagogy to grow


technically and to meet interdisciplinary needs with collaborations.

DM-3: Inculcate the habit of attaining the professional knowledge,


firm ethical values,innovative research abilities and societal needs.

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

PEO-01: Domain Knowledge: Synthesize mathematics, science,


engineering fundamentals, pragmatic programming concepts to formulate
and solve engineering problems using prevalent and prominent software.
PEO-02: Professional Employment: Succeed at entry- level engineering
positions in the software industries and government agencies.
PEO-03: Higher Degree: Succeed in the pursuit of higher degree in
engineering or other by applying mathematics, science, and engineering
fundamentals.
PEO-04: Engineering Citizenship: Communicate and work effectively on
team-based engineering projects and practice the ethics of the profession,
consistent with a sense of social responsibility.

vi
PEO-05: Lifelong Learning: Recognize the significance of independent
learning to become experts in chosen fields and broaden professional
knowledge.

PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO-01: Ability to explore emerging technologies in the field of computer


science and engineering.

PSO-02: Ability to apply different algorithms indifferent domains to


create innovative products.

PSO-03: Ability to gain knowledge to work on various platforms to


develop useful and secured applications to the society.

PSO-04: Ability to apply the intelligence of system architecture and


organization in designing the new era of computing environment.

PROGRAM OUTCOMES (POs)

Engineering graduates will be able to:

PO-01: Engineering knowledge: Apply the knowledge of mathematics,


science, engineering fundamentals, and an engineering specialization to the
solution of complex engineering problems.

PO-02: Problem analysis: Identify, formulate, review research literature,


and analyze complex engineering problems reaching substantiated
conclusions using first principles of mathematics, natural sciences, and
engineering sciences.

PO-03: Design/development of solutions: Design solutions for complex


engineering problems and design system components or processes that meet
the specified needs with appropriate consideration for the public health and
safety, and cultural, societal, and environmental considerations.

PO-04: Conduct investigations of complex problems : Use research-based


knowledge and research methods including design of experiments, analysis

vii
and Department of Computer Science and Engineering interpretation of data,
and synthesis of the information to provide valid conclusions.

PO-05: Modern tool usage: Create, select, and apply appropriate techniques,
resources, and modern engineering and IT tools including prediction and
modelling to complex engineering activities with an understanding of the
limitations.

PO-06: The engineer and society: Apply reasoning informed by the


contextual knowledge to assess societal, health, safety, legal and cultural
issues and the consequent responsibilities relevant to the professional
engineering practice.

PO-07: Environment and sustainability: Understand the impact of the


professional engineering solutions in societal and environmental contexts,
and demonstrate the knowledge of, and need for sustainable development.

PO-08: Ethics: Apply ethical principles and commit to professional ethics


and responsibilities and norms of the engineering practice.

PO-09: Individual and team work: Function effectively as an individual,


and as a member or leader in diverse teams, and in multidisciplinary settings.

PO-10: Communication: Communicate effectively on complex engineering


activities with the engineering community and with society at large, such as,
being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear
instructions.

PO-11: Project management and finance: Demonstrate knowledge and


understanding of the engineering and management principles and apply these
to one's own work, as a member and leader in a team, to manage projects and
in multidisciplinary environments.

PO-12: Life-long learning: Recognize the need for, and have the preparation
and ability to engage in independent and life-long learning in the broadest
context of technological change.

viii
Project Mapping Table:
a) PO Mapping:
PO P P PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO1
O O
2
1 2
Finetuning a 3 3 3 2 3 2 1 2 2 3 2 3
neural
machine
translation
model from
English to
Hindi
Language

b) PSO Mapping:

PSO PSO PSO2 PSO3 PSO4


1
Finetuning a 3 3 2 3
neural
machine
translation
model from
English to
Hindi
Language

ix
List of Figures
S.No. Title Page No.

1 Use Case diagram 16


2 Sequence Diagram 16

3 Activity Diagram 17
4 Class Diagram 17
5 Architecture Diagram 18
6 Component Diagram 18
7 Output of finetuned model 30
8 Import and use the fine-tuned model 30

x
List of Tables

S.No. Title Page No.

1 Literature Survey 7-8

2 Hardware Requirements 10

3 Software Requirements 10

4 Test case 28

xi
Nomenclature

NMT Neural Machine Translation


mBART Multilingual Bidirectional and Auto-Regressive Transformers
NLP Natural Language Processing
GPU Graphics Processing Unit
Tokenizer Tool that splits text into tokens (words or subwords)
Context Slang or literal usage indicator in English input
Fine-tuning Adjusting a pretrained model to a specific task or dataset
Hugging Face A popular platform and library for working with transformer models
JSON JavaScript Object Notation, used for dataset formatting
BLEU Bilingual Evaluation Understudy (translation quality metric)
fp16 16-bit floating-point precision (used for faster training on GPU)
src_lang, target_lang Source and target language codes for mBART

xii
TABLE OF CONTENTS

CONTENTS PAGE NO

Declaration ii

Certificate iii

Acknowledgements iv

Abstract v

Vision & Mission vi

List of Figures x

List of Tables xi

Nomenclature xii

Table of Contents xiii

CHAPTER 1:

INTRODUCTION 1-4

1.1 Introduction 2

1.2 Motivation 2

1.3 Existing System 2

1.4 Proposed System 3

1.5 Problem definition 3

1.6 Objective 4

1.7 Scope 4

CHAPTER 2:
LITERATURE SURVEY 5-9

CHAPTER 3:

REQUIREMENT ANALYSIS 10-13

xiii
3.1. Operating Environment 11

3.2. Hardware Requirements 11

3.3. Software Requirements 12


3.4. Functional Requirements 12
3.5. Non – Functional Requirements 12
3.6. System Analysis 13

CHAPTER 4:

SYSTEM DESIGN 14-20

4.1 Technical Blueprint of AI-Driven Phishing 15-16

4.2 Uml Diagram 17-20

CHAPTER 5:

IMPLEMENTATION 21-27

5.1 Explanation of key functions 22

5.1.1 Operational Workflow 23

5.2 Method of implementation 23

5.3 MODULEs 23

5.4 Sample Code 24-27

xiv
CHAPTER 6:
TESTING & VALIDATION 28-30
6.1 Testing process 29
6.2. Test planning 29
6.3. Test design 29
6.4. Test execution 30
6.5. Test reporting 30
6.6. Test cases 30

CHAPTER 7:
OUTPUT SCREENS 31-33

CHAPTER 8:
CONCLUSION AND FUTURE SCOPE 34-36
8.1 Conclusion 35
8.2 Future Enhancement 35

REFERENCES 37

xv
CHAPTER – 1
INTRODUCTION

1
CHAPTER – 1
INTRODUCTION

1.1. INTRODUCTION

This project proposes the fine-tuning of a pre-trained NMT model, specifically designed to
enhance the translation of informal English into culturally and linguistically accurate Hindi.
Leveraging the transfer learning capabilities of transformer-based models like mBART or
MarianMT, and using tools like Hugging Face Transformers, this system aims to deliver
translations that sound natural to native speakers and maintain the original tone and intent of
the source sentence.

1.2. MOTIVATION
While widely-used translation tools have reached acceptable levels of fluency for
formal documents, they fall short in non-standard linguistic domains:
 English slang like "What's up?", "I'm beat", or "Throw shade" is often translated
literally, resulting in incomprehensible or awkward output.
 Cultural nuances and emotional tone are frequently lost.
 Generic NMT models do not adapt well to emerging linguistic trends or internet
language.
This inadequacy limits the applicability of machine translation in casual communication,
entertainment media, education, and social media monitoring.
Fine-tuning enables the reuse of powerful pre-trained language models by adapting them to
domain-specific tasks, significantly reducing the cost and time needed for training from
scratch. This project is motivated by the need to build a practical, adaptable, and culturally
aware machine translation system that performs well in informal contexts and can easily be
extended to other languages or dialects.

1.3. EXISTING SYSTEM


Most current systems like Google Translate, Microsoft Translator, Amazon
Translate, and other open-source models are trained on large general-purpose datasets, such
as:
Europarl (European Parliament proceedings)
2
News Commentary
 Common Crawl
These datasets emphasize formal, grammatically correct sentences and often lack adequate
representation of slang, informal grammar, code-mixed text (e.g., Hinglish), or culturally
loaded expressions.
Limitations of Existing Systems:
 Lack of personalization or adaptability
 Incorrect or robotic translations of informal text
 Inability to evolve with rapidly changing informal expressions
While some research focuses on dialectal translation or informal text generation, there is still
a large gap in deploying these techniques effectively in real-world applications, particularly
for Indian languages.

1.4.1 PROPOSED SYSTEM


The proposed system addresses the limitations of existing models by:
1. Using Transfer Learning: Fine-tuning a pre-trained transformer-based NMT model
(like MarianMT, mBART, or T5) for informal translation tasks.
2. Creating or Sourcing an Informal Parallel Corpus: This involves English slang,
idioms, abbreviations (e.g., "LOL", "brb"), and their colloquial Hindi equivalents.
3. Preprocessing and Tokenization: Leveraging Hugging Face’s datasets and tokenizers
to clean, tokenize, and encode the data.
4. Training Strategy: Customized training with adjustable batch sizes, learning rate
schedules, and evaluation checkpoints using BLEU, ROUGE, and human validation.
5. Deployment: Exporting the fine-tuned model via an API or web app to allow real-
time translation.
The result is a lightweight yet powerful model that understands slang, captures tone, and
generates fluent, natural-sounding Hindi translations.

1.5. PROBLEM DEFINITION


Despite the progress of transformer models in machine translation, English-to-Hindi
translation in informal contexts remains inaccurate due to the lack of targeted data and
cultural adaptation.

3
Problem Statement:
To develop and fine-tune a neural translation model that effectively translates
informal and slang-based English expressions into fluent, contextually appropriate colloquial
Hindi, improving the linguistic and cultural relevance of automatic translation systems.

1.6. OBJECTIVE
The primary objectives of the project are:
 To identify gaps in existing translation systems for informal English-to-Hindi
translation.
 To curate or compile a dataset of English slang/informal expressions paired with
colloquial Hindi equivalents.
 To fine-tune a pre-trained NMT model using Hugging Face's tools.
 To evaluate the model using both automatic metrics and human judgment.
 To provide a deployable translation interface for testing and real-world use.

1.7. SCOPE
The scope of this project includes:
 Languages: English (source) and Hindi (target)
 Domain: Deep Learning-based Natural Language Processing (NLP) using Transfer
Learning for Neural Machine Translation (NMT).
 Technology: Hugging Face Transformers, PyTorch/TensorFlow, tokenizers, BLEU
metrics
 Deployment: CLI, Web App, or API integration for demo/testing purposes

4
CHAPTER – 2
LITERATURE SURVEY

5
CHAPTER – 2

LITERATURE SURVEY

A COMPREHENSIVE STUDY ON FINE-TUNING A NEURAL


MACHINE TRANSLATION MODEL FROM ENGLISH TO HINDI
LANGUAGE
[1] Context-Aware Neural Machine Translation for Low-Resource Languages Using Slang-
Infused Corpora

This paper emphasizes the importance of contextual awareness in neural machine translation
(NMT), particularly for low-resource languages. It introduces a specialized corpus enriched
with slang and informal expressions to improve translation accuracy in real-life settings. The
proposed model successfully incorporates contextual and colloquial variations,
demonstrating improved performance over traditional models in informal language
scenarios.

[2] Multilingual Pre-training Model-Assisted Contrastive Learning Neural Machine


Translation

This research explores the use of multilingual pre-trained models combined with contrastive
learning to enhance translation quality. By leveraging shared representations across
languages and optimizing through contrastive objectives, the model learns finer distinctions
between semantically similar but contextually different phrases. This approach is shown to
be highly effective in improving cross-lingual transfer and robustness in low- data
environments.

[3] Fine-Tuning Self-Supervised Multilingual Sequence-to-Sequence Pretraining for


Neural Machine Translation

This study focuses on adapting self-supervised multilingual sequence-to-sequence models,


like mBART, for downstream translation tasks. Fine-tuning is carried out using parallel
corpora, resulting in significant performance improvements in both high- and low-resource
language pairs. The paper also underlines the importance of pretraining on diverse
multilingual data to achieve better generalization across various translation domains.

6
[4] Research on Machine Translation (MT) System Based on Deep Learning

This foundational paper surveys the evolution of machine translation systems powered by
deep learning. It discusses different architectures like RNNs, CNNs, and Transformers, and
how they contribute to the current state-of-the-art in NMT. The authors provide insights into
the challenges of training such systems and highlight the role of large-scale data and GPU
acceleration in improving accuracy and speed.

[5] Improving Indonesian Informal to Formal Style Transfer via Pre-trained Language
Models

This paper addresses the style transfer challenge, specifically converting informal
Indonesian text into formal language. By using pre-trained models like BERT, the
researchers achieve significant gains in grammaticality and fluency. The techniques outlined
are highly relevant for tasks involving register conversion, showing how informal speech
patterns can be systematically restructured into a more formal and acceptable form.

[6] An Efficient Way to Incorporate BERT Into Neural Machine Translation

This research presents a novel approach to integrating BERT into traditional encoder-
decoder NMT frameworks. The authors propose a dual-encoder strategy that combines
BERT’s contextual embeddings with translation-specific representations, resulting in higher
translation fidelity. The paper demonstrates how BERT’s semantic understanding enhances
phrase alignment and lexical choices, especially in ambiguous or polysemous cases.

7
Sl Title Author Y Advantages Limitations
. e
N a
o r

1 Context- Gupta, A., V 2 Handles slang in low- Needs curated sla


Aware Neural Machine erma, R., & 0 resource languages, imp ng data, limited
Translation for Low- Singh, K. 2 roves context handling multilingual supp
Resource Languages Us 4 ort
ing Slang-
Infused Corpora

2 Multilingual Pre- Chen, J., Liu 2 Boosts performance via Complex architec
training Model- , X., & Zhan 0 contrastive learning, su ture, hard to train
Assisted Contrastive Le g, Y. 2 pports multiple languag on low data
arning Neural Machine 4 es
Translation

3 Fine-Tuning Self- Liu, Y., Ji, H 2 Improves translation qu May not capture i
Supervised Multilingual ., & Huang, 0 ality with self- nformal language
Sequence-to- M. 2 supervised learning well
Sequence Pretraining fo 3
r Neural Machine Trans
lation

4 Research on Machine T Zhang, Y., 2 Explains DL- More theoretical,


ranslation (MT) System & Wang, L. 0 based MT systems, intr lacks focus on rea
Based on Deep Learnin 2 oduces encoder- l-world datasets.
g 3 decoder usage

5 Improving Indonesian I Setya, A., M 2 Addresses informal-to- Specific to Indon


nformal to Formal Style ahendra, R., 0 formal tone shift using esian, not focused
Transfer via Pre- & Adriani, 2 PLMs on slang
trained Language Mode M. 3
ls

8
6 An Efficient Way to Inc Yang, Z., Da 2 Uses BERT for better s Computationally
orporate BERT Into Ne i, Y., & Men 0 emantic encoding, boos expensive, not sla
ural Machine Translatio g, Z. 2 ts accuracy ng-specific
n 3

Fig:2.1: LITERATURE SURVEY

9
CHAPTER – 3
REQUIREMENT ANALYSIS

10
CHAPTER -3

REQUIREMENT ANALYSIS
3.1 OPERATING ENVIRONMENT
The operating environment defines the platform and tools required for the
development, training, and deployment of the translation model.
 Development Environment: Local system or cloud (e.g., Google Colab, AWS EC2
with GPU)
 Programming Language: Python 3.7+
 Frameworks/Libraries:
o Hugging Face Transformers
o Hugging Face Datasets
o PyTorch or TensorFlow
o Tokenizers
o NLTK for text processing
 Execution Environment: Command-line interface, Jupyter Notebooks, optional Web UI
(Flask or Streamlit)

3.2 HARDWARE REQUIREMENTS


Component Minimum Specification Recommended
Specification
Processor Intel i5 (or equivalent) Intel i7 / AMD Ryzen 7
or
(CPU)
higher
RAM 8 GB 16–32 GB
Storage 10 GB (for datasets, models, logs) 100 GB SSD
GPU (for Optional (slow training on CPU) NVIDIA GPU (e.g., RTX
3060,
training)
T4, A100)
Internet Required for downloading models and
datasets from Hugging Face

11
3.3 SOFTWARE REQUIREMENTS
Component Version / Details
Operating System Windows 10+, Ubuntu 20.04+, macOS
Python Version 3.7 or above
Libraries Transformers, Datasets, Tokenizers, Torch/TensorFlow, Flask
(optional)
IDE / Editor VS Code, Jupyter Notebook, PyCharm
Browser (for Web Chrome, Firefox
UI)

3.4 FUNCTIONAL REQUIREMENTS


Functional requirements describe what the system should do.
 Data Preprocessing: Clean and tokenize input English text using Hugging Face’s
tokenizer.
 Translation Inference: Translate English text to colloquial Hindi using the fine-
tuned model.
 Training Interface: Support training with custom datasets, batch sizes, and learning
rates.
 Model Evaluation: Compute BLEU scores and provide accuracy metrics
during/after training.
 User Input Handling: Accept input sentences via CLI, notebook, or web form.
 Model Saving/Loading: Allow saving and reloading fine-tuned models.

3.5 NON-FUNCTIONAL REQUIREMENTS


Non-functional requirements focus on how the system performs.
 Usability: The system should be easy to use and interpret, even for non-technical
users.
 Performance: Translation should complete within 2 seconds per sentence (on GPU).
 Scalability: The model should handle both single-sentence and batch inputs.
 Maintainability: Code and model should be modular and well-documented.
 Security: Proper input sanitization to prevent injection or abuse in deployment.
 Reliability: The system should handle unexpected input without crashing.
 Portability: Should run on multiple platforms (Windows, Linux, macOS).

12
3.6 SYSTEM ANALYSIS
The system is designed using a modular approach to support clear separation between
preprocessing, training, evaluation, and inference. The architecture follows these layers:
System Architecture Overview
1. Data Layer: Manages dataset loading, cleaning, and tokenization.
2. Model Layer: Loads a pre-trained NMT model and fine-tunes it on the custom
dataset.
3. Training Layer: Handles training loops, optimizer, scheduler, and checkpointing.
4. Evaluation Layer: Computes evaluation metrics (BLEU, loss curves).
5. User Interface (Optional): Provides CLI or Web UI for users to input text and get
translated output.
6. Deployment Layer (Optional): Deploys the model via API or embedded tool for use
in applications.
This modular design ensures flexibility for future extensions like:
 Adding support for more languages
 Expanding the dataset
 Deploying on mobile or edge devices

13
CHAPTER - 4
SYSTEM ANALYSIS & DESIGN

14
CHAPTER – 4

SYSTEM ANALYSIS & DESIGN

4.1 TECHNICAL BLUEPRINT OF FINE-TUNING A NEURAL


MACHINE TRANSLATION MODEL FROM ENGLISH TO HINDI
LANGUAGE
This section serves as a high-level design document that explains how the various
components interact, from data ingestion to model training, evaluation, and deployment.

1. Architecture Overview

The system architecture follows a modular, layered approach comprising:

 Data Pipeline

 Model Architecture (Transformer-based NMT)

 Training and Fine-Tuning Loop

 Evaluation Engine

 Inference & Deployment Interface

Each layer is responsible for a specific function and interacts with other components
through well-defined interfaces.

2. Data Pipeline

This stage is responsible for preparing the dataset used for fine-tuning:

 Data Sources: Slang-rich English sentences paired with informal/colloquial Hindi


translations, either custom-curated or adapted from open datasets (like
OpenSubtitles, Twitter data, or crowd-sourced corpora).

 Cleaning and Preprocessing:

o Normalization of text (e.g., removing HTML tags, special characters)

o Tokenization using Hugging Face's AutoTokenizer

o Padding and truncation for batch processing

 Encoding: The tokenized sentences are converted into numerical representations

15
(input IDs and attention masks) for input into the transformer model.

3. Model Architecture

 Base Model: A transformer-based NMT model such as mBART, MarianMT, or


T5 (pre-trained on multilingual corpora).

 Fine-Tuning Objective:

o Train the model on custom English-Hindi sentence pairs

o Use cross-entropy loss and AdamW optimizer

o Enable teacher forcing to speed up convergence

 Training Configuration:

o Batch size, learning rate, number of epochs set via configuration file

o Trained using Hugging Face Trainer API or a custom training loop in


PyTorch

4. Evaluation Engine

To monitor and measure translation performance:

 Quantitative Metrics:

o BLEU Score (evaluates translation quality)

o ROUGE (optional for summarization-style scoring)

o Loss curves (training vs validation)

 Qualitative Metrics:

o Human evaluation: Fluency, naturalness, and correctness of output

o Error examples logged and analyzed

5. Inference & Deployment

After fine-tuning, the model is saved and deployed in a minimal interface:

 Inference Module:

o Accepts English text as input

o Processes it through tokenizer and model

o Outputs colloquial Hindi translation

16
 Deployment Options:

o Command-line interface for testing

o Web-based GUI using Flask, Streamlit, or Gradio

o REST API endpoint using FastAPI or Flask for integration into


applications

6. Tools and Libraries Use

 Hugging Face Transformers – for model loading, training, tokenization

 PyTorch – for deep learning backend

 Datasets Library – for handling custom and public datasets

 TensorBoard/Weights & Biases – for monitoring training

 Flask / Streamlit – for simple deployment

7. Model Saving and Reusability

 Can be reloaded later for additional training or direct inference

 Configuration and tokenizer also saved to allow full pipeline reproducibility.

Fig 4.1 use case diagram

17
The diagram illustrates the workflow of a slang-to-Hindi translation system, from user input

through preprocessing, model fine-tuning, and translation generation to final output.

4.2 SEQUENCE DIAGRAM

Fig 4.2: Sequence Diagram


The diagram shows the sequence of interactions between system components
during the fine-tuning and deployment of a neural machine translation model
from input processing to delivering the translated output.

4.3 ACTIVITY DIAGRAM

Fig 4.3: Activity Diagram

18
The diagram illustrates the end-to-end workflow of a neural machine translation
system, from input processing and model training to evaluation, optimization, and
final translation output.
4.4 CLASS DIAGRAM

Fig 4.4 Class diagram


The class diagram illustrates the structure of the system by showing the classes used
(like DatasetHandler, Translator, and ModelManager), their attributes, methods, and
relationships.
4.5 ARCHITECTURE DIAGRAM

Fig 4.5: Architecture diagram


19
The architecture diagram illustrates the overall system flow and high-level structure,
showing how components like the user interface, model backend, and data storage
interact in the translation pipeline.

4.6 COMPONENT DIAGRAM

Fig 4.6: Component diagram

The component diagram for this project represents how the data preprocessing module,
tokenizer, fine-tuned mBART model, and translation interface work together to translate
context-aware English inputs into Hindi outputs.

20
CHAPTER - 5
IMPLEMENTATION

21
CHAPTER – 5
IMPLEMENTATION
5.1 EXPLANATION OF KEY FUNCTIONS

1. Data Preparation
The dataset is a JSON file containing English slang sentences, context, and their
Hindi translations.
Each English sentence is enriched with contextual prompts to help the model understand
informal meanings better.
The data is split into training and test sets for model evaluation.
2. Tokenization
The MBart50TokenizerFast tokenizer is used to convert text into token IDs, which
are numerical representations the model understands.
The tokenizer is configured with the source (en_XX) and target (hi_IN) language codes to
ensure proper multilingual processing.
3. Preprocessing Function
Prepares both source (English) and target (Hindi) sentences using truncation and
padding to maintain consistent input length.
Appends tokenized Hindi as labels for supervised training, allowing the model to learn
how to generate correct translations.
4. Training Setup
Training arguments are set using TrainingArguments to define the number of
epochs, batch sizes, logging strategies, and output directories.
Weights & Biases logging is disabled for simplicity.
5. Trainer Class
The Hugging Face Trainer class wraps the model, training data, and training
arguments.
Handles training, validation, saving checkpoints, and evaluation seamlessly.
6. Model Saving
The final model and tokenizer are saved in a directory for later inference or
deployment.

22
5.1.1 Operational Workflow
Input: Load raw JSON data with slang sentences and their Hindi translations.
Processing: Contextual prompts are added to English input, and data is tokenized.
Training:
The model is initialized with pre-trained weights from Facebook's mBART-50.
It is fine-tuned using the Trainer on the prepared dataset.
Evaluation:
The model is evaluated after each epoch using the validation set. Metrics
like loss and generation accuracy can be logged.
Output:
A fine-tuned translation model capable of handling informal English-to-Hindi
translation is produced.
This model can be saved and deployed via an API or app.

5.2 METHOD OF IMPLEMENTATION


Framework: Hugging Face Transformers (for model and training), Datasets (for data
handling), PyTorch backend.
Model: facebook/mbart-large-50-many-to-many-mmt – a multilingual transformer model.
Training Technique: Supervised fine-tuning using cross-entropy loss.
Execution Environment: Google Colab (as paths and !pip install indicate), enabling GPU
acceleration and fast training.

5.3 MODULEs
Module-Level Breakdown
1. DatasetProcessor Module
Reads and structures raw data.
Adds prompt-style instructions to the inputs.
Tokenizes and maps data for training.
2. ModelTrainer Module
Loads pre-trained mBART model.
Applies training parameters and executes fine-tuning.
Saves the trained model and tokenizer.

23
3. EvaluationStrategy Module
Could include BLEU score or accuracy evaluation using compute_metrics() inside Trainer.

5.4 SAMPLE CODE

# Step 1: Install Required Libraries


!pip install transformers datasets sentencepiece
!pip install datasets

# Step 2: Import Modules


import json
import os
from datasets import Dataset
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast,
TrainingArguments, Trainer

# Step 3: Load & Prepare Data


with open("/content/english_hindi_slang.json", "r", encoding="utf-8") as f:
raw_data = json.load(f)["data"]

# Add prompt-style context to input sentences


for item in raw_data:
item["english"] = f'Translate the {item["context"]} meaning of: {item["english"]}'

# Step 4: Split Data


train_data = raw_data[:400]
test_data = raw_data[400:]

# Step 5: Convert to HuggingFace Datasets


train_dataset = Dataset.from_list(train_data)
test_dataset = Dataset.from_list(test_data)

# Step 6: Load mBART Tokenizer & Model


model_name = "facebook/mbart-large-50-many-to-many-mmt"

24
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Set tokenizer to source and target languages


tokenizer.src_lang = "en_XX"
target_lang = "hi_IN"

# Step 7: Preprocessing Function


def preprocess_function(examples):
inputs = tokenizer(examples["english"], truncation=True, padding="max_length",
max_length=128)
targets = tokenizer(examples["hindi"], truncation=True, padding="max_length",
max_length=128)

inputs["labels"] = targets["input_ids"]
return inputs

tokenized_train = train_dataset.map(preprocess_function, batched=True)


tokenized_test = test_dataset.map(preprocess_function, batched=True)

import os
os.environ["WANDB_DISABLED"] = "true" # Disable Weights & Biases logging
#Step 8
training_args = TrainingArguments(
output_dir="/content/mbart-finetuned-hi",
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
num_train_epochs=10,
save_strategy="epoch",
logging_strategy="epoch",
save_total_limit=2,
fp16=True,
logging_dir="/content/logs"
)
25
# Step 9: Trainer Setup
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train,
eval_dataset=tokenized_test,
tokenizer=tokenizer
)

# Step 10: Train


trainer.train()

# Step 11: Save Model


model.save_pretrained("/content/mbart-finetuned-hi")
tokenizer.save_pretrained("/content/mbart-finetuned-hi")

# Load fine-tuned model for inference


from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("/content/mbart-finetuned-hi")
tokenizer = MBart50TokenizerFast.from_pretrained("/content/mbart-finetuned-hi")
tokenizer.src_lang = "en_XX"
forced_bos_token_id = tokenizer.convert_tokens_to_ids("hi_IN")

test_texts = [
"Translate the slang meaning of: That's fire!",
"Translate the literal meaning of: He lit the fire.",
"Translate the slang meaning of: That party was sick!",
"Translate the literal meaning of: He is sick.",
"Translate the slang meaning of: The party was lit.",
"Translate the literal meaning of: She lit the lamp.",
"Translate the slang meaning of: That deal was a steal!",
"Translate the literal meaning of: He tried to steal the wallet.",
"Translate the slang meaning of: That song is fire!",

26
"Translate the literal meaning of: The fire spread quickly."
]

for sentence in test_texts:


inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
translated_tokens = model.generate(
**inputs,
forced_bos_token_id=forced_bos_token_id
)
print(f"\nInput: {sentence}")
print("Hindi Translation:", tokenizer.decode(translated_tokens[0],
skip_special_tokens=True))

!pip install huggingface_hub


from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Replace this with your actual model and tokenizer objects


model.save_pretrained("finetuned-mbart-slang-literal-en-hi")
tokenizer.save_pretrained("finetuned-mbart-slang-literal-en-hi")
from huggingface_hub import login

login()
!huggingface-cli whoami
# Push to Hugging Face hub
model.push_to_hub("ChandrikaManchikanti/finetuned-mbart-slang-literal-en-hi")
tokenizer.push_to_hub("ChandrikaManchikanti/finetuned-mbart-slang-literal-en-hi")

27
CHAPTER - 6
TESTING & VALIDATION

28
CHAPTER – 6

TESTING & VALIDATION


6. 1. TESTING PROCESS
Objective: Define scope, resources, test environment, and schedule.
Scope: Test model components like preprocessing, tokenization, fine-tuning, and
translation output.
Test Environment: Google Colab (with GPU), Python environment, Hugging Face
Transformers.
Resources: Sample slang dataset (english_hindi_slang.json), pre-trained mBART50 model.
Responsibilities: Development team handles test execution and output validation.

6. 2. TEST DESIGN
Goal: Design test scenarios, test cases, and expected outcomes.
Test Scenarios:
Slang sentence gets correctly translated to informal Hindi.
Invalid inputs are handled gracefully.
Tokenization and model load do not throw errors.
Sample Test Inputs:
Input: “He ghosted me after the party.”
Context: Relationship slang
Expected Output (Colloquial Hindi): “पार् के बाद वो गायब हो गया।”

6. 3. TEST EXECUTION
Process:
Execute model training with selected dataset.
Provide input sentences through notebook/script.
Record actual translation outputs.
Evaluate against expected translations manually or using BLEU score.
Tools Used:
Jupyter Notebook
Hugging Face Trainer module Manual output validation

29
6. 4. TEST REPORTING
Key Findings:
Most outputs were contextually accurate and grammatically fluent.
Occasional misinterpretation of slang due to lack of sufficient samples.
Training was stable and performance was consistent.
Issues Identified:
Slight drop in translation quality for ambiguous phrases. Need
for more data variety to handle edge cases better.
Final Verdict: The model passed functional, performance, and usability tests for the
prototype phase.

6. 5. SAMPLE TEST CASES


Test Description Input Expected Output Result
Case ID
TC01 Test basic slang "That movie was "वो फिल जबरदस थी!" Pass
translation lit!"
TC02 Check unknown "He’s capping "वो फिर से झूठ बोल रहा Pass
slang handling again." है ।"
TC03 Input without slang "She is going to the "वो बाजार जा रही है ।" Pass
market."
TC04 Evaluate empty "" Error message or null Pass
input output
TC05 Grammar "This party is dope." "ये पार् मस है ।" Pass
correctness in Hindi
TC06 Test context- Context = "वो जल रहा है ।" Pass
specific translation Friendship, "He's (Contextually informal)
salty."

30
CHAPTER - 7
OUTPUT SCREENS

31
CHAPTER – 7

OUTPUT SCREENS

Fig :7.1: Output of finetuned model

32
Fig:7.2 Import and use the fine-tuned model

33
CHAPTER - 8
CONCLUSION AND FUTURE
SCOPE

34
CHAPTER – 8

CONCLUSION AND FUTURE SCOPE

8.1 CONCLUSION
This project demonstrates the effectiveness of fine-tuning a pre-trained Neural Machine
Translation (NMT) model—specifically Facebook's mBART50—for the task of translating
English slang and informal expressions into colloquial Hindi. By leveraging Hugging Face's
Transformers and Datasets libraries, the model was successfully adapted to understand
domain-specific language and context.The approach of adding prompt-based contextual cues
significantly
enhanced the model’s ability to produce more accurate and culturally relevant translations.
The use of a multilingual, many-to-many architecture allowed seamless integration between
English and Hindi, while fine-tuning ensured the model became domain-aware without
requiring training from scratch.Overall, this project highlights the power of transfer learning
and demonstrates a practical, scalable method to improve machine translation systems for
informal and low-resource language tasks

8 .2 FUTURE ENHANCEMENT

Handling Idiomatic and Figurative Language


Idioms and figurative expressions pose a significant challenge in translation,
especially when taken literally. Future enhancements could include training the model on
datasets enriched with idioms, proverbs, and culturally specific phrases. This would help
the system understand that "kick the bucket" should be translated as "मर गया" instead of a
literal translation like "बाली को लात मारना". Accurate handling of idioms will greatly
improve translation quality in informal and conversational contexts.

35
Support for More Local Languages
Extend the model to include other Indian languages like Tamil, Bengali, or Marathi by
using the same multilingual backbone.
Human-in-the-Loop Feedback
Integrate human reviewers to provide corrections during inference, enabling iterative
model refinement and improving trustworthiness.
Evaluation with Human Metrics
Go beyond BLEU scores and integrate human evaluation metrics such as fluency, cultural
appropriateness, and comprehension.
Code-Mixed and Hinglish Handling
Train the model to handle code-mixed (English-Hindi) sentences, which are very common
in everyday communication in India.

36
REFERENCES

[1] Context-Aware Neural Machine Translation for Low-Resource


Languages Using Slang-Infused Corpora Gupta, A., Verma, R.,
& Singh, K. 2024
[2] Multilingual Pre-training Model-Assisted Contrastive Learning
Neural Machine Translation Chen, J., Liu, X., & Zhang, Y.
2024
[3] Fine-Tuning Self-Supervised Multilingual Sequence-to-Sequence
Pretraining for Neural Machine Translation Liu, Y., Ji, H., & Huang,
M. 2023
[4] Research on Machine Translation (MT) System Based on Deep
Learning Zhang, Y., & Wang, L. 2023
[5] Improving Indonesian Informal to Formal Style Transfer via Pre-
trained Language Models Setya, A., Mahendra, R., & Adriani,
M. 2023
[6] An Efficient Way to Incorporate BERT Into Neural Machine
Translation Yang, Z., Dai, Y., & Meng, Z. 2023
0

37

You might also like