Mini Project Report
on
Kannada Sentiment Analysis
Submitted by
Kishor Sinnur (U03NM21T006024)
Santosh Patil (U03NM21T006046)
Under the Guidance of
Dr. Kiran K
Associate Professor
Bangalore University
January 2025
University Visvesvaraya College of Engineering
K.R. Circle, Bangalore – 560001
CERTIFICATE
This is to certify that Kishor Sinnur (U03NM21T006024) and Santosh Patil
(U03NM21T006046) have successfully completed the Mini-Project work entitled
“Kannada Sentiment Analysis”, in partial fulfillment of the requirements of the Mini
Project (21ECMP608) of VI Semester prescribed by Bangalore University during the
Academic Year 2023–2024.
Guide:
Chairperson:
Examiners:
1. …………………………                    2. …………………………
ACKNOWLEDGEMENT
The knowledge and satisfaction that accompany the successful completion of any task would be
incomplete without acknowledging the invaluable contributions of individuals who made it possible. It is
with immense gratitude that we take this opportunity to express our heartfelt thanks to all those who
provided their support, guidance, and encouragement throughout the course of this mini project.
We are profoundly grateful to our project guide, Dr. Kiran K, Associate Professor, Department of
Computer Science and Engineering, UVCE, for his constant support, expert advice, and invaluable
insights that guided us through the challenges we faced during the project. His encouragement and
constructive suggestions were instrumental in shaping the direction and success of our work.
Our sincere thanks extend to Dr. Thriveni J, Chairperson and Professor, Department of Computer Science
and Engineering, UVCE, for her guidance, timely advice, and unwavering encouragement. Her insightful
feedback and support have been invaluable to the successful completion of this project.
We are deeply indebted to our esteemed Director, Prof. Subhasish Tripathy, for providing us with the
necessary infrastructure, resources, and the opportunity to carry out this project. His inspiring leadership
and constant motivation have been a driving force for us.
We would like to express our heartfelt gratitude to the Faculty members of the Department of Computer
Science and Engineering, UVCE, for their dedicated teaching and support throughout our academic
journey. Their vast knowledge and expertise have enriched our learning experience, laying the foundation
for this project.
We extend our special thanks to our Batchmates and Classmates, whose collaboration, discussions, and
camaraderie added value to our work and made the project journey enjoyable and fulfilling.
This mini project would not have been possible without the contributions of all these wonderful
individuals. We are truly grateful for the collective effort and support that made this endeavor a success.
TABLE OF CONTENTS
Title Page No
1. Introduction 1
2. Literature Review 2
4. Proposed Work 6
6. Conclusions 13
Bibliography 14
APPENDIX B : Screenshots 18
ABSTRACT
Social media platforms have revolutionized the way individuals express their opinions
and share experiences. While English dominates as a primary language for online interactions,
a growing number of users now prefer expressing their views in native languages, including
Kannada. This shift has introduced unique challenges, particularly with the prevalence of code-
mixed texts that blend Kannada and English. Sentiment Analysis, which involves extracting
opinions and emotions from such posts, is complicated by the linguistic diversity and rich
structure of Kannada.
In multilingual communities like India, where languages and dialects coexist, Code-
Mixing—the blending of native languages with English—is a common phenomenon. This is
particularly evident in social media texts where users often type in Romanized scripts for
convenience. While this facilitates easier communication, it poses significant challenges for
traditional NLP systems, as these are primarily designed for monolingual texts.
This project focuses on developing a sentiment analysis system for Kannada code-
mixed texts using a transformer-based approach. By leveraging Indic-BERT, a pre-trained
model optimized for Indian languages, the system is fine-tuned to analyze social media content
and classify sentiments into positive, negative, or neutral categories. This approach addresses
the challenges posed by code-mixing and demonstrates the potential of advanced NLP
techniques in processing under-resourced languages like Kannada.
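The fine-tuning pipeline described above can be illustrated with the Hugging Face transformers API. The following is a minimal sketch rather than the project's exact code: the public ai4bharat/indic-bert checkpoint, the 128-token truncation limit, and the label mapping are assumptions made for illustration.

```python
# Sketch: loading Indic-BERT for 3-way sentiment classification.
# Assumes the `transformers` and `torch` libraries are installed; the
# model id is the public ai4bharat/indic-bert checkpoint, which may
# differ from the project's fine-tuned weights.

LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}

def to_label(class_index: int) -> str:
    """Map a predicted class index to its sentiment label."""
    return LABELS[class_index]

def predict_sentiment(text: str) -> str:
    """Tokenize a (possibly code-mixed) review and classify its sentiment."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
    model = AutoModelForSequenceClassification.from_pretrained(
        "ai4bharat/indic-bert", num_labels=3
    )
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return to_label(int(logits.argmax(dim=-1)))
```

Note that the classification head created by num_labels=3 is randomly initialized, so meaningful predictions require fine-tuning on labeled Kannada data first.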
Studies on sentiment analysis for Indian languages have primarily focused on Hindi, Tamil,
and Telugu, leveraging approaches ranging from traditional machine learning algorithms to
deep learning models. For instance, Tamil-English code-mixed sentiment analysis has been
explored using transformer models like Indic-BERT, achieving promising results in handling
linguistic diversity and syntactic complexity. These findings highlight the potential of pre-
trained transformer models in analyzing under-resourced languages.
Kannada, like other Dravidian languages, poses distinct challenges for NLP systems:
Code-Mixing: Social media platforms often feature Kannada text written in Roman
script, interleaved with English words. This mix complicates tokenization, syntactic
parsing, and sentiment classification.
Early efforts in Kannada sentiment analysis relied on rule-based or statistical methods, using
lexicons to identify sentiment polarity. These methods, while simple, often failed to capture
the nuances of complex sentence structures and code-mixed content.
Recent studies have shifted towards machine learning and deep learning approaches:
Code-Mixed Sentiment Analysis: Most existing systems are monolingual and fail to
address the challenges of code-switching at lexical and syntactic levels.
3.1 REQUIREMENTS
Processor: Intel Core i5 or AMD Ryzen 5 and above (multi-core processors are
preferred for faster training and inference).
Memory (RAM): Minimum 8 GB; 16 GB or more recommended for efficient
handling of large datasets and model training.
Graphics Processing Unit (GPU): NVIDIA GPU with CUDA support (e.g., RTX
3050, or higher) for faster deep learning model training.
Storage: Minimum 256 GB SSD for storing datasets, pre-trained models, and results;
512 GB or more recommended for larger datasets and backups.
Network: Stable internet connection for downloading libraries, pre-trained models, and
datasets.
Pre-trained Model Resources: Hugging Face model hub or local fine-tuned Indic-
BERT/mBERT models.
Additional Tools:
o CUDA Toolkit for GPU acceleration (if using NVIDIA GPUs).
o SentencePiece or FastText for subword tokenization (if applicable).
Evaluation Metric:
o Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
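Accuracy here is the fraction of predictions that match the true labels; a minimal sketch:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# e.g. accuracy([0, 1, 2, 2], [0, 1, 1, 2]) -> 0.75
```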
The Kannada review sentiment analysis project showcases the ability to apply natural
language processing (NLP) techniques for understanding sentiments in Kannada. By
leveraging Indic-BERT, a transformer-based language model tailored for Indian languages,
the system effectively classifies sentiments into `Positive`, `Neutral`, and `Negative`
categories.
The project utilized a dataset of Kannada text reviews and employed rigorous pre-
processing, tokenization, and data handling techniques. The model was fine-tuned using a
balanced dataset, achieving high accuracy and robust generalization. Key steps such as custom
loss functions with class weights, effective training schedules, and optimized hyperparameters
contributed to the model's success. Validation metrics, including classification reports and
confusion matrices, highlighted its strong performance across different sentiment classes.
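One common way to realize the class-weighted loss mentioned above is inverse-frequency weighting; the sketch below assumes that scheme, since the report does not spell out the exact formula used.

```python
# Sketch: inverse-frequency class weights for an imbalanced 3-class
# sentiment dataset. This weighting scheme is an assumption for
# illustration; the project's exact scheme is not specified.
from collections import Counter

def class_weights(labels, num_classes=3):
    """weight_c = total_samples / (num_classes * count_c):
    rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]

# The resulting weights would typically be passed to a weighted loss,
# e.g. torch.nn.CrossEntropyLoss(weight=torch.tensor(weights)).
```

On a perfectly balanced dataset every weight is 1.0, so the loss reduces to ordinary cross-entropy.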
This sentiment analysis system holds potential for real-world applications, such as
analyzing feedback on Kannada-language platforms, monitoring public sentiment on social
media, and improving user experiences in regional markets. However, the project also
identifies areas for improvement, including expanding the dataset, addressing linguistic
diversity in Kannada dialects, and exploring ensemble models for enhanced accuracy.
Overall, this study reinforces the significance of integrating advanced NLP tools like
Indic-BERT for regional language processing, paving the way for broader adoption in
multilingual AI systems.
Figure 1 shows the training loss, validation loss, and validation accuracy over 8 epochs.
The training loss (red) decreases steadily, showing effective learning on the training
data.
The validation loss (orange) decreases initially but stabilizes after epoch 4, indicating
limited improvement on unseen data.
The validation accuracy (blue) increases rapidly early on and plateaus around 0.55
after epoch 4.
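A validation loss that stabilizes while training loss keeps falling, as described above, is the usual trigger for early stopping. The following is a minimal, patience-based sketch for illustration, not the project's actual training loop:

```python
def should_stop(val_losses, patience=2, min_delta=1e-3):
    """Return True when the best validation loss has not improved by at
    least min_delta over the last `patience` epochs."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge a plateau
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta
```

Applied to the curves in Figure 1, such a check would halt training shortly after epoch 4, where the validation loss stops improving.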