
Mini Proj

This mini-project report presents a sentiment analysis system for Kannada language text, focusing on code-mixed content using a transformer-based approach with the Indic-BERT model. The project aims to classify sentiments into positive, neutral, and negative categories while addressing challenges posed by linguistic diversity and code-mixing. The report includes acknowledgments, methodology, hardware and software requirements, and discusses the effectiveness of the proposed model in outperforming traditional methods.

Uploaded by

santoshpatil2003.2


A Mini-Project Report

Kannada Sentiment Analysis

Submitted by

Kishor Sinnur (U03NM21T006024), VI Sem, B.Tech (AIML)
Santosh Patil (U03NM21T006046), VI Sem, B.Tech (AIML)

Under the Guidance of

Dr. Kiran K
Associate Professor

Department of Computer Science and Engineering

University Visvesvaraya College of Engineering

K.R. Circle, Bangalore – 560001

Bangalore University

January 2025
Bangalore University
University Visvesvaraya College of Engineering
K.R. Circle, Bangalore – 560001

Department of Computer Science and Engineering

CERTIFICATE
This is to certify that Kishor Sinnur (U03NM21T006024) and Santosh Patil
(U03NM21T006046) have successfully completed the Mini-Project work entitled
“Kannada Sentiment Analysis”, in partial fulfillment of the requirements of the Mini
Project (21ECMP608) of VI Semester prescribed by the Bangalore University during the
Academic Year 2023 - 2024.

Guide: Dr. Kiran K, Associate Professor, Department of CSE, UVCE

Chairperson: Dr. Thriveni J, Professor, Department of CSE, UVCE

Examiners:

1. ………………………… 2. ...……………………….
ACKNOWLEDGEMENT

The knowledge and satisfaction that accompany the successful completion of any task would be
incomplete without acknowledging the invaluable contributions of individuals who made it possible. It is
with immense gratitude that we take this opportunity to express our heartfelt thanks to all those who
provided their support, guidance, and encouragement throughout the course of this mini project.

We are profoundly grateful to our project guide, Dr. Kiran K, Associate Professor, Department of
Computer Science and Engineering, UVCE, for his constant support, expert advice, and invaluable
insights that guided us through the challenges we faced during the project. His encouragement and
constructive suggestions were instrumental in shaping the direction and success of our work.

Our sincere thanks extend to Dr. Thriveni J, Chairperson and Professor, Department of Computer Science
and Engineering, UVCE, for her guidance, timely advice, and unwavering encouragement. Her insightful
feedback and support have been invaluable to the successful completion of this project.

We are deeply indebted to our esteemed Director, Prof. Subhasish Tripathy, for providing us with the
necessary infrastructure, resources, and the opportunity to carry out this project. His inspiring leadership
and constant motivation have been a driving force for us.

We would like to express our heartfelt gratitude to the Faculty members of the Department of Computer
Science and Engineering, UVCE, for their dedicated teaching and support throughout our academic
journey. Their vast knowledge and expertise have enriched our learning experience, laying the foundation
for this project.

We extend our special thanks to our Batchmates and Classmates, whose collaboration, discussions, and
camaraderie added value to our work and made the project journey enjoyable and fulfilling.

This mini project would not have been possible without the contributions of all these wonderful
individuals. We are truly grateful for the collective effort and support that made this endeavor a success.

TABLE OF CONTENTS

1. Introduction
2. Literature Review
3. Hardware and Software Requirements Specification
4. Proposed Work
5. Result and Performance
6. Conclusions
Bibliography
APPENDIX A : Code Snippet
APPENDIX B : Screenshots
ABSTRACT
Social media platforms have revolutionized the way individuals express their opinions
and share experiences. While English dominates as a primary language for online interactions,
a growing number of users now prefer expressing their views in native languages, including
Kannada. This shift has introduced unique challenges, particularly with the prevalence of code-
mixed texts that blend Kannada and English. Sentiment Analysis, which involves extracting
opinions and emotions from such posts, is complicated by the linguistic diversity and rich
structure of Kannada.

In this project, we present a transformer-based approach using the Indic-BERT model
to tackle the task of sentiment analysis for Kannada social media posts. The model is fine-tuned
on a labeled dataset to classify sentiments into three categories: positive, neutral, and negative.
Our pipeline incorporates robust pre-processing, tokenization, and balanced class weighting to
address the nuances of the Kannada language. Experimental results demonstrate the
effectiveness of this approach, achieving notable accuracy and precision, thereby
outperforming traditional machine learning methods. This work highlights the potential of
advanced deep learning architectures in promoting natural language processing for regional
languages like Kannada.
1. INTRODUCTION
Sentiment analysis is a crucial task in Natural Language Processing (NLP) that involves
identifying and interpreting emotions or opinions expressed in text. By combining text analysis,
statistics, and advanced computational techniques, sentiment analysis helps uncover insights
into public opinions, product reviews, and societal trends. With the rapid proliferation of social
media, sentiment analysis has emerged as an indispensable tool for analyzing user-generated
content.

In multilingual communities like India, where languages and dialects coexist, code-mixing
(the blending of native languages with English) is a common phenomenon. This is
particularly evident in social media texts where users often type in Romanized scripts for
convenience. While this facilitates easier communication, it poses significant challenges for
traditional NLP systems, as these are primarily designed for monolingual texts.

Kannada, one of the prominent Dravidian languages spoken in Karnataka, is widely
used on social media platforms. However, the code-mixed nature of Kannada-English texts,
often referred to as "Kanglish," introduces linguistic complexity that traditional sentiment
analysis systems struggle to handle. The intricate grammar and syntactic structure of Kannada,
coupled with its interleaving with English, necessitate robust and sophisticated computational
approaches.

This project focuses on developing a sentiment analysis system for Kannada code-
mixed texts using a transformer-based approach. By leveraging Indic-BERT, a pre-trained
model optimized for Indian languages, the system is fine-tuned to analyze social media content
and classify sentiments into positive, negative, or neutral categories. This approach addresses
the challenges posed by code-mixing and demonstrates the potential of advanced NLP
techniques in processing under-resourced languages like Kannada.

VI Sem AIML Aug-Dec 2024 1

2. LITERATURE REVIEW
Sentiment analysis, a key area of Natural Language Processing (NLP), has seen
significant advancements over the years, with a focus on English and other globally dominant
languages. However, regional languages like Kannada, spoken by millions in Karnataka, India,
remain underexplored due to limited resources and datasets. The complexity of Kannada's
linguistic structure and the growing prevalence of code-mixed texts, combining Kannada and
English ("Kanglish"), present unique challenges that researchers are beginning to address.

2.1 Sentiment Analysis in Regional and Dravidian Languages

Studies on sentiment analysis for Indian languages have primarily focused on Hindi, Tamil,
and Telugu, leveraging approaches ranging from traditional machine learning algorithms to
deep learning models. For instance, Tamil-English code-mixed sentiment analysis has been
explored using transformer models like Indic-BERT, achieving promising results in handling
linguistic diversity and syntactic complexity. These findings highlight the potential of pre-
trained transformer models in analyzing under-resourced languages.

2.2 Challenges in Kannada Sentiment Analysis

Kannada, like other Dravidian languages, poses distinct challenges for NLP systems:

• Code-Mixing: Social media platforms often feature Kannada text written in Roman
script, interleaved with English words. This mix complicates tokenization, syntactic
parsing, and sentiment classification.
• Lack of Resources: Kannada suffers from a scarcity of annotated datasets, hindering
the development of robust models. Existing datasets are often limited in size or
domain-specific, reducing their generalizability.
• Morphological Complexity: Kannada exhibits a highly inflectional morphology,
where word forms change based on tense, mood, gender, and case. Traditional NLP
techniques struggle with such variations.


Mini Project Kannada Sentiment Analysis

2.3 Existing Approaches for Kannada Sentiment Analysis

Early efforts in Kannada sentiment analysis relied on rule-based or statistical methods, using
lexicons to identify sentiment polarity. These methods, while simple, often failed to capture
the nuances of complex sentence structures and code-mixed content.

Later studies have moved from lexicons towards machine learning and deep learning. The
approaches reported so far fall into three groups:

• Lexicon-Based Approaches: Researchers have developed Kannada sentiment
lexicons to classify sentiments, but these are insufficient for handling the subtleties of
code-mixed texts.
• Machine Learning Models: Models like Naive Bayes, Support Vector Machines
(SVM), and Random Forests have been applied to sentiment classification tasks for
Kannada. These models, however, require substantial feature engineering and are
limited in their ability to handle the contextual meaning of words.
• Transformer Models: The advent of pre-trained transformer models, such as
Indic-BERT and mBERT, has revolutionized sentiment analysis for under-resourced
languages. Indic-BERT, in particular, is designed for Indian languages, making it a
promising tool for Kannada sentiment analysis. By fine-tuning on Kannada code-mixed
datasets, researchers have achieved improved sentiment classification performance.

2.4 Research Gaps

Despite these advancements, several gaps remain:

• Dataset Availability: Publicly available Kannada sentiment datasets are scarce,
especially for code-mixed texts.
• Code-Mixed Sentiment Analysis: Most existing systems are monolingual and fail to
address the challenges of code-switching at lexical and syntactic levels.
• Benchmarking Models: There is a need for systematic benchmarking of
transformer-based models like Indic-BERT and multilingual models like mBERT for
Kannada sentiment analysis.


3. HARDWARE AND SOFTWARE REQUIREMENTS

3.1 HARDWARE REQUIREMENTS
• Processor: Intel Core i5 or AMD Ryzen 5 and above (multi-core processors are
preferred for faster training and inference).
• Memory (RAM): 8 GB minimum; 16 GB or more recommended for efficient handling
of large datasets and model training.
• Graphics Processing Unit (GPU): NVIDIA GPU with CUDA support (e.g., RTX
3050 or higher) for faster deep learning model training.
• Storage: 256 GB SSD minimum for storing datasets, pre-trained models, and results;
512 GB or more recommended to handle larger datasets and backups.
• Network: Stable internet connection for downloading libraries, pre-trained models, and
datasets.

3.2 SOFTWARE REQUIREMENTS

• Operating System (OS): Windows 10/11, macOS, or Linux (Ubuntu 20.04 or later is
preferred for compatibility with deep learning frameworks).
• Programming Language: Python 3.8 or later
• Development Tools: Integrated Development Environment (IDE): VS Code, Jupyter
Notebook, or PyCharm for coding and testing.
• Libraries and Frameworks:
  o NLP Libraries: NLTK, spaCy, or Hugging Face Transformers
  o Deep Learning Frameworks: TensorFlow or PyTorch
  o Pre-trained Models: Indic-BERT, mBERT, or other transformer models for
  Indian languages.
  o Data Processing: Pandas, NumPy
  o Text Processing: Tokenizers, re (Regular Expressions)
  o Visualization: Matplotlib, Seaborn, Plotly for data analysis and results
  visualization.
• Database (if applicable): MongoDB or MySQL for storing processed data and results.
• Version Control: Git and GitHub for version control and collaboration.
• Virtual Environment Tools: Anaconda or venv to manage dependencies and avoid
conflicts.
• Pre-trained Model Resources: Hugging Face model hub or local fine-tuned
Indic-BERT/mBERT models.
• Additional Tools:
  o CUDA Toolkit for GPU acceleration (if using NVIDIA GPUs).
  o SentencePiece for subword tokenization, or FastText for subword-aware word
  embeddings (if applicable).

3.3 OPTIONAL REQUIREMENTS (FOR ENHANCED PERFORMANCE)

• Cloud Computing:
  o AWS EC2, Google Colab Pro, or Azure ML for additional computational
  resources, especially for large datasets or fine-tuning deep learning models.
• APIs:
  o Hugging Face API for using transformer models directly.
  o FastAPI or Flask for deploying the sentiment analysis model as a web service.


4. PROPOSED WORK
The goal of this work is to develop an effective sentiment analysis model for Kannada
language text using two transformer-based NLP models: Indic-BERT and AI4Bharat. This sentiment
analysis task specifically targets code-mixed text, which is commonly used in social media,
reviews, and other user-generated content, where Kannada is combined with English or other
languages. The complexity of handling code-mixed data necessitates the use of language models
pre-trained on Indian languages, including Kannada, such as Indic-BERT and AI4Bharat.
4.1 OBJECTIVE
The primary objective of this work is to build a sentiment analysis system capable of classifying
text in Kannada into categories like positive, negative, or neutral. The system will focus on
handling code-mixed data, which is typical in social media and informal communication.
4.2 PROPOSED METHODOLOGY
4.2.1 Data Collection and Pre-processing
• Data Collection: Collect a diverse and representative dataset of Kannada text from
social media platforms (e.g., Twitter, Facebook), forums, reviews, and other online
sources. The dataset will contain both pure Kannada text and code-mixed text
(combination of Kannada and English).
• Pre-processing: The text data will undergo the following pre-processing steps:
  o Lowercasing: Convert all text to lowercase to maintain uniformity (this affects
  only the Roman-script portions, since Kannada script has no letter case).
  o Tokenization: Break down the text into tokens using Kannada-specific
  tokenizers or the tokenizer from Indic-BERT.
  o Noise Removal: Remove special characters, punctuation, and unnecessary
  symbols.
  o Handling Code-Mixed Data: Use special techniques to separate Kannada
  words from English or other language segments in code-mixed text. This is
  crucial for accurate sentiment classification.
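
The pre-processing steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the function names `clean_text` and `tag_script`, the decision to drop URLs and @-mentions, and the script tags ("kn", "latin", "other") are assumptions made for the example. It keeps the Kannada Unicode block (U+0C80 to U+0CFF) explicitly, because Python's `\w` does not match Kannada vowel signs and a pattern such as `[^\w\s]` would silently strip them.

```python
import re
from typing import List, Tuple

# Kannada Unicode block, U+0C80 to U+0CFF (letters, vowel signs, digits).
KANNADA = "\u0C80-\u0CFF"
URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
# Anything that is not a digit, a lowercase Latin letter, Kannada, or whitespace.
NOISE = re.compile(rf"[^0-9a-z{KANNADA}\s]")
HAS_KANNADA = re.compile(rf"[{KANNADA}]")


def clean_text(text: str) -> str:
    """Lowercase, drop URLs/mentions and symbols, collapse whitespace."""
    text = text.lower()  # only affects Roman-script characters
    text = URL_OR_MENTION.sub(" ", text)
    text = NOISE.sub(" ", text)
    return " ".join(text.split())


def tag_script(text: str) -> List[Tuple[str, str]]:
    """Tag each cleaned token as Kannada script, Latin script, or other."""
    tagged = []
    for token in clean_text(text).split():
        if HAS_KANNADA.search(token):
            tagged.append((token, "kn"))
        elif token.isascii() and token.isalpha():
            tagged.append((token, "latin"))
        else:
            tagged.append((token, "other"))
    return tagged


print(clean_text("Ee movie TUMBA chennagide!!! http://t.co/x"))
print(tag_script("ಸಿನಿಮಾ super ಆಗಿದೆ"))
```

Note that a script tag cannot tell Romanized Kannada ("chennagide") from English ("super"); both come out as "latin", so separating them would need a word-level language identifier, which is outside this sketch.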
4.2.2 Feature Extraction
• Indic-BERT Embeddings: Use Indic-BERT, a transformer-based model pre-trained
on Indian languages, for embedding the Kannada text. Indic-BERT has been trained
specifically to capture the nuances of languages like Kannada and can handle
code-mixed content better than traditional models.
• AI4Bharat Model: Integrate the AI4Bharat model, another transformer model
trained for Indian languages, including Kannada. It provides contextual
embeddings that capture both the semantic and syntactic properties of Kannada text.
• Handling Code-Mixing: Code-mixed text will be processed by both Indic-BERT and
the AI4Bharat model to effectively capture the mixed-language nature. The embeddings
are intended to represent both Kannada and English parts in context, addressing the
challenge of code-switching.
4.2.3 Model Selection and Training
• Model Architecture: We will experiment with a BERT-based architecture using
both Indic-BERT and the AI4Bharat model. The architecture will consist of:
  o A pre-trained model (Indic-BERT or AI4Bharat) as the base.
  o Fine-tuning layers on top of the pre-trained model for sentiment classification.
  o A classification head to predict sentiment labels (positive, negative, neutral).
• Training: The model will be trained on the labeled Kannada dataset using an
appropriate optimizer (e.g., Adam; the code in Appendix A uses AdamW) and loss
function (Cross-Entropy Loss). The training will involve backpropagation to fine-tune
the weights of the model and adapt it to the Kannada sentiment analysis task.
4.2.4 Evaluation Metrics
• Accuracy: The percentage of correctly predicted sentiments.
• Precision, Recall, and F1-Score: These metrics will give a detailed understanding of
the model's performance, especially in distinguishing between positive, negative, and
neutral sentiments.
• Confusion Matrix: To analyze the model's ability to differentiate between sentiment
classes, a confusion matrix will be generated.
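
As a minimal sketch of how these metrics relate to one another, the pure-Python example below builds a confusion matrix (rows are actual classes, columns are predicted classes) and derives per-class precision, recall, and F1 from it. The function names and the toy labels are illustrative assumptions; the project's own code relies on scikit-learn's `classification_report` (see Appendix A).

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """matrix[actual][predicted] counts how often each pair occurs."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def per_class_metrics(matrix):
    """Precision, recall, F1, and support for every class in the matrix."""
    n = len(matrix)
    report = {}
    for c in range(n):
        true_positives = matrix[c][c]
        predicted_as_c = sum(matrix[row][c] for row in range(n))  # column sum
        actually_c = sum(matrix[c])                               # row sum
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall = true_positives / actually_c if actually_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": actually_c}
    return report


# Toy labels (made up): 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred)
metrics = per_class_metrics(matrix)
accuracy = sum(matrix[c][c] for c in range(3)) / len(y_true)
print(matrix, accuracy, metrics[1])
```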
4.2.5 Hyperparameter Tuning and Optimization
• Hyperparameter Tuning: Hyperparameters such as the learning rate, batch size,
number of layers, and dropout rate will be tuned using grid search and cross-validation
techniques to optimize model performance.
• Fine-Tuning Pretrained Models: We will fine-tune the pretrained Indic-BERT and
AI4Bharat models for Kannada-specific sentiment analysis. This process allows the
model to better understand the linguistic nuances and sentiment expression in Kannada.
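
The grid search mentioned above can be illustrated with a small sketch. Everything in it is an assumption made for the example: the grid values are hypothetical, and `dummy_evaluate` is only a stand-in for "fine-tune with these settings and return the validation accuracy", which in practice is the expensive step.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; not the values used in the project.
param_grid = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.3],
}


def dummy_evaluate(params):
    """Stand-in scorer: a real one would train the model and validate it."""
    return (-abs(params["learning_rate"] - 2e-5) * 1e4
            - abs(params["batch_size"] - 32) / 100
            - params["dropout"])


best_params, best_score = grid_search(param_grid, dummy_evaluate)
print(best_params, best_score)
```

With 3 x 2 x 2 = 12 combinations and one fine-tuning run per combination, the cost grows quickly, which is why grids for transformer fine-tuning are usually kept small.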


4.2.6 Results and Analysis

• After training and fine-tuning the model, we will evaluate the model's performance on
a test dataset of Kannada code-mixed text. The results will be analyzed in terms of
accuracy, precision, recall, and F1-score.
• The model's effectiveness in handling code-mixed content and distinguishing between
different sentiment categories (positive, negative, neutral) will be compared with other
state-of-the-art models to highlight the improvement in performance.

4.3 DATASET DESCRIPTION

The dataset used in this work is sourced from various platforms such as Twitter, forums, and
reviews. It contains Kannada text in both pure and code-mixed forms. The dataset comprises
the following features:
• textID: A unique identifier for each text entry.
• text: The main content representing user opinions in Kannada, often code-mixed with
English.
• sentiment: The manually annotated sentiment category:
  o Positive (ಧನಾತ್ಮಕ), Neutral (ತಟಸ್ಥ), or Negative (ಋಣಾತ್ಮಕ).
• sentiment_numeric: A numeric encoding for sentiment categories (2 for positive, 1
for neutral, and 0 for negative).
• Time of Tweet: Indicates when the text was posted (Morning, Afternoon, or Night).
• Age of User: Categorical age ranges (e.g., 0-20, 21-30).
• Country: The country of origin of the user.
• Population - 2020, Land Area (Km²), Density (P/Km²): Provide socio-geographic
context, aiding in exploratory analysis.

4.4 INDIC-BERT & AI4BHARAT MODELS FOR KANNADA SENTIMENT

4.4.1 Indic-BERT
Indic-BERT is a multilingual transformer-based language model pre-trained on a large
corpus of Indian languages, including Kannada. It is built on ALBERT, a parameter-efficient
variant of the BERT (Bidirectional Encoder Representations from Transformers) architecture.
Because it is pre-trained on Indian languages and then fine-tuned on task data, it is well
suited to tasks like sentiment analysis in Kannada, even in code-mixed scenarios.


4.4.2 Key Features of Indic-BERT:

• Pretraining on Indian Languages: Trained on 12 major Indian languages,
including Kannada, with a focus on capturing linguistic nuances.
• Code-Mixed Support: Handles code-mixed text (e.g., Kannada mixed with English)
by representing both languages' context and syntax.
• Tokenization: Utilizes a subword tokenizer optimized for Indian languages, ensuring
meaningful embeddings even for rare or compound words.
• Transfer Learning: By fine-tuning on task-specific datasets (e.g., Kannada sentiment
analysis), Indic-BERT adapts to classify text as positive, negative, or neutral.
4.4.3 How Indic-BERT is Used in Kannada Sentiment Analysis:
• Feature Extraction: Extracts semantic and syntactic embeddings for Kannada and
mixed-language text.
• Fine-Tuning: Fine-tuned on the annotated Kannada sentiment dataset, enabling the
model to classify sentiments accurately.
• Advantages for Kannada Text:
  o Captures the grammatical structure and context of Kannada sentences.
  o Addresses the challenge of limited annotated data by leveraging pretraining on
  a large corpus.

4.5 AI4Bharat Model

The AI4Bharat model is a transformer-based language model for low-resource Indian languages
like Kannada, from AI4Bharat, the research initiative that also released Indic-BERT. It
provides contextual embeddings for tasks such as sentiment analysis, emphasizing efficiency
and accuracy for regional languages.
4.5.1 Key Features of AI4Bharat:
• Focused on Indian Languages: Trained on datasets that include Indian scripts and
code-mixed text.
• Efficient Contextual Understanding: Designed to capture word and phrase
contexts, even in short, informal, or noisy Kannada texts from social media.
• Multilingual Capabilities: Handles Kannada-English code-mixing,
accommodating the bilingual nature of social media conversations.
• Lightweight Architecture: Optimized for faster inference, making it suitable for
real-time applications, though this study does not focus on deployment.


4.5.2 How AI4Bharat is Used in Kannada Sentiment Analysis:

• Embedding Extraction: Generates contextualized embeddings for Kannada words,
phrases, and sentences.
• Integration with Indic-BERT: Complements Indic-BERT by enhancing embeddings,
especially in scenarios where additional contextual understanding is needed.
• Sentiment Classification: Fine-tuned to classify sentiments in Kannada text into
categories (positive, negative, neutral) based on labeled data.


5. RESULT AND PERFORMANCE
5.1 Model Training Performance
• Training Loss: This metric measures how far the model's predicted class probabilities
are from the true labels on the training data. In our training loop:
  o The loss is computed using CrossEntropyLoss with class weights to address any
  imbalance in the dataset.
  o Training loss decreases over epochs, reflecting the model's improved
  fit to the training data.
• Validation Loss: Validation loss is calculated similarly but on unseen data (validation
set). A consistently decreasing validation loss indicates that the model is generalizing
well without overfitting.
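
To make the class-weighted loss concrete, the sketch below computes it by hand in plain Python. It follows our understanding of how PyTorch's `CrossEntropyLoss` behaves with a `weight` argument and the default mean reduction: each sample's loss is scaled by the weight of its true class, and the sum is divided by the sum of those weights. The function name and the toy logits are assumptions for illustration; the weights 1.0, 1.5, 1.0 are the ones used in Appendix A.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Mean cross-entropy where each sample counts by its true class's weight."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the logits.
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in row))
        sample_loss = log_sum_exp - row[target]  # equals -log softmax(row)[target]
        weighted_sum += class_weights[target] * sample_loss
        weight_total += class_weights[target]
    return weighted_sum / weight_total


# Toy batch (made up): both samples are scored as strongly Negative (class 0),
# but the second one is actually Neutral (class 1).
logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
targets = [0, 1]

plain = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, [1.0, 1.5, 1.0])
print(round(plain, 4), round(weighted, 4))
```

The weighted value is larger than the plain one because the mistake falls on the up-weighted Neutral class, which is exactly the effect the class weights are meant to have during training.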
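5.2 Accuracy
• Validation Accuracy: The model's accuracy is evaluated after each epoch on the
validation set. For each epoch:
  o $\text{Accuracy} = \dfrac{\text{Number of correct predictions}}{\text{Total number of validation samples}}$
  o A validation accuracy of around 80% - 90% would indicate a good understanding of
  the data, considering it's a multilingual sentiment dataset. In the run shown in
  Appendix B, validation accuracy plateaued at about 0.55.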
5.3 Classification Report
The classification_report outputs, for each class:
• Precision: Percentage of samples predicted as the class that actually belong to it.
• Recall: Percentage of samples that actually belong to the class and are predicted as
such.
• F1-Score: Harmonic mean of precision and recall, crucial for imbalanced datasets.
• Support: Number of true occurrences for each class.
For the three sentiment classes (Negative, Neutral, Positive), the report provides these
metrics separately. E.g., higher recall for Positive implies the model identifies
positivity better.
5.4 Confusion Matrix
The confusion matrix evaluates misclassifications:
• Rows = Actual classes.
• Columns = Predicted classes.
• Helps identify whether specific classes (e.g., Neutral) are more challenging to classify.


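5.5 Evaluation Metrics

• Weighted Cross-Entropy Loss: Class weights scale each sample's loss by the weight of
its true class, so errors on up-weighted classes are penalized more. The code in
Appendix A uses weights of 1.0, 1.5, and 1.0 for Negative, Neutral, and Positive.
• Learning Rate Scheduler: get_linear_schedule_with_warmup decays the learning rate
linearly to zero over training (after an optional warmup phase), which supports smooth
optimization, leading to stable convergence.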
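
The shape of this schedule can be written out in a few lines. The sketch below is a plain-Python reimplementation of the multiplier that, to our understanding, `get_linear_schedule_with_warmup` applies to the base learning rate; the function name `linear_warmup_factor` is an assumption for illustration. Appendix A uses a base rate of 2e-5 with zero warmup steps, so the rate there starts at its maximum and only decays.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 100  # hypothetical: len(train_loader) * num_epochs

# No warmup, as in Appendix A.
for step in (0, 50, 100):
    print(step, base_lr * linear_warmup_factor(step, 0, total_steps))

# With 10 warmup steps (not used in the project) the rate ramps up first.
print(linear_warmup_factor(5, 10, total_steps))
```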
5.6 Test Predictions
The predict_sentiment function shows real-time sentiment predictions:
 Example: For the Kannada text " ಇದ ಂದ ೕ ೕ ೕ
ಪ ೕಜನವ ಕಂ ಂ ಲ " (I didn’t find any benefit from this), the model
predicted the sentiment as Negative. This aligns with the input's context.
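
The predict_sentiment function itself is not listed in Appendix A. As a hedged sketch of its final step only, the example below turns a vector of three output logits into a label and a confidence using softmax and argmax. The label order follows the dataset encoding (0 = Negative, 1 = Neutral, 2 = Positive); the function names and the example logits are assumptions, and tokenization and the model forward pass are deliberately left out.

```python
import math

# Label order follows sentiment_numeric in the dataset description.
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits standing in for the model output on one sentence.
label, confidence = decode_prediction([2.0, 0.5, -1.0])
print(label, round(confidence, 3))
```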
5.7 Observations
• Strengths:
  o The model leverages the Indic-BERT pretrained model, fine-tuned specifically
  for Kannada, capturing subtleties of the language.
  o Class weights improve the model’s robustness against imbalanced datasets.
• Challenges:
  o The model’s performance can vary depending on the dataset's quality and
  diversity. Kannada slang, mixed-language inputs, or ambiguous sentiments may
  require additional preprocessing or data augmentation.
  o Validation accuracy should remain stable; a significant drop while the training
  loss keeps decreasing indicates overfitting.
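
One common way to act on this observation is to stop training once the validation loss has not improved for a few epochs. The sketch below shows such a patience-based check in plain Python. It is not part of the project's training loop, which always runs the full number of epochs and keeps the checkpoint with the best validation accuracy; the function name, the patience value, and the loss values are made up for illustration.

```python
def best_epoch_with_patience(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training would stop at stop_epoch, the first epoch at which the
    validation loss has failed to improve for `patience` epochs in a row.
    """
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses over 8 epochs: improving, then drifting up.
val_losses = [1.10, 0.98, 0.95, 0.94, 0.96, 0.97, 0.99, 1.00]
best_epoch, stop_epoch = best_epoch_with_patience(val_losses, patience=2)
print(best_epoch, stop_epoch)
```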
5.8 Suggestions for Improvement
• Data Augmentation: Use techniques like back-translation or synonym replacement to
expand the dataset.
• Hyperparameter Tuning: Experiment with learning rates, batch sizes, and epochs.
• Advanced Models: Consider fine-tuning larger models like IndicBERT-v2 for
improved language-specific performance.
• Ensemble Methods: Combining multiple models can help in edge cases and improve
robustness.
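
As one possible form of the ensemble idea, the sketch below averages the class probabilities of several models ("soft voting") and picks the class with the highest average. This is an illustration under assumptions, not something the project implemented: the function name, the optional per-model weights, and the probability values are all hypothetical.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-class probabilities across models and pick the top class."""
    n_models = len(prob_lists)
    weights = weights or [1.0] * n_models
    total_weight = sum(weights)
    n_classes = len(prob_lists[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total_weight
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged


# Made-up probabilities for [Negative, Neutral, Positive] from two models
# that disagree on their individual top class.
model_a = [0.5, 0.4, 0.1]   # leans Negative
model_b = [0.1, 0.6, 0.3]   # leans Neutral
winner, averaged = soft_vote([model_a, model_b])
print(winner, averaged)
```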


6. CONCLUSIONS

The Kannada review sentiment analysis project showcases the ability to apply natural
language processing (NLP) techniques for understanding sentiments in Kannada. By
leveraging Indic-BERT, a transformer-based language model tailored for Indian languages,
the system effectively classifies sentiments into `Positive`, `Neutral`, and `Negative`
categories.
The project utilized a dataset of Kannada text reviews and employed rigorous
pre-processing, tokenization, and data handling techniques. The model was fine-tuned with a
class-weighted loss to offset class imbalance. Key steps such as custom loss functions with
class weights, a linear learning-rate schedule, and tuned hyperparameters supported stable
training. Validation metrics, including classification reports and confusion matrices, were
used to assess performance across sentiment classes; in the run shown in Appendix B,
validation accuracy plateaued at about 0.55, which leaves clear room for improvement.
This sentiment analysis system holds potential for real-world applications, such as
analyzing feedback on Kannada-language platforms, monitoring public sentiment on social
media, and improving user experiences in regional markets. However, the project also
identifies areas for improvement, including expanding the dataset, addressing linguistic
diversity in Kannada dialects, and exploring ensemble models for enhanced accuracy.
Overall, this study reinforces the significance of integrating advanced NLP tools like
Indic-BERT for regional language processing, paving the way for broader adoption in
multilingual AI systems.


BIBLIOGRAPHY
[1] Transformers Documentation by Hugging Face: https://huggingface.co/docs/transformers
[2] AI4Bharat. IndicNLP Suite and Indic-BERT: https://ai4bharat.iitm.ac.in/models
[3] Python Libraries for Machine Learning
• Scikit-learn Documentation: https://scikit-learn.org/
• PyTorch Documentation: https://pytorch.org/docs/stable/index.html
[4] Kannada Sentiment Analysis Research: Vishwakarma (2022). Sentiment Analysis on Indian
Languages: A Review. International Journal of Advanced Research in Computer Science, Vol
13(3).
[5] Text Classification and NLP Techniques: Goldberg, Y. (2017). Neural Network Methods for
Natural Language Processing. Morgan & Claypool Publishers.
[6] Data Preprocessing and Handling in NLP: Mikolov, T., et al. (2013). Efficient Estimation of
Word Representations in Vector Space. arXiv preprint. Available at:
https://arxiv.org/abs/1301.3781.
[7] Sentiment Analysis Projects: Kaggle. Sentiment Analysis Datasets and Projects. Available at:
https://www.kaggle.com.
[8] TQDM for Progress Monitoring Documentation: https://tqdm.github.io
[9] Pandas and NumPy Documentation: Pandas, NumPy
[10] Custom Loss Functions and Weighted Training: Goodfellow, I., Bengio, Y., and Courville, A.
(2016). Deep Learning. MIT Press.


APPENDIX A : CODE SNIPPETS
python | initialize data
import torch
from torch.utils.data import Dataset


class KannadaDataset(Dataset):
    """Wraps raw texts and integer labels as tokenized tensors for the model."""

    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        # Pad or truncate every sample to max_len so batches stack cleanly.
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            # flatten() drops the batch dimension added by return_tensors='pt'.
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

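python | data preparation

import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_data(data_path):
    df = pd.read_csv(data_path)
    # Build the numeric label column if the CSV only has string labels.
    if 'sentiment_numeric' not in df.columns:
        sentiment_map = {
            'positive': 2,
            'pos': 2,
            'neutral': 1,
            'neu': 1,
            'negative': 0,
            'neg': 0
        }
        # Normalize case and whitespace so that e.g. 'Positive ' still maps to 2.
        labels = df['sentiment'].str.strip().str.lower()
        df['sentiment_numeric'] = labels.map(sentiment_map)
    # Drop rows with missing text or labels that could not be mapped.
    df = df.dropna(subset=['text', 'sentiment_numeric']).copy()
    # map() yields floats when any label was unmapped; restore integer class ids.
    df['sentiment_numeric'] = df['sentiment_numeric'].astype(int)
    # Stratified 80/20 split keeps the class proportions equal in both sets.
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        df['text'].values,
        df['sentiment_numeric'].values,
        test_size=0.2,
        stratify=df['sentiment_numeric'].values,
        random_state=42
    )
    return train_texts, val_texts, train_labels, val_labels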

python | model trainging

# Imports used by train_model
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import classification_report
from torch.optim import AdamW
from tqdm import tqdm
from transformers import get_linear_schedule_with_warmup


def train_model(train_loader, val_loader, model, device, num_epochs=8):
    optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
    total_steps = len(train_loader) * num_epochs
    # Decay the learning rate linearly to zero, with no warmup steps
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=total_steps
    )
    # Give the Neutral class (index 1) a higher weight in the loss
    class_weights = torch.tensor([1.0, 1.5, 1.0], device=device)
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    best_accuracy = 0
    for epoch in range(num_epochs):
        print(f'Epoch {epoch + 1}/{num_epochs}')

        # Training phase
        model.train()
        total_train_loss = 0
        for batch in tqdm(train_loader, desc='Training'):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            model.zero_grad()
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            # Use the class-weighted loss instead of the model's built-in loss
            loss = criterion(outputs.logits, labels)
            total_train_loss += loss.item()
            loss.backward()
            # Clip gradients to a maximum norm of 1.0
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
        avg_train_loss = total_train_loss / len(train_loader)

        # Validation phase
        model.eval()
        total_val_loss = 0
        val_predictions, val_true_labels = [], []
        with torch.no_grad():
            for batch in tqdm(val_loader, desc='Validation'):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels
                )
                loss = criterion(outputs.logits, labels)
                total_val_loss += loss.item()
                logits = outputs.logits
                preds = torch.argmax(logits, dim=1)
                val_predictions.extend(preds.cpu().numpy())
                val_true_labels.extend(labels.cpu().numpy())
        avg_val_loss = total_val_loss / len(val_loader)
        val_accuracy = np.mean(
            np.array(val_predictions) == np.array(val_true_labels)
        )
        print(f'Training Loss: {avg_train_loss:.4f}')
        print(f'Validation Loss: {avg_val_loss:.4f}')
        print(f'Validation Accuracy: {val_accuracy:.4f}')
        print(classification_report(
            val_true_labels, val_predictions,
            target_names=['Negative', 'Neutral', 'Positive']
        ))

        # Keep the checkpoint with the best validation accuracy so far
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            torch.save(model.state_dict(), 'best_model.pt')
    return model
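
The training loop above weights the Neutral class by 1.5 in the loss. As a rough illustration of what that weighting does, the following pure-Python sketch computes a weighted mean cross-entropy in the way documented for `nn.CrossEntropyLoss` with a `weight` tensor and mean reduction (weighted per-sample losses divided by the sum of the weights of the true classes). The helper name `weighted_cross_entropy` and the logits are hypothetical, chosen only for this example.

```python
import math


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean cross-entropy over a batch (illustration only)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp of the logits
        max_logit = max(row)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        # Negative log-likelihood of the true class
        nll = log_norm - row[label]
        weighted_sum += weights[label] * nll
        weight_total += weights[label]
    return weighted_sum / weight_total


# Hypothetical logits for two samples; class order: Negative, Neutral, Positive
logits = [[2.0, 0.0, 0.0],   # true class Negative, predicted correctly
          [2.0, 0.0, 0.0]]   # true class Neutral, misclassified as Negative
labels = [0, 1]

plain = weighted_cross_entropy(logits, labels, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, [1.0, 1.5, 1.0])
print(f'unweighted: {plain:.4f}, weighted: {weighted:.4f}')
```

With the 1.5 weight, the misclassified Neutral sample contributes more to the batch loss, so errors on that class are penalized more heavily during training.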


APPENDIX B : SCREENSHOT

Fig 1. Training and Validation Metrics Across Epochs

Figure 1 shows the training loss, validation loss, and validation accuracy over 8 epochs.

• The training loss (red) decreases steadily, showing effective learning on the training data.
• The validation loss (orange) decreases initially but stabilizes after epoch 4, indicating limited improvement on unseen data.
• The validation accuracy (blue) increases rapidly early on and plateaus around 0.55 after epoch 4.
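
A plateau like the one described above is a common motivation for early stopping. The sketch below is not part of the project code, which trains for a fixed 8 epochs and saves the checkpoint with the best validation accuracy. It shows one possible patience-based rule; the function name `best_epoch_with_patience`, the `patience` value, and the loss values are all assumptions made for illustration.

```python
def best_epoch_with_patience(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember this epoch
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses that flatten after epoch 4 (not the report's data)
toy_losses = [1.10, 1.02, 0.97, 0.95, 0.95, 0.96, 0.95, 0.96]
stop_epoch, best_epoch = best_epoch_with_patience(toy_losses, patience=2)
print(stop_epoch, best_epoch)
```

Under this rule, training on the toy curve would stop two epochs after the last improvement, which saves compute when further epochs no longer help on unseen data.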


