Project Report
Sentiment Analysis
A PROJECT REPORT
Submitted by
Rahul S RA1911003040114
Kevin Kurien Joseph RA1911003040040
Siddhaarthan V RA1911003040098
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
of
FACULTY OF ENGINEERING AND TECHNOLOGY
MAY 2023
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
BONAFIDE CERTIFICATE
This sheet must be filled in, signed, and dated, along with the student registration
number; work will not be marked unless this is done.
To be completed by the student for all assessments
Degree/ Course :
Student Name :
Registration Number :
Title of Work :
I / We hereby certify that this assessment complies with the University’s Rules
and Regulations relating to academic misconduct and plagiarism, as listed on
the University website, Regulations, and the Education Committee guidelines.
I / We confirm that all the work contained in this assessment is my / our own
except where indicated, and that I / We have met the following conditions:
I understand that any false claim for this work will be penalized in
accordance with the University policies and regulations.
DECLARATION:
I am aware of and understand the University’s policy on Academic misconduct and plagiarism
and I certify that this assessment is my / our own work, except where indicated by referring,
and that I have followed the good academic practices noted above.
If you are working in a group, please write your registration numbers and sign with
the date for every student in your group.
ACKNOWLEDGEMENT
We sincerely thank all faculty of DCSE, staff and students of the department who have
directly or indirectly helped our project.
Finally, we would like to thank our parents, our family members and our friends for their
unconditional love, constant support and encouragement.
Rahul S
Kevin Kurien Joseph
Siddhaarthan V
ABSTRACT
Interest in the opinions people express in tweets has grown recently because of the opinionated
data available on blogs and social media platforms. With everyone having access to the internet
and the ability to tweet whatever they want, Twitter is one of the most widely used information
sources in the world today, which raises the likelihood that people will be misled.
It would be very helpful to establish the general attitude of the populace,
particularly during times of panic like the pandemic, in order to better grasp how ready
people are to face such a potentially disastrous crisis. Analysing this kind of data manually is
time-consuming, and categorising opinions according to their polarity is challenging, so machine
learning methods are typically applied to such problems. This project was created to analyse
the moods of Twitter users in response to COVID-19's reappearance using ML techniques and
NLP tools. The perspectives of different people can diverge. We train the system on historical
data so that it can classify the sentiment of a text. Machine learning methods are crucial to
this project's ability to classify data. The Random Forest Classifier was chosen as the
framework's preferred ML approach, with an accuracy of 88%. We assessed the framework using a
variety of metrics, including accuracy, precision, recall, and F1-score.
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Overview
1.2 Natural Language Processing
1.3 Machine Learning
1.4 Motivation
2 LITERATURE SURVEY
3 SYSTEM ARCHITECTURE AND DESIGN
3.1 System Architecture
3.2 Problem Statement
3.3 Existing System
3.4 Challenges of the Existing System
3.5 Proposed System
3.6 Architecture Diagram
3.7 UML Diagrams
3.7.1 Class Diagram
3.7.2 Activity Diagram
3.7.3 Sequence Diagram
3.7.4 Entity Relationship Diagram
4 METHODOLOGY
4.1 Modules
4.1.1 Data Pre-processing
4.1.2 EDA of Visualization
4.1.3 Implementing Logistic Regression
4.1.4 Implementing Catboost
4.1.5 Implementing Random Forest
4.1.6 Deployment Using Flask
5.1 Data Pre-processing
5.2 EDA of Visualization
5.3 Training and Testing Algorithms
5.3.1 Logistic Regression
5.3.2 Catboost
5.3.3 Random Forest
5.4 Deployment Using Flask
6 RESULTS AND DISCUSSIONS
7 CONCLUSION AND FUTURE ENHANCEMENT
7.1 Conclusion
7.2 Future Work
REFERENCES
APPENDIX
A CONFERENCE PUBLICATION
B JOURNAL PUBLICATION
C PLAGIARISM REPORT
LIST OF FIGURES
3.1 General Architecture Diagram
3.2 Random Forest Architecture Diagram
3.3 Structure of Random Forest
3.4 Class Diagram
3.5 Activity Diagram
3.6 Sequence Diagram
3.7 ERD Diagram
4.1 Implementation of Logistic Regression
4.2 Classification Report of Logistic Regression
4.3 Confusion Matrix - Logistic Regression
4.4 Implementation of Catboost Algorithm
4.5 Classification Report of Catboost Algorithm
4.6 Confusion Matrix - Catboost Algorithm
4.7 Implementation of Random Forest
4.8 Classification Report of Random Forest
4.9 Confusion Matrix - Random Forest
4.10 UI Screen of Input
4.11 UI Screen of Output
6.1 Accuracy Comparison Bar
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1 Overview
The widespread transmission of the coronavirus sparked a variety of ideas and feelings.
Due to the sheer scale of the COVID-19 pandemic, confusion and panic spread quickly
through society. People from many countries responded on social networks
in different ways. The COVID-19 epidemic has helped to highlight urban dwellers'
vulnerabilities and constitutes a significant public health risk. During the
pandemic, people's emotions varied, which led to mental problems including fear, worry,
and many other distressing symptoms. Twitter posts including the phrases "updates about
confirmed cases," "COVID-19-related death," "early signs of the outbreak," "economic
impact," and "preventive measures" reveal feelings of dread and fear. Additionally, public
thoughts about COVID-19 news posted on microblogging platforms have the power to
propagate different points of view. In this work, we describe a sentiment analysis of
COVID-19 that uses natural language processing techniques. By examining news articles
and social media data, we hope to assess how the general public is feeling about
COVID-19. We examine the feelings people have about the pandemic and pinpoint the key issues
and themes that are influencing perception. User-generated content on social media is
important because it can serve as a significant information source during times of crisis,
and social media and microblogging platforms like Facebook and Twitter have been widely
adopted. Selecting a dataset at first presented us with some difficulties, since the
majority of the available data were from early 2020 to late 2021. We therefore built our
dataset from more than 2500 English-language tweets about Covid-19 that were sent
during October and December of 2022. From the word corpus, we identified the
most widely used words. We then assigned sentiment scores to our preprocessed
tweets and classified the dataset based on the
polarity of their sentiment. Finally, we trained our model using those tweets and their
sentiment scores.
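The word-frequency and polarity-labelling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example tweets, and the 0.05 polarity threshold are hypothetical and are not taken from the project's code.

```python
from collections import Counter


def most_common_words(tweets, n=3):
    """Count word occurrences across already-preprocessed tweets."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return counts.most_common(n)


def polarity_to_label(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label (threshold is an assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Hypothetical preprocessed tweets, for illustration only.
tweets = [
    "covid cases rising again",
    "vaccine news gives hope",
    "covid lockdown fear again",
]
print(most_common_words(tweets, n=2))
print([polarity_to_label(s) for s in (0.6, -0.4, 0.0)])
```

The labels produced this way would then serve as training targets for the classifier.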
1.2 Natural Language Processing
Natural Language Processing, also known as NLP, is the study of how computers interact
with human language. It involves assessing, interpreting, and creating human language
using computational linguistics and machine learning methods. NLP attempts to make it
possible for machines to understand and interpret human language in order to improve
communication between humans and machines. It is utilized in a wide range of
applications, including chatbots, virtual assistants, speech recognition systems, sentiment
analysis, and machine translation. All of these applications use natural language
processing techniques to convert human language into a machine-readable format
and provide accurate responses.
NLP is a rapidly developing field that could greatly improve interaction between
machines and humans in the coming years. Advances in machine learning have greatly
improved the performance of NLP algorithms. Researchers continue to come up with new
techniques for processing and understanding human language.
The main challenge in NLP is handling the ambiguity of human language. The context in
which words are used is crucial, as the exact same words can be interpreted in a variety of
ways depending on it. NLP algorithms must be able to accurately identify the intended
meaning of words and phrases in order to produce meaningful responses. NLP involves
several subfields, including discourse analysis, syntactic analysis, and semantic analysis.
Syntactic analysis deals with analysing the grammatical structure of
sentences to identify the relationships between words. Semantic analysis aims to
understand the meaning of words and phrases as well as the connections between
them.
1.3 Machine Learning
Machine learning has proven indispensable in the study of NLP and has
been used for a number of tasks, including the classification of sentiments. One of the key
benefits of using machine learning in sentiment classification is its ability to process and
examine large sets of raw input rapidly with decent accuracy, allowing for more effective
and efficient identification and classification of tweet sentiments. Additionally, ML
models can be trained to recognize patterns and relationships in the data, enabling them
to improve their accuracy over time. This is particularly important as the accuracy needs
to be maintained even if the size of the input data increases substantially.
The use of machine learning also allows for the development of more nuanced and
comprehensive classification models. The multi-sentiment model presented in this
paper gives users a more flexible and adaptable approach. On the whole, the
significance of ML for tweet sentiment classification lies in its ability to process and
examine vast amounts of data rapidly and accurately, refine itself over time, and support
more nuanced and comprehensive classification models.
The model used in this paper is the Random Forest Classifier, which is well suited to
processing the vectorized data and produces its result by averaging
the outputs of multiple decision trees, which improves both the accuracy and
the efficiency of the model.
1.4 Motivation
In light of the unexpected increase in cases currently occurring in China, we wanted
to broaden our focus in order to better understand how the world felt about Covid-19
at the time. We use Twitter to gather a consensus on how the world
feels about the possibility of its revival and about another possible global shutdown.
By reaching a consensus, we hope to convey
to the user the general opinion held by those who have posted a given tweet or other
related tweets about the widespread illness. The COVID-19 pandemic
has also generated a great deal of false information and fake news, which has left
people perplexed and afraid. Sentiment analysis can assist in identifying such
inaccurate and misleading information and in countering it by supplying the public
with reliable and accurate information.
CHAPTER 2
LITERATURE SURVEY
A number of nations declared total lockdown in an effort to contain the Covid-19
epidemic as it spread rapidly across the globe and claimed many lives each
day. Because social media was a prevalent means of emotional expression during the
lockdown, these channels were crucial in the global dissemination of information
concerning the epidemic. The authors created an experimental methodology to examine Twitter
users' reactions, taking into account the phrases that are frequently used to refer to
the epidemic, either directly or indirectly. The study presents a sentiment analysis of a
sizable number of tweets mentioning the coronavirus or Covid-19. To evaluate the public
opinion trend on topics pertinent to the Covid-19 epidemic, the authors first apply an
evolutionary categorization followed by an n-gram analysis. The sentiment scores for the
collected tweets were then calculated based on each tweet's class. A long short-term
memory (LSTM) network was trained on two different types of rated tweets in order to
estimate sentiment on Covid-19 data, attaining an overall accuracy of 84.46%.
2.3 Title : Public Sentiment Analysis on Twitter Data during COVID-19 Outbreak
Author: Mohammad Abu Kausar, Arockiasamy Soosaimanickam
Year : 2021
The global community is now dealing with the COVID-19 pandemic, sometimes referred
to as the coronavirus pandemic. The outbreak was first discovered in Wuhan, China, in
December 2019. On March 11, 2020, the World Health Organisation declared it
a pandemic. The COVID-19 virus killed hundreds of thousands of people and infected
millions of others in the US, Brazil, Russia, India, and other nations. Millions of people
are still being affected by this pandemic, and several nations have implemented either a
partial or total lockdown. During this lockdown, people expressed their feelings and
views via social media in an effort to calm themselves. In this study,
tweets from people in the top ten affected countries and one Gulf country, the
Sultanate of Oman, were analysed for sentiment. A dataset of more than 50,000
tweets with hashtags like #covid-19, #COVID19, #CORONAVIRUS, #CORONA,
#StayHomeStaySafe, #Stay Home, #Covid_19, #CovidPandemic, #covid19, #Corona
Virus, #Lockdown, #Qurantine, #qurantine, #Coronavirus Outbreak, #COVID etc.,
posted between June 21, 2020 and July 20, 2020, was considered in this research. The
tweets in English were used as the source for the sentiment analysis. The survey was done
to learn how individuals in the many affected nations view the problem. The tweets were
collected, prepared, and subjected to text mining algorithms before the sentiment analysis
was carried out and the findings were displayed. This study attempts to better understand
the perspectives of individuals from nations with COVID-19 infections.
The COVID-19 epidemic is unlike any other in that it has caused a significant number of
fatalities. Thanks to Twitter, which has developed into an important platform for public
interactions, researchers now have the opportunity to investigate how the general public
is reacting to the outbreak. The researchers reviewed over 100,000 tweets with the hashtags
#coronavirus, #coronavirusoutbreak, #coronavirusPandemic, #COVID19, #COVID-19,
#epitwitter, #ihavecorona, #StayHomeStaySafe, and #TestTraceIsolate. Python, Google
NLP, and NVivo were used to conduct sentiment analysis
and theme analysis. According to the data, 29.56% of tweets were positive, 29.54%
were mixed, 23.23% were neutral, and 18.06% were unfavourable. Popular
search terms include "cases", "home", "people", and "help". The study uncovered 30
topics and grouped them into three categories: public health, COVID-19
globally, and the number of cases and deaths. This study shows how
Twitter data and an NLP method can be used to examine public discourse and opinions
during the COVID-19 pandemic. Real-time analysis can reduce false signals and improve
the effectiveness of giving people the right guidance.
The global spread of COVID-19 has had a serious impact on
educational systems. This study presents a thorough sentiment analysis of tweets
about education posted on Twitter and draws conclusions on how the epidemic
affected people's emotions as it progressed over the course of the months. More than
90,000 tweets about the circumstances underlying the change in
educational systems around the world were collected from Twitter. Natural Language
Toolkit (NLTK) capabilities and a Naive Bayes classifier were used to perform sentiment
analysis on the dataset.
The results of the investigation show how COVID-19 affected
education and how people's attitudes changed as a result of the modifications made to the
educational system. The aim is to provide a deeper knowledge of people's attitudes towards
education in order to better tackle the epidemic under these extraordinary conditions.
CHAPTER 3
SYSTEM ARCHITECTURE AND DESIGN
3.3 Existing System
Existing systems mostly rely either on rudimentary approaches, such as bag-of-words
(BoW) features with decision trees, or on complex methods such as LSTM-based
classification, which is built on neural networks and has high time complexity. They
apply lemmatization together with tokenization and stop-word
removal, and they use the Plotly package in Python to visualize the
variations in the dataset.
Dataset:
The dataset, containing tweets about Covid-19 from 2021 onwards, is filtered and cleaned
to remove unnecessary fields and additional attributes. The dataset is acquired from
Kaggle.
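As a rough illustration of the filtering and cleaning step, the sketch below keeps only the fields of interest and drops rows with empty tweet text. The column names (user_name, date, text, source) and the sample rows are hypothetical; the actual Kaggle export may use different fields.

```python
import csv
import io

# Hypothetical raw export; real Kaggle column names may differ.
RAW = """user_name,date,text,source
a,2022-10-01,Covid cases rising,Twitter Web App
b,2022-10-02,,Twitter for Android
c,2022-12-05,Feeling hopeful about vaccines,Twitter for iPhone
"""

KEEP = ("date", "text")  # assumed fields of interest


def clean_rows(csv_text, keep=KEEP):
    """Keep only the wanted columns and drop rows whose tweet text is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {field: row[field].strip() for field in keep}
        for row in reader
        if row["text"].strip()
    ]


rows = clean_rows(RAW)
print(len(rows), rows[0])
```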
Fig 3.1: General Architecture Diagram
Random Forest is one of the most popular algorithms used by data scientists.
It is a supervised machine learning algorithm that is widely used in
classification and regression problems. Random Forest works on the bagging
principle. Bagging draws a random sample (a random subset) from the entire data
set, so each model is built from a sample (bootstrap sample) drawn
from the original data with replacement, which is known as row sampling. Random Forest is
a classifier that contains a number of decision trees built on various subsets of the
given dataset and takes the average of their outputs to improve predictive
accuracy.
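The row-sampling idea behind bagging can be sketched as follows. This is a minimal illustration using only the standard library; the data, the seed, and the number of samples are arbitrary assumptions, and a real random forest would go on to fit one decision tree per sample.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (row sampling)."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = list(range(10))  # stand-in for ten training rows
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # one sample per tree

for s in samples:
    # Sampling with replacement usually repeats some rows and omits others.
    print(sorted(s), "unique rows:", len(set(s)))
```

Because each sample repeats some rows and leaves others out, the trees trained on them differ from one another, which is what makes aggregating their outputs useful.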
Fig 3.4 Class Diagram
Fig 3.6 Sequence Diagram
CHAPTER 4
METHODOLOGY
4.1 MODULES
2) Case Folding: A term's polarity (positive, negative, or neutral) can be significantly
influenced by the uppercase and lowercase variations of that word. Depending on the
nature of the corpus and its intended use, some words in the text may be changed from
uppercase to lowercase or vice versa. Our dataset consists of short tweets that
typically use lowercase and follow a standard communication style. To perform
case folding, we converted the capitalised words in all tweets to lowercase.
3) Tokenization: We broke down each tweet message into tokens (words) so that machine
learning (ML) algorithms could process each word individually.
4) Stemming: Stemming is the process of reducing a word to its stem, also referred to
as its base or root form. It normalises words so that related terms
(like "run", "runs", and "running") are treated as one and the same. This can help text
analysis tasks like sentiment analysis, information retrieval, and text classification
perform better.
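The case folding, tokenization, and stemming steps above can be sketched as a small pipeline. The suffix list and the naive_stem helper are toy assumptions made for illustration; they are a crude stand-in for a real stemmer (such as the Porter stemmer) and are not the stemmer used in this project.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # toy suffix list, for illustration only


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final letter, e.g. "running" -> "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def preprocess(tweet):
    """Case-fold, tokenize, and stem a tweet."""
    lowered = tweet.lower()  # case folding
    tokens = re.findall(r"[a-z0-9']+", lowered)  # tokenization
    return [naive_stem(tok) for tok in tokens]  # stemming


print(preprocess("Running RUNS run"))
```

A toy stemmer like this over-strips many words (for example, it shortens "this" to "thi"), which is why an established stemming algorithm is preferable in practice.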
Data visualisation is a vital skill in applied statistics and machine learning. Statistics
puts a lot of emphasis on numerical estimations and data descriptions, while data
visualisation offers a crucial set of tools for gaining qualitative insight. This can aid in
understanding the distribution of the data in the dataset. Visualisations can express
and convey crucial relationships through plots and charts that are more immediate and
pertinent to stakeholders than measures of association or relevance.
They can even be used to check the training dataset for data that is unrelated to the
model.
Despite what its name might imply, logistic regression is a statistical classification method
that produces the probability of the desired output class (the dependent variable).
The logistic regression model is trained on the labelled
dataset of tweets using the collected features. During training, the model learns to adjust
the feature weights to reduce the discrepancy between its predictions and the actual labels
in the training set. After the model has been trained, the output of the linear model is
passed through the sigmoid function, which converts it to a probability value
between 0 and 1. The sigmoid function has the following formula:
sigmoid(z) = 1 / (1 + exp(-z)),
where z is the output of the linear model. Finally, the sentiment of a given tweet is
predicted using the probability value returned by the sigmoid function.
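The sigmoid step described above can be sketched as follows. The weights, bias, feature values, 0.5 decision threshold, and the two-class positive/negative labelling are hypothetical choices made for illustration; they are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one feature vector."""
    # Linear model output: weighted sum of the features plus a bias term.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return p, ("positive" if p >= threshold else "negative")


# Hypothetical weights for two features; not taken from the trained model.
prob, label = predict([1.0, 0.0], weights=[2.0, -1.5], bias=-0.5)
print(round(prob, 3), label)
```

A score of z = 0 maps to a probability of exactly 0.5, so the sign of the linear output decides the predicted class under this threshold.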
Fig 4.2 Classification Report of Logistic Regression
CatBoost is a gradient boosting method that can be used for sentiment analysis and other
binary or multi-class classification applications. During training, CatBoost iteratively
builds a number of decision trees that estimate the sentiment of the tweet from the input
features, and in doing so it learns to minimise the log loss (cross-entropy loss) function.
CatBoost can also reveal which input features were most important for predicting the
sentiment of the tweets. This can be helpful for figuring out which elements of the tweets
have the greatest effect on sentiment. Its benefits for sentiment analysis include the
handling of categorical features, the handling of missing values, and the provision of
feature importance analysis. CatBoost also has quick training times and can
handle huge datasets.
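To make the loss that boosting minimises concrete, the sketch below computes the binary log loss (cross-entropy) by hand. It does not call CatBoost itself; the labels and probabilities are made-up values, and the clipping constant eps is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]  # probabilities close to the true labels
unsure = [0.6, 0.5, 0.5]     # probabilities close to 0.5

# Confident, correct predictions give the lower loss.
print(log_loss(labels, confident) < log_loss(labels, unsure))
```

Each boosting iteration adds a tree that pushes the predicted probabilities closer to the true labels, which lowers this value.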
Fig 4.5 Classification Report of Catboost Algorithm
• Diversity: Each tree is unique because not all characteristics, elements, or features are
taken into account when building a particular tree.
• Immune to the dimensionality curse: The feature space is reduced since each tree only
considers a subset of the features.
• Parallelization: Each tree is generated independently using different data and attributes.
This means that building random forests can utilise the CPU to its fullest potential.
• Stability: The result is stable because it is based on a majority vote or an average.
The Random Forest Classifier runs new input text data through each of the forest's
decision trees. Each decision tree offers a prediction based on the sentiment of the input
text data. To arrive at the final prediction, the probabilities from all of the decision
trees in the forest are averaged, or their predictions are combined by majority vote.
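The two ways of combining tree outputs described above can be sketched as follows. The per-tree probabilities are made-up values for a single tweet, and the three class names are assumptions for illustration.

```python
from collections import Counter


def average_probabilities(tree_probs):
    """Average per-class probabilities across trees (soft voting)."""
    classes = tree_probs[0].keys()
    n = len(tree_probs)
    return {c: sum(p[c] for p in tree_probs) / n for c in classes}


def majority_vote(tree_labels):
    """Return the label predicted by the most trees (hard voting)."""
    return Counter(tree_labels).most_common(1)[0][0]


# Hypothetical outputs of three trees for one tweet.
tree_probs = [
    {"negative": 0.2, "neutral": 0.3, "positive": 0.5},
    {"negative": 0.1, "neutral": 0.2, "positive": 0.7},
    {"negative": 0.6, "neutral": 0.3, "positive": 0.1},
]

avg = average_probabilities(tree_probs)
soft_label = max(avg, key=avg.get)
hard_label = majority_vote([max(p, key=p.get) for p in tree_probs])
print(soft_label, hard_label)
```

In this example both strategies agree, although they can differ when a few trees are very confident and the rest are split.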
With an accuracy rating of 88.03%, Random Forest Classifier was picked as the algorithm
to use in building our model.
Fig 4.8 Classification Report of Random Forest
Fig 4.9 Confusion Matrix - Random Forest
4.1.6 Deployment Using Flask
Flask:
CHAPTER 5
5.2 EDA Of Visualization
5.3 Training and Testing Algorithms
5.3.2 Catboost
5.3.3 Random forest
5.4 Deployment Using Flask
CHAPTER 6
RESULTS AND DISCUSSIONS
This section discusses the overall experimental results obtained after comparing the
machine learning classification models. According to the Classification Accuracy figure,
Random Forest has the highest accuracy at 88.32%. CatBoost has
the second-highest accuracy at 85.73%. The accuracy of Logistic Regression is slightly
lower than that of the CatBoost classifier, i.e. 85.54%.
Algorithm              Accuracy (%)
Logistic Regression    84.935065
Linear Regression      58.961039
Random Forest          88.051948
Decision Tree          74.805195
Catboost               84.935065
The accuracy scores of the models are compared in the table above. We can
observe from the table that the accuracy score of the Random Forest model is the
highest.
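For reference, the evaluation metrics used in this report (accuracy, precision, recall, and F1-score) can be computed from confusion-matrix counts as sketched below. The sketch assumes a binary case, and the counts are made up for illustration; they are not the report's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration; not the report's results.
m = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(m)
```

For a multi-class problem, the same quantities are typically computed per class and then averaged.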
In additional experiments, the LSTM model showed results comparable to the
Random Forest Classifier. Even though LSTM is more commonly used to handle NLP
problems, it is more prone to overfitting and extremely computationally expensive
to train. Hence the Random Forest Classifier has been chosen as the algorithm used to
train our model. However, if a much larger dataset needs to be handled in the future, LSTM
may be preferred because it can learn long-term dependencies between inputs and
outputs, which is often necessary in complex problems.
The model uses a phonetics-based stemming process, since lemmatization takes too long
and the resulting drop in accuracy is not significant. However, for a smaller training
dataset, lemmatization could be used.
The analytical process began with data preparation and processing, missing-value analysis,
and exploratory analysis, followed by model construction and evaluation. Random Forest
achieved the best accuracy among the algorithms compared on a public test set, with an
accuracy of 88.32%. It is therefore employed in the application, which identifies the
sentiment in the text of a tweet or a collection of tweets in order to reach a consensus
about the overall sentiment they express.
CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENT
7.1 Conclusion
• Emerging technologies like machine learning and NLP can play an important part
in preventing the spread of misinformation.
• The process began with data cleaning and preprocessing, followed by exploratory data
analysis, model building and training, and finally evaluation.
• The algorithm with the best accuracy score was chosen to determine the
sentiment of the tweets.
7.2 Future Work
• We are keeping an eye out for the development of an even better algorithm that can
classify the sentiments with greater accuracy and efficiency.
• We want to maintain a real-time dataset that can be used to re-train the model at regular
intervals so as to keep our model updated.
REFERENCES
9) Rohitash Chandra, Aswin Krishna, “COVID-19 sentiment analysis via deep
learning during the rise of novel cases”, PLoS ONE 16(8): e0255615, 2021
10) Rubul Kumar Bania, “COVID-19 Public Tweets Sentiment Analysis using TF-IDF
and Inductive Learning Models”, INFOCOMP, v. 19, no. 2, p. 23-41, December, 2020
11) Muhammad Ali Shaikh, “Covid-19 Twitter Sentiment Analysis Using Machine
Learning”, University of Victoria, 2022
13) Khairiyah Mohamed Ridhwan, Carol Anne Hargreaves, “Leveraging Twitter data to
understand public sentiment for the COVID‐19 outbreak in Singapore”, International
Journal of Information Management Data Insights 1 (2021) 100021
14) Jim Samuel, G. G. Md. Nawaz Ali, Md. Mokhlesur Rahman, Ek Esawi and Yana
Samuel, “COVID-19 Public Sentiment Insights and Machine Learning for Tweets
Classification”, Information 2020, 11, 314
APPENDIX A
CONFERENCE PRESENTATION
at SRM.
APPENDIX B
PUBLICATION DETAILS
APPENDIX C
PLAGIARISM REPORT