Prevalent Disease Tweet Classification Using Sentimental Analysis

A PROJECT REPORT

Submitted by

Rahul S RA1911003040114
Kevin Kurien Joseph RA1911003040040
Siddhaarthan V RA1911003040098

Under the Guidance of

Ms. M. Indumathy
(Assistant Professor, Department of Computer Science and Engineering)

in partial fulfillment of the requirements for the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
of
FACULTY OF ENGINEERING AND TECHNOLOGY

Department of Computer Science And Engineering

Vadapalani Campus, Chennai

MAY 2023
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

BONAFIDE CERTIFICATE

Certified that this 18CSP109L project report titled “PREVALENT DISEASE TWEET
CLASSIFICATION USING SENTIMENTAL ANALYSIS” is the bonafide work of “RAHUL S
[RA1911003040114], KEVIN KURIEN JOSEPH [RA1911003040040], SIDDHAARTHAN V
[RA1911003040098]” who carried out the project work under my supervision. Certified
further that, to the best of my knowledge, the work reported herein does not form part of
any other project report or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.

GUIDE
Ms. M. Indumathy, M.Tech., (Ph.D.)
Assistant Professor
Dept. of Computer Science & Engg.
Vadapalani Campus

HEAD OF THE DEPARTMENT
Dr. S. Prasanna Devi, B.E., M.E., Ph.D., PGDHRM, PDF (IISc)
Professor and Head
Dept. of Computer Science & Engg., SRMIST
Vadapalani Campus

INTERNAL EXAMINER                    EXTERNAL EXAMINER

Department of Computer Science and Engineering
SRM Institute of Science and Technology
Own Work Declaration Form

This sheet must be filled in, signed, and dated, along with the student registration
number; work will not be marked unless this is done.
To be completed by the student for all assessments

Degree/ Course :

Student Name :

Registration Number :

Title of Work :

I / We hereby certify that this assessment complies with the University’s Rules
and Regulations relating to Academic misconduct and plagiarism**, as listed in
the University Website, Regulations, and the Education Committee guidelines.

I / We confirm that all the work contained in this assessment is my / our own
except where indicated, and that I / We have met the following conditions:

• Clearly referenced / listed all sources as appropriate

• Referenced and put in inverted commas all quoted text (from books, web, etc)
• Given the sources of all pictures, data etc. that are not my own
• Not made any use of the report(s) or essay(s) of any other student(s) either past
or present
• Acknowledged in appropriate places any help that I have received
from others (e.g. fellow students, technicians, statisticians, external
sources)
• Complied with any other plagiarism criteria specified in the Course
handbook / University website

I understand that any false claim for this work will be penalized in
accordance with the University policies and regulations.
DECLARATION:
I am aware of and understand the University’s policy on Academic misconduct and plagiarism
and I certify that this assessment is my / our own work, except where indicated by referring,
and that I have followed the good academic practices noted above.

If you are working in a group, please write your registration numbers and sign with
the date for every student in your group.

ACKNOWLEDGEMENT

We express our humble gratitude to our Honorable Chancellor, Dr. T. R. Paarivendhar,
Pro Chancellor (Administration), Dr. Ravi Pachamoothoo, and Pro Chancellor (Academic),
Dr. P. Sathyanarayanan, for the facilities extended for the completion of the project work.

We would like to record our sincere gratitude to our Vice Chancellor, Dr. C.
Muthamizhchelvan, and Registrar, Dr. S. Ponnusamy, for supporting the completion of our
project work by putting an excellent academic support system in place. We extend our
sincere thanks to our Dean, Dr. C V Jayakumar, Vice Principal (Academics), Dr. C. Gomathy,
and Vice Principal (Examination), Dr. S. Karthikeyan, for their invaluable support.

We wish to thank Dr. S Prasanna Devi, Professor & Head, Department of CSE, SRM
Institute of Science and Technology, Vadapalani Campus, for her valuable suggestions and
encouragement throughout the period of the project work and the course.

We are extremely grateful to our Project Coordinator, Ms. R. Deepa, Associate
Professor, Department of CSE, SRM Institute of Science and Technology, Vadapalani
Campus, for leading and helping us to complete our course.

We extend our gratitude to our guide, Ms. M. Indumathy, Assistant Professor,
Department of CSE, SRM Institute of Science and Technology, Vadapalani Campus, for
providing us an opportunity to pursue our project under her mentorship. She gave us
the freedom and support to explore the research topics of our interest.

We sincerely thank all faculty of DCSE, staff and students of the department who have
directly or indirectly helped our project.

Finally, we would like to thank our parents, our family members and our friends for their
unconditional love, constant support and encouragement.

Rahul S
Kevin Kurien Joseph
Siddhaarthan V

ABSTRACT

Interest in the opinions people express in tweets has grown recently, driven by the opinionated
data available on blogs and social media platforms. With everyone having access to the internet
and the ability to tweet whatever they want, Twitter is one of the most widely used information
sources in the world today. There is therefore a higher likelihood that people will be misled.
It would be very helpful to establish the general attitude of the populace, particularly during
times of panic like the pandemic, in order to better grasp how ready the populace is to face
such a potentially disastrous crisis. Analysing this kind of data manually is time-consuming,
and it is challenging to categorise opinions according to their polarity, so machine learning
methods are typically applied to such problems. This project was created specifically to analyse
the moods of Twitter users using ML techniques and NLP tools to determine their sentiment
regarding the reappearance of COVID-19. The perspectives of different people can diverge. We
train the system on historical data so that it can classify the emotions expressed in a text.
The Random Forest Classifier was chosen as the framework's preferred ML approach, with an
accuracy of 88%. We assessed the framework using a variety of metrics, including accuracy,
precision, recall, and F1-score.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

1 INTRODUCTION
    1.1 Overview
    1.2 Natural Language Processing
    1.3 Machine Learning
    1.4 Motivation
2 LITERATURE SURVEY
3 SYSTEM ARCHITECTURE AND DESIGN
    3.1 System Architecture
    3.2 Problem Statement
    3.3 Existing System
    3.4 Challenges of the Existing System
    3.5 Proposed System
    3.6 Architecture Diagram
    3.7 UML Diagrams
        3.7.1 Class Diagram
        3.7.2 Activity Diagram
        3.7.3 Sequence Diagram
        3.7.4 Entity Relationship Diagram
4 METHODOLOGY
    4.1 Modules
        4.1.1 Data Pre-processing
        4.1.2 EDA and Visualization
        4.1.3 Implementing Logistic Regression
        4.1.4 Implementing Catboost
        4.1.5 Implementing Random Forest
        4.1.6 Deployment Using Flask
5 CODING AND TESTING
    5.1 Data Pre-processing
    5.2 EDA and Visualization
    5.3 Training and Testing Algorithms
        5.3.1 Logistic Regression
        5.3.2 Catboost
        5.3.3 Random Forest
    5.4 Deployment Using Flask
6 RESULTS AND DISCUSSIONS
7 CONCLUSION AND FUTURE ENHANCEMENT
    7.1 Conclusion
    7.2 Future Work
REFERENCES
APPENDIX
    A CONFERENCE PUBLICATION
    B JOURNAL PUBLICATION
    C PLAGIARISM REPORT

LIST OF FIGURES

3.1 General Architecture Diagram
3.2 Random Forest Architecture Diagram
3.3 Structure of Random Forest
3.4 Class Diagram
3.5 Activity Diagram
3.6 Sequence Diagram
3.7 ERD Diagram
4.1 Implementation of Logistic Regression
4.2 Classification Report of Logistic Regression
4.3 Confusion Matrix - Logistic Regression
4.4 Implementation of Catboost Algorithm
4.5 Classification Report of Catboost Algorithm
4.6 Confusion Matrix - Catboost Algorithm
4.7 Implementation of Random Forest
4.8 Classification Report of Random Forest
4.9 Confusion Matrix - Random Forest
4.10 UI Screen of Input
4.11 UI Screen of Output
6.1 Accuracy Comparison Bar

LIST OF TABLES

7.1 Accuracy Table

CHAPTER 1

INTRODUCTION

1.1 Overview
The widespread transmission of the coronavirus sparked a variety of ideas and feelings.
Due to the nature of the COVID-19 pandemic, confusion and panic spread quickly
through society. People from many countries responded on social networking platforms
in different ways. The COVID-19 epidemic has highlighted urban dwellers'
vulnerabilities and constitutes a significant public health risk. During the pandemic,
people's emotions varied, which led to mental health problems including fear, worry,
and many other distressing symptoms. Twitter posts including phrases such as "updates
about confirmed cases," "COVID-19-related death," "early signs of the outbreak," "economic
impact," and "preventive measures" reveal feelings of dread and fear. Additionally, public
thoughts about COVID-19 news posted on microblogging platforms have the power to
propagate different points of view. In this work, we describe a sentiment analysis of
COVID-19 using natural language processing techniques. By examining news articles
and social media data, we hope to assess how the general public is feeling about COVID-19.
We examine the feelings people have about the pandemic and pinpoint the key issues
and themes that are influencing perception. User-generated content on social media is
significant since it can serve as an important information source during times of crisis,
and social media and microblogging platforms like Facebook and Twitter have been widely
adopted. Since the majority of the available data were from early 2020 to late 2021,
selecting a dataset at first presented us with some difficulties. We therefore built
our dataset from more than 2500 English-language tweets that were sent on
Covid-19 during October and December of 2022. From the word corpus, we identified the
most widely used words. Additionally, we computed sentiment scores for our preprocessed
tweets and used the polarity of each score to classify the dataset. Finally, we trained
our model using those tweets and their sentiment scores.

1.2 Natural Language Processing

Natural Language Processing, also known as NLP, is the study of how computers interact
with human language. It involves assessing, interpreting, and creating human language
using computational linguistics and machine learning methods. NLP attempts to make it
possible for machines to understand and interpret human language in order to improve
communication between humans and machines. It is utilized in a wide range of
applications, including chatbots, virtual assistants, speech recognition systems, sentiment
analysis, and machine translation. All of these applications use natural language
processing techniques to convert human language into a machine-readable format
and provide accurate responses.

NLP is a rapidly developing field that could greatly improve interaction between
machines and humans in the coming years. Advances in machine learning have greatly
improved the performance of NLP algorithms. Researchers continue to come up with new
techniques for processing and understanding human language.

The main challenge in NLP is handling the ambiguity of human language. Context is
crucial, as the exact same words can be interpreted in a variety of ways depending on
how they are used. NLP algorithms must be able to accurately identify the intended
meaning of words and phrases in order to produce meaningful responses. NLP involves
several subfields, including discourse analysis, syntactic analysis, and semantic analysis.
Syntactic analysis deals with analysing the grammatical structure of sentences to
identify the relationships between words. Understanding the meaning of words and
phrases, as well as the connections between them, is the goal of semantic analysis.

1.3 Machine Learning

Machine learning is a method that has proven indispensable in the study of NLP, and has
been used for a number of tasks, including the classification of sentiments. One of the key
benefits of using machine learning in sentiment classification is its ability to process and
examine large sets of raw input rapidly with decent accuracy, allowing for more effective
and efficient identification and classification of tweet sentiments. Additionally, ML
models can be trained to recognize patterns and relationships in the data, enabling them
to improve their accuracy over time. This is particularly important as the accuracy needs
to be maintained even if the size of the input data increases substantially.

The use of machine learning also allows for the development of more nuanced and
comprehensive classification models. The multi-sentiment model presented in this
paper offers users a more flexible and adaptable approach. On the whole, the
significance of ML-based tweet sentiment classification lies in its ability to process
and examine vast amounts of data rapidly and accurately, to be refined over time, and
to support more nuanced and comprehensive classification models.

The model used in this paper is the Random Forest Classifier, which is well suited
to processing the vectorized data. It produces its result by aggregating (voting
over or averaging) the outputs of multiple decision trees, which improves accuracy
and robustness compared with a single decision tree.

1.4 Motivation
We wanted to broaden our focus in order to better understand how the world felt
about Covid-19 in light of the unexpected increase in cases occurring in China at
the time. We use Twitter to gather a consensus on how the world feels about the
possibility of the disease's resurgence and about another possible global shutdown.
By reaching a consensus, we hope to convey to the user the general opinion held by
those who have posted a given tweet or other related tweets about the widespread
illness. The COVID-19 pandemic has also generated a great deal of false information
and fake news, which has left people perplexed and afraid. Sentiment analysis can
assist in identifying such inaccurate and misleading information and in countering
it by supplying the public with reliable and accurate information.

Finally, the main motivation behind analysing multiple sentiments at once is that
it can act as a poll of sorts, giving us an estimate of how the majority of people
feel about any prevailing topic on the platform, so that researchers and people
with influence can develop and modify their strategies based on public opinion.

CHAPTER 2

LITERATURE SURVEY

2.1 Title : Sentiment Analysis of Covid-19 Tweets using Evolutionary

Author: Arunava Kumar Chakraborty, Sourav Das
Year : 2021

A number of nations have declared total lockdown in an effort to contain the Covid-19
epidemic as it spreads fast over the globe and claims the lives of millions of people each
day. Due to the prevalence of social media as a means of emotional expression at this time
of lockdown, these channels have been crucial in the global dissemination of information
concerning the epidemic. The authors created an experimental methodology to examine
Twitter users' reactions while taking into account the phrases that are frequently used to
refer to the epidemic, either directly or indirectly. The study presents a sentiment
analysis of a sizable quantity of tweets mentioning the coronavirus or Covid-19. To
evaluate the public opinion trend on topics pertinent to the Covid-19 epidemic, they first
apply an evolutionary classification followed by an n-gram analysis. The sentiment scores
for the collected tweets were then calculated based on each tweet's class. A long
short-term memory (LSTM) network was trained using two different types of rated tweets in
order to estimate sentiment on Covid-19 data, attaining an overall accuracy of 84.46%.

2.2 Title : Twitter Sentiment Analysis During Covid-19 In Florida

Author: Jingyi Li
Year : 2022

Several countries have enacted comprehensive lockdowns in an effort to contain the
Covid-19 epidemic, which is quickly spreading across the globe and claiming millions of
lives every day. These platforms have been essential in the global distribution of
information regarding the epidemic because social media is so widely used as a platform
for emotional expression during this period of lockdown. In order to investigate Twitter
users' responses, the authors developed an experimental methodology that took into account
the terms that are frequently used to allude to the epidemic, either directly or indirectly.
The study presents a sentiment analysis of a sizable quantity of tweets mentioning the
coronavirus or Covid-19. An evolutionary classification and then an n-gram analysis are
used to assess the public opinion trend on issues relevant to the Covid-19 outbreak. The
sentiment scores for the amassed tweets were then determined using the class of each
tweet. In order to measure sentiment using Covid-19 data, a long short-term memory (LSTM)
network was trained using two different types of rated tweets, achieving an overall
accuracy of 84.46%.

2.3 Title : Public Sentiment Analysis on Twitter Data during COVID-19 Outbreak
Author: Mohammad Abu Kausar, Arockiasamy Soosaimanickam
Year : 2021

The global community is now dealing with the COVID-19 pandemic, sometimes referred
to as the coronavirus pandemic. The outbreak was first identified in Wuhan, China, in
December 2019. On March 11, 2020, the World Health Organisation declared it a
pandemic. The COVID-19 virus killed hundreds of thousands of people and infected
millions of others in the US, Brazil, Russia, India, and other nations. Millions of people
are still being affected by this pandemic, and several nations have implemented either a
partial or total lockdown. During this lockdown, people expressed their feelings and
views via social media in an effort to calm themselves. In this study, tweets from people
in the top ten affected countries and one Gulf country, the Sultanate of Oman, were
analysed for sentiment. A dataset of more than 50,000
tweets with hashtags like #covid-19, #COVID19, #CORONAVIRUS, #CORONA,
#StayHomeStaySafe, #Stay Home, #Covid_19, #CovidPandemic, #covid19, #Corona
Virus, #Lockdown, #Qurantine, #qurantine, #Coronavirus Outbreak, #COVID etc.
posted between June 21, 2020 and July 20, 2020 was considered in this research. The
tweets in English were used as the source for the sentiment analysis. The study was done
to learn how individuals in the many affected nations view the problem. The tweets were
collected, pre-processed, and subjected to text mining algorithms, after which sentiment
analysis was performed and the findings were displayed. This study attempts to better
understand the perspectives of individuals from nations with COVID-19 infections.

2.4 Title : Understanding COVID-19 response by twitter users: A text analysis approach
Author: Digvijay Pandey, Subodh Wairya
Year : 2021

The COVID-19 epidemic is unlike any other in that it has caused a significant number of
fatalities. Thanks to Twitter, which has developed into an important platform for public
interactions, researchers now have the opportunity to investigate how the general public
is reacting to the outbreak. The researchers examined over 100,000 tweets with the hashtags
#coronavirus, #coronavirusoutbreak, #coronavirusPandemic, #COVID19, #COVID-19,
#epitwitter, #ihavecorona, #StayHomeStaySafe, and #TestTraceIsolate. Python, Google
NLP, NVivo, and other tools were used to conduct sentiment analysis and theme analysis.
According to the data, 29.56% of tweets were positive, 29.54% were mixed, 23.23% were
neutral, and 18.06% were negative. Frequently occurring terms include "cases", "home",
"people", and "help". The study uncovered 30 topics and grouped them into three
categories: public health, COVID-19 globally, and the number of cases/deaths. This study
shows how Twitter data and an NLP method may be used to examine public discourse and
opinions during the COVID-19 pandemic. Real-time analysis can reduce false signals and
improve the effectiveness of giving people the right guidance.

2.5 Title : Twitter Based Sentiment Analysis Of Impact Of Covid-19 On Education Globally
Author: Swetha Sree Cheeti, Yanyan Li
Year : 2021

The global spread of COVID-19 has had a serious impact on the educational system. The
study presents a thorough sentiment analysis of education-related tweets posted on
Twitter and draws conclusions on how the epidemic affected people's emotions as it
progressed over the course of the months. More than 90,000 tweets were collected from
Twitter about the circumstances underlying the change in educational systems around the
world. The Natural Language Toolkit (NLTK) and a Naive Bayes classifier were used to
perform sentiment analysis on the dataset. The results of this investigation show how
COVID-19 affected education and how people's attitudes changed as a result of the
modifications made to the educational system. The aim is to provide a deeper knowledge of
people's attitudes towards education in order to better tackle the epidemic under these
extraordinary conditions.

CHAPTER 3

SYSTEM ARCHITECTURE AND DESIGN

3.1 System Architecture

The initial task is to pre-process the Covid-19 dataset to remove unnecessary
fields or attributes. The dataset is then explored and visualized to check for
irregularities. NLP algorithms convert the data into a machine-readable format,
which is then used to train the Random Forest Classifier to classify the data into
5 different classes using the principle of bagging. Bagging chooses a random sample
(random subset) from the entire dataset. Hence each model is generated from samples
(bootstrap samples) drawn from the original data with replacement, known as row
sampling. Finally, the model is deployed using Flask.
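
The row-sampling step described above can be illustrated with a minimal sketch. It uses only the Python standard library; the toy rows, the seed, and the helper name bootstrap_sample are assumptions made for illustration and are not taken from the project code.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random WITH replacement (row sampling)."""
    return [rng.choice(rows) for _ in rows]

# Hypothetical toy dataset: (tweet id, sentiment class) pairs, classes 0-4.
rows = [("t1", 0), ("t2", 1), ("t3", 2), ("t4", 3), ("t5", 4)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [bootstrap_sample(rows, rng) for _ in range(3)]  # one sample per tree

for i, sample in enumerate(samples):
    # Some rows typically repeat and others are left out of a given sample.
    print(i, [tweet_id for tweet_id, _ in sample])
```

Because each sample is drawn with replacement, every tree sees a slightly different view of the data, which is what makes the trees in the ensemble differ from one another.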

3.2 Problem Statement

Twitter is considered to be one of the most accessible sources of information in a
world that is moving towards a future where anybody with an active internet
connection can tweet whatever they feel like and put their opinions out to the
world. The probability of people being misguided by personal opinions that are not
entirely based on facts is therefore ever increasing. It is challenging to
categorize opinions according to their sentiments. This project was created
specifically to analyze the moods of Twitter users using ML techniques and NLP
tools to determine their sentiment regarding the reappearance of COVID-19. We
therefore train the model on past data so that it can classify the emotions
expressed in a text. In this project, machine learning algorithms play a crucial
role in performing the classification.

3.3 Existing System

The existing systems mostly rely either on rudimentary algorithms, such as
bag-of-words (BoW) decision trees, or on complex methods such as LSTM-based
classification, which is built on neural networks and has high time complexity.
They apply lemmatization when tokenizing the text and eliminating stop words. They
use the Plotly package in Python to visualize the variations in the dataset.

3.4 Challenges of the Existing System

Existing systems sacrifice either accuracy for speed or speed for accuracy. They
use lemmatization, which is slow compared with stemming. The BoW decision tree is
not generally recommended for sentiment analysis as it has a tendency to overfit
the data, and LSTM models are not recommended for smaller datasets and are also
prone to overfitting.

3.5 Proposed System

The prevalent disease tweet sentiment analysis system is a model used to determine
the overall consensus of the sentiment shared by the people who have posted their
opinions on Twitter. Our framework is a hybrid one: it combines an NLP-based
technique for tokenizing the tweets with a supervised ML technique, the Random
Forest Classifier, for tweet sentiment classification. We also check for data
imbalance and use the RandomOverSampler to rebalance the data. Stemming is used
rather than lemmatization to improve the speed of the model. The Porter stemmer
(PS) of the NLTK package is used for this purpose; it is a rule-based,
suffix-stripping stemmer. The Random Forest Classifier's hyperparameters have been
tuned extensively to maximize accuracy. We have chosen to deploy the model using
Flask since it is lightweight.
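
The rebalancing idea can be sketched as follows. The report names RandomOverSampler as its tool; the code below is not that library's implementation but a standard-library stand-in showing the general technique of random oversampling, in which minority-class rows are duplicated at random until every class is as large as the biggest one. The function name, toy data, and seed are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Extra rows are copies of existing rows; no new tweets are invented.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Hypothetical imbalanced toy data: class 0 dominates.
texts = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 1, 2]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))
```

A common precaution with this technique is to oversample only the training split, so that duplicated rows cannot appear in both the training and the test data.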

3.6 Architecture Diagram

Dataset:-
The dataset, containing tweets about Covid-19 from 2021 onwards, is filtered and cleaned
to remove unnecessary fields and additional attributes. The dataset was acquired from
Kaggle.

Fig 3.1: General Architecture Diagram

Random Forest Classifier:-

Fig 3.2: Random Forest Architecture Diagram

Random Forest is one of the most popular algorithms used by data scientists. It is
a supervised machine learning algorithm that is widely used in classification and
regression problems. Random Forest works on the bagging principle. Bagging chooses
a random sample (random subset) from the entire dataset. Hence each model is
generated from samples (bootstrap samples) drawn from the original data with
replacement, known as row sampling. Random Forest is a classifier that contains a
number of decision trees built on various subsets of the given dataset and
combines their outputs, by majority vote or averaging, to improve predictive
accuracy.

Important Features of Random Forest

• Diversity: Not all attributes/variables/features are considered while making
an individual tree; each tree is different.
• Less affected by the curse of dimensionality: Since each tree does not consider
all the features, the feature space is reduced.
• Parallelization: Each tree is created independently out of different data and
attributes. This means the trees can be built in parallel across all available CPU cores.
• Stability: Stability arises because the result is based on majority voting/
averaging.
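
Two of the features listed above, diversity through feature subsets and stability through majority voting, can be sketched in a few lines. This is an illustration under stated assumptions rather than the project's model: the helper names, the feature counts, and the example votes are hypothetical, and library implementations may differ in detail (scikit-learn, for instance, draws a feature subset at each split and averages class probabilities across trees).

```python
import random
from collections import Counter

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree (the source of diversity)."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(7)
# Five hypothetical trees, each restricted to 3 of 10 features.
subsets = [random_feature_subset(n_features=10, k=3, rng=rng) for _ in range(5)]

# Hypothetical votes from the five trees for one tweet, classes 0-4.
votes = [2, 2, 4, 2, 1]
print(majority_vote(votes))  # class 2 wins with three of the five votes
```

A single tree that misclassifies the tweet (here the votes for classes 4 and 1) is outvoted by the rest of the forest, which is why the ensemble result is more stable than any individual tree.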

Fig 3.3 Structure of Random forest

3.7 UML Diagrams

3.7.1 Class Diagram

Fig 3.4 Class Diagram

3.7.2 Activity Diagram:

Fig 3.5 Activity Diagram

3.7.3 Sequence Diagram

Fig 3.6 Sequence Diagram

3.7.4 Entity Relationship Diagram (ERD):

Fig 3.7 ERD Diagram

CHAPTER 4

METHODOLOGY

4.1 MODULES

The Project has been divided into 6 major modules:

● MODULE 1 - Data Pre-processing
● MODULE 2 - Data Analysis and Visualization
● MODULE 3 - Implementing Logistic Regression Algorithm
● MODULE 4 - Implementing Catboost algorithm
● MODULE 5 - Implementing Random Forest Algorithm
● MODULE 6 - Deployment Using Flask

4.1.1 Data Pre-processing

1) Stop-word and noise removal: With the help of the Python-based NLP toolkit (NLTK), we
eliminated stop words like "over," "under," "again," "further," "then," "once," and
"there." Using regular expressions (RegEx), we also eliminated URLs, punctuation
(*, $, and !), user mentions (@), and hashtags (#).

2) Case Folding: A term's polarity (positive, negative, or neutral) can be significantly
influenced by the uppercase and lowercase variations of that word. Depending on the
nature of the corpus and its intended use, some words in the text may be changed from
uppercase to lowercase or vice versa. Our dataset consists of short tweets that
typically use lowercase and follow a standard communication style. To perform case
folding, we converted the capitalised words in all tweets to lowercase.

3) Tokenization: We broke down each tweet message into tokens (words) so that machine
learning (ML) algorithms could process each word as a separate unit.

4) Stemming: Stemming is the process of reducing a word to its stem, also referred to as
its base or root form. Stemming seeks to normalise words so that related terms (like
"run", "runs", and "running") are viewed as one and the same. This can help text
analysis tasks like sentiment analysis, information retrieval, and text classification
perform better.
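
The four pre-processing steps above can be sketched in a few lines. This is a minimal,
standard-library illustration under stated assumptions: STOP_WORDS is a small
illustrative subset rather than the full NLTK list, naive_stem is a toy suffix stripper
standing in for the stemmer used in the project, and whole mention and hashtag tokens are
assumed to be dropped.

```python
import re

# Illustrative subset only; the project uses NLTK's English stop-word list.
STOP_WORDS = {"over", "under", "again", "further", "then", "once", "there"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def naive_stem(token):
    """Toy suffix stripper (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if suffix in ("ing", "ed") and len(stem) > 3 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return token


def preprocess(tweet):
    text = URL_RE.sub(" ", tweet)            # 1) remove URLs
    text = MENTION_HASHTAG_RE.sub(" ", text)  #    remove @mentions and #hashtags
    text = text.lower()                       # 2) case folding
    text = NON_LETTER_RE.sub(" ", text)       #    remove punctuation and digits
    tokens = text.split()                     # 3) tokenization
    # Stop-word removal followed by 4) stemming.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Once again, @WHO says vaccines are working! https://t.co/abc #covid"))
```

The order matters: URLs, mentions, and hashtags are removed before punctuation is
stripped, otherwise their fragments would survive as ordinary tokens.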

4.1.2 Data Analysis and Visualization

Data visualisation is a vital skill in applied statistics and machine learning. Statistics
puts a lot of emphasis on numerical estimations and data descriptions, whereas data
visualisation offers a crucial set of tools for gaining qualitative insight. This can aid
in understanding the data distribution of the dataset. Visualisations can express and
convey crucial relationships through plots and charts that are more immediate and
pertinent to stakeholders than measures of association or relevance. They can even be
used to check for training data that are unrelated to the model.

4.1.3 Implementing Logistic Regression Algorithm

Despite what its name might imply, logistic regression is a statistical classification
method that produces the probability of the desired output class (the dependent
variable). Using the extracted features, the logistic regression model is trained on the
labelled dataset of tweets. During training, the model adjusts the weights of the
features to reduce the discrepancy between its predictions and the actual labels in the
training set. Once the model has been trained, the output of the linear model is passed
through the sigmoid function, which converts it to a probability value between 0 and 1.
The sigmoid function has the following formula:
sigmoid(z) = 1 / (1 + exp(-z)),
where z is the output of the linear model. Finally, the sentiment of a given tweet is
predicted using the probability value returned by the sigmoid function.
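
The prediction step can be illustrated with a small sketch. It assumes a binary
(positive vs. negative) setting and uses made-up weights; the names predict_proba and
predict_label, the weights, and the 0.5 threshold are assumptions for illustration, not
values taken from the trained model.

```python
import math


def sigmoid(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    # Linear model output z = bias + sum(w_i * x_i), squashed to (0, 1).
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(weights, bias, features, threshold=0.5):
    # 1 = positive sentiment, 0 = negative sentiment (assumed encoding).
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


weights = [1.0, -2.0]   # hypothetical learned weights
bias = 0.0
print(predict_proba(weights, bias, [2.0, 0.5]))   # z = 1.0
print(predict_label(weights, bias, [0.0, 1.0]))   # z = -2.0
```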

Fig 4.1 Implementation of Logistic Regression

Fig 4.2 Classification Report of Logistic Regression

Fig 4.3 Confusion Matrix – Logistic Regression

4.1.4 Implementing Catboost Algorithm

CatBoost is a gradient boosting method that can be used for sentiment analysis and other
binary or multi-class classification applications. Through the iterative construction of
a number of decision trees that estimate the sentiment of the tweet based on the input
features, CatBoost learns to minimise the log loss (cross-entropy loss) function during
training. CatBoost can also reveal which input features were most important for
predicting the sentiment of the tweets. This can be helpful for figuring out which
elements of the tweets have the greatest effect on sentiment. The handling of
categorical features, the handling of missing values, and the provision of feature
importance analysis are only a few of CatBoost's benefits for sentiment analysis.
CatBoost also has quick training times and can handle huge datasets.
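
The log loss that is minimised during training can be written out directly. The sketch
below is a plain-Python version of binary cross-entropy, not CatBoost's internal
implementation; the labels and probabilities are made-up values for illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
confident_and_right = [0.9, 0.1, 0.8]
confident_and_wrong = [0.1, 0.9, 0.2]
print(log_loss(labels, confident_and_right))
print(log_loss(labels, confident_and_wrong))
```

The loss grows sharply when the model assigns a high probability to the wrong class,
which is why each new tree is fitted to correct the largest remaining errors.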

Fig 4.4 Implementation of Catboost Algorithm

Fig 4.5 Classification Report of Catboost Algorithm

Fig 4.6 Confusion Matrix – Catboost Algorithm

4.1.5 Implementing Random forest Algorithm

Data scientists frequently employ the Random Forest algorithm, a popular supervised
machine learning method for classification and regression problems. Random Forest
combines different decision trees to produce an output determined by all the trees
together. The method is relatively resistant to overfitting and is quick, scalable,
robust to noise, and accurate.

Important Random Forest Features:

• Diversity: Not all characteristics, elements, or features are taken into account when
building a particular tree, so each tree is unique.

• Less affected by the dimensionality curse: The feature space is reduced since each tree
only considers a portion of the features.

• Parallelization: Each tree is individually generated using different data and attributes.
This means that building random forests can utilise the CPU to its fullest potential.

• Stability: There is stability since the result is based on a majority vote or an average.

The Random Forest Classifier runs new input text data through each of the forest's
decision trees. Each decision tree offers a prediction of the sentiment of the input
text. To arrive at the final prediction, the class probabilities from all the decision
trees in the forest are averaged, or their individual predictions are combined by
majority vote.
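
Both ways of combining the trees can be sketched briefly. The per-tree outputs below are
made-up values, and the helper names majority_vote and average_probabilities are
assumptions for illustration.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_probabilities(tree_probabilities):
    """Average per-class probabilities over all trees."""
    n = len(tree_probabilities)
    classes = tree_probabilities[0].keys()
    return {c: sum(p[c] for p in tree_probabilities) / n for c in classes}


# Hypothetical outputs of three trees for one tweet.
votes = ["positive", "negative", "positive"]
probs = [
    {"positive": 0.8, "negative": 0.2},
    {"positive": 0.4, "negative": 0.6},
    {"positive": 0.9, "negative": 0.1},
]

print(majority_vote(votes))
averaged = average_probabilities(probs)
print(max(averaged, key=averaged.get), averaged)
```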

Fig 4.7 Implementation of Random Forest

With an accuracy rating of 88.03%, Random Forest Classifier was picked as the algorithm
to use in building our model.

Fig 4.8 Classification Report of Random Forest

Fig 4.9 Confusion Matrix - Random Forest

4.1.6 Deployment Using Flask

Flask:

• It is a lightweight Python web framework which allows us to build medium-sized web
applications.
• Flask provides the basic tools and libraries needed to build a web application, such
as routing, templating, and handling HTTP requests and responses.
• Flask is used for the backend and is built on top of Python, which makes all the
Python features available.
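
A minimal sketch of how a prediction endpoint might look in Flask is given below. The
route name /predict, the JSON field tweet, and the function predict_sentiment are all
hypothetical; predict_sentiment is a keyword stub standing in for the trained Random
Forest model so that the example runs on its own.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_sentiment(text):
    """Hypothetical stand-in for the trained classifier."""
    return "negative" if "worried" in text.lower() else "positive"


@app.route("/predict", methods=["POST"])
def predict():
    # Routing and request handling are provided by Flask.
    payload = request.get_json(silent=True)
    tweet = payload.get("tweet", "") if isinstance(payload, dict) else ""
    if not isinstance(tweet, str) or not tweet.strip():
        return jsonify(error="field 'tweet' is required"), 400
    return jsonify(tweet=tweet, sentiment=predict_sentiment(tweet))
```

Saved as a module, an application like this can be started with the flask run command
and queried by sending a POST request with a JSON body to /predict.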

Fig 4.10 UI Screen of Input

Fig 4.11 UI Screen of Output

CHAPTER 5

CODING AND TESTING

5.1 Data Pre-processing

5.2 EDA Of Visualization

5.3 Training and Testing Algorithms

5.3.1 Logistic Regression

5.3.2 Catboost

5.3.3 Random forest

5.4 Deployment Using Flask

CHAPTER 6

RESULTS AND DISCUSSION

This section discusses the overall experimental results obtained after comparing the
machine learning classification models. According to the Classification Accuracy figure,
the accuracy of Random Forest is the highest at 88.32%. The accuracy of Catboost is the
second highest at 85.73%. The accuracy of Logistic Regression is slightly less than that
of the Catboost classifier, at 85.54%.

Algorithm              Accuracy (%)
Logistic Regression    84.935065
Linear Regression      58.961039
Random Forest          88.051948
Decision Tree          74.805195
Catboost               84.935065

Table 1: Accuracy Table

The accuracy scores of the models are compared above. We can observe from the table
that the Random Forest model has the highest accuracy score.
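
The comparison can be reproduced with a short sketch. The scores are copied from Table 1;
the accuracy helper and the toy label lists are assumptions added to show how such a
score is computed, not the project's evaluation code.

```python
# Accuracy values (in percent) as listed in Table 1.
accuracy_scores = {
    "Logistic Regression": 84.935065,
    "Linear Regression": 58.961039,
    "Random Forest": 88.051948,
    "Decision Tree": 74.805195,
    "Catboost": 84.935065,
}


def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Rank the models from best to worst.
ranked = sorted(accuracy_scores.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranked[0]
print(best_model, round(best_score, 2))

# Toy example of the metric itself: 3 of 4 predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```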

Fig 6.1 Accuracy Comparison Bar

Additional research showed that the LSTM model gives results comparable to the Random
Forest Classifier. Even though LSTM is more commonly used for NLP problems, it is more
prone to overfitting and extremely computationally expensive to train. Hence the Random
Forest Classifier has been chosen as the algorithm used to train our model. However, if
a much larger dataset needs to be handled in the future, LSTM may be preferred because
it can learn long-term dependencies between inputs and outputs, which is often necessary
in complex problems.

The model uses a phonetics-based stemming process, since lemmatization takes too long
and the resulting drop in accuracy is not significant. However, in the case of a smaller
training dataset, lemmatization could be used.

The analytical process consisted of data preparation and processing, missing value
analysis, exploratory analysis, and model construction and evaluation. Of the algorithms
compared, Random Forest achieved the best accuracy on the test set, at 88.32%. It is
therefore employed in the application, which helps identify the sentiment expressed in
the text of a tweet or a collection of tweets and so arrive at the overall sentiment
across all the tweets.

CHAPTER 7

CONCLUSION AND FUTURE ENHANCEMENT

Conclusion

• Emerging Technologies like Machine Learning and NLP can play an important part
in preventing the spread of misinformation

• The process began with data cleaning and preprocessing, exploratory data analysis,
model building and training and finally evaluation

• The algorithm with the best accuracy score which can help to determine the
sentiment of the tweets was chosen.

• After implementation, it is concluded that the Random Forest classifier performs best
when compared to the other algorithms

Future Work

• We are keeping an eye out for the development of an even better algorithm that can
classify the sentiments with even better accuracy and efficiency

• We want to maintain a real-time dataset that can be used to re-train the model at
regular intervals so as to keep our model updated

• There is an ever-growing need for multilingual sentiment analysis that can accurately
classify across various languages

APPENDIX A

CONFERENCE
PRESENTATION

Our paper on PREVALENT DISEASE TWEET CLASSIFICATION USING SENTIMENTAL ANALYSIS was
presented at IRCICD 2023 conference held at SRM.

APPENDIX B

PUBLICATION DETAILS

APPENDIX C

PLAGIARISM REPORT
