A PROJECT REPORT
Submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
BY
POTLA SIRISHA TELAGAREDDI NAVYA
(21331A05E4) (21331A05H3)
This is to certify that the project report entitled “TWITTER SENTIMENT ANALYSIS
USING NAIVE BAYES CLASSIFIER ALGORITHM” being submitted by P
Sirisha (21331A05E4), T Navya (21331A05H3), Y Sasaank (21331A05I9), CH Shanmuk
vardhan (21335A05J8) in partial fulfillment for the award of the degree of “Bachelor of
Technology” in Computer Science and Engineering is a record of bonafide work done by
them under my supervision during the academic year 2023-2024.
External Examiner
DECLARATION
We hereby declare that the work presented in the dissertation entitled “TWITTER SENTIMENT
ANALYSIS USING NAIVE BAYES CLASSIFIER ALGORITHM” has been carried out by
us and submitted in partial fulfilment for the award of credits in Bachelor of Technology in
Computer Science and Engineering at MVGR College of Engineering (Autonomous),
affiliated to JNTUGV, Vizianagaram. The various contents incorporated in the dissertation
have not been submitted for the award of any degree of any other institution or university.
ABSTRACT
This project aims to analyze the sentiment of tweets using a Naive Bayes classifier. The project
involves collecting a dataset of tweets and preprocessing the data by removing stop words
and applying stemming. The Naive Bayes classifier is then trained on the dataset to classify the
tweets as positive, negative, or neutral. The project also aims to demonstrate the effectiveness
of sentiment analysis in understanding public opinion on Twitter and its potential applications
in various domains such as marketing, politics, and customer feedback analysis. The Natural
Language Toolkit (NLTK) is a popular Python library for natural language processing tasks,
including sentiment analysis. It provides a variety of tools for preprocessing and analyzing
text data.
CONTENTS
Page No
List of libraries Used 1
Problem statement 2
1. Introduction 3
1.1. Problem definition 3
2. Literature Survey 4
3. Theoretical Background 5
3.1. Machine learning 5
3.1.1. What is Machine Learning 5
3.1.2. Why Machine Learning 5
3.2. Machine Learning Models 5
3.2.1. Naïve Bayes 7
3.3. Confusion matrix and its metrics 7
3.3.1. Confusion matrix 7
3.3.2. Metrics- Accuracy, Precision, Recall, F1 Score, FPR 8
4. Approach Description 9
4.1. Understanding the Concept 9
4.2. Project approach 9
5. Data Exploration 10
5.1. Improvements 10
6. Modelling 12
6.1. Model Development 12
6.1.1. Naïve Bayes classifier 12
6.2. Model Evaluation 12
6.3. Implementation 13
7. Results and Conclusions 15
7.1 Results 15
7.2 Conclusions 16
References 17
Appendix A: Packages, Tools Used & Working Process 18
Python Programming Language 18
Libraries 18
NumPy 18
Pandas 19
Matplotlib 19
Seaborn 20
Sklearn 21
Appendix B: Source Code 22
List of Libraries
pandas (import pandas as pd)
zipfile (import zipfile)
requests (import requests)
io (import io)
sklearn.model_selection (from sklearn.model_selection import train_test_split)
seaborn (import seaborn as sns)
matplotlib.pyplot (import matplotlib.pyplot as plt)
nltk (import nltk)
re (import re)
flask (from flask import Flask, render_template, request)
PROBLEM STATEMENT
The problem addressed by the Twitter Sentiment Analysis using Naive Bayes Classifier
project is to develop a system that can accurately classify tweets as positive, negative, or
neutral based on their sentiment. The project aims to leverage natural language processing
techniques and machine learning algorithms to analyze the text data and extract meaningful
insights from it. The ultimate goal is to provide a tool that can help businesses and
organizations better understand their customers' opinions and preferences, and make data-
driven decisions accordingly.
CHAPTER 1
INTRODUCTION
Twitter Sentiment Analysis is the process of computationally identifying and categorizing
the opinions expressed in tweets in order to determine whether the writer’s
attitude towards a particular topic, product, etc. is positive, negative, or neutral.
What is Twitter sentiment analysis?
It's the process of using natural language processing (NLP) and machine learning (ML)
techniques to analyze the sentiment (positive, negative, or neutral) expressed in tweets.
Essentially, it allows us to interpret the emotions and opinions embedded within those 280-
character bursts of information.
Why analyze Twitter sentiment?
Twitter acts as a giant pulse check for the world, offering real-time insights into public
perception of various topics, brands, events, and figures.
1.1. PROBLEM DEFINITION
Sentiment analysis, also refers as opinion mining, is a sub machine learning task where we
want to determine which is the general sentiment of a given document. Using machine
learning techniques and natural language processing we can extract the subjective
information of a document and try to classify it according to its polarity such as positive,
neutral or negative. It is a really useful analysis since we could possibly determine the overall
opinion about a selling object, or predict stock markets for a given company like, if most
people think positive about it, possibly its stock markets will increase, and so on. Sentiment
analysis is actually far from to be solved since the language is very complex
(objectivity/subjectivity, negation, vocabulary, grammar...) but it is also why it is very
interesting to working on. In this project I choose to try to classify tweets from Twitter into
“positive” or “negative” sentiment by building a model based on probabilities. Twitter is a
microblogging website where people can share their feelings quickly and spontaneously by
sending a tweet limited by 140 characters. You can directly address a tweet to someone by
adding the target sign “@” or participate to a topic by adding an hastag “#” to your tweet.
Because of the usage of Twitter, it is a perfect source of data to determine the current overall
opinion about anything.
CHAPTER 2
LITERATURE SURVEY
Available Technologies
There are several technologies available for Twitter sentiment analysis, including natural
language processing (NLP), machine learning, and deep learning. These technologies use
various algorithms and techniques to analyze the sentiment of tweets and classify them as
positive, negative, or neutral.
Drawbacks
One of the main drawbacks of Twitter sentiment analysis is the potential for bias in the data.
This can occur when the dataset used for training the algorithm is not representative of the
population being analyzed. Additionally, the accuracy of the analysis can be affected by the
complexity of the language used in the tweets, as well as the context in which they are
posted.
Differences from Other Approaches
Our proposed approach differs from other methods of Twitter sentiment analysis in several
ways. First, we use a combination of NLP and machine learning techniques to analyze the
sentiment of tweets. This allows us to capture more nuanced aspects of the language used in
the tweets, and to identify patterns and trends that may not be apparent using other methods.
Additionally, our approach is designed to be more accurate and reliable, as it takes into
account the potential for bias in the data and uses advanced algorithms to analyze the
language used in the tweets.
CHAPTER 3
THEORETICAL BACKGROUND
3.1 MACHINE LEARNING
3.1.1 What is Machine Learning?
Machine learning is an application of AI that enables systems to learn and improve from
experience without being explicitly programmed. Machine learning focuses on developing
computer programs that can access data and use it to learn for themselves. Machine learning
can imitate intelligent human behaviour and is used to perform complex tasks in the way
humans solve problems. Machine learning models can be descriptive (using data to explain
what happened), predictive (forecasting what will happen), or prescriptive (suggesting what to do).
3.1.2 Why Machine Learning?
Machine learning involves computers learning from data provided so that they carry out
certain tasks. For more advanced tasks, it can be challenging for a human to manually create
the needed algorithms. In practice, it can turn out to be more effective to help the machine
develop its own algorithm, rather than having human programmers specify every needed step.
The discipline of machine learning employs various approaches to teach computers to
accomplish tasks where no fully satisfactory algorithm is available. In cases where vast
numbers of potential answers exist, one approach is to label some of the correct answers as
valid. This can then be used as training data for the computer to improve the algorithms it
uses to determine correct answers. The nearly limitless quantity of available data,
affordable data storage, and the growth of less expensive and more powerful processing have
propelled the growth of ML. Now many industries are developing more robust models
capable of analysing bigger and more complex data while delivering faster, more accurate
results on vast scales. ML tools enable organizations to more quickly identify profitable
opportunities and potential risks.
The practical applications of machine learning drive business results that can
dramatically affect a company’s bottom line. New techniques in the field are evolving
rapidly and have expanded the application of ML to nearly limitless possibilities. Industries
that depend on vast quantities of data, and need a system to analyse it efficiently and
accurately, have embraced ML as the best way to build models, strategize, and plan.
3.2. MACHINE LEARNING MODELS
3.2.1 Naïve Bayes Classifier
The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes'
theorem and used for solving classification problems.
The name comprises two words, Naïve and Bayes, which can be
described as:
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
Each feature individually contributes to identifying it as an apple, without depending
on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
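As a simple illustration with assumed numbers (not taken from the project data): suppose 40% of
tweets in a training set are positive, and the word “great” appears in 20% of positive tweets and in
5% of negative tweets. Then P(“great”) = 0.20 × 0.40 + 0.05 × 0.60 = 0.11, and
P(positive | “great”) = (0.20 × 0.40) / 0.11 ≈ 0.73, so a tweet containing “great” is more likely to
be positive.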
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification problems,
i.e., deciding which category a particular document belongs to, such as Sports, Politics,
Education, etc. The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular
word is present or not in a document. This model is also popular for document
classification tasks.
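As a small illustration of these three variants, the sketch below shows how each can be
instantiated with scikit-learn; the tiny feature matrices are made-up placeholders, not data from
this project.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 2], [0, 0, 4]])  # word counts per document
X_binary = (X_counts > 0).astype(int)                              # word present / absent
X_real = np.random.RandomState(0).randn(4, 3)                      # continuous features
y = np.array([1, 0, 1, 0])                                         # class labels

MultinomialNB().fit(X_counts, y)   # frequency-based features (document classification)
BernoulliNB().fit(X_binary, y)     # Boolean presence/absence features
GaussianNB().fit(X_real, y)        # continuous, normally distributed features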
3.3. CONFUSION MATRIX AND METRICS
3.3.1 Confusion matrix
A confusion matrix is a table that is often used to describe the performance of a classification
model on a set of test data for which the true values are known. All of the metrics below can be
calculated from its four basic counts: true positives (TP), true negatives (TN), false positives
(FP), and false negatives (FN).
3.3.2. Metrics
Accuracy
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly
predicted observations to the total observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations.
Precision = TP / (TP + FP)
Re-call/ Sensitivity/ True Positive Rate (TPR)
Recall is the ratio of correctly predicted positive observations to all observations in the actual
positive class.
Recall = TP / (TP + FN)
F1 Score
F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both
false positives and false negatives into account. Intuitively it is not as easy to understand as
accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class
distribution.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
False Positive Rate (FPR)
FPR tells us what proportion of the negative class got incorrectly classified by the classifier.
FPR = FP / (FP + TN)
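A minimal sketch showing how these metrics follow from the four counts, using illustrative
numbers rather than results from this project:

TP, TN, FP, FN = 50, 40, 10, 5   # illustrative counts only

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)                              # also called sensitivity or TPR
f1 = 2 * precision * recall / (precision + recall)
fpr = FP / (FP + TN)
print(accuracy, precision, recall, f1, fpr)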
CHAPTER 4
APPROACH DESCRIPTION
4.1. UNDERSTANDING THE CONCEPT
Twitter sentiment analysis is a powerful tool that can be used to gain insights into
public opinion, improve products and services, predict market trends, and monitor
brand reputation. By analyzing the sentiment of tweets, we can learn what people are
thinking and feeling about a particular topic, company, or event.
Understand public opinion: This is a broad, but powerful purpose. By analyzing the
sentiment of tweets on a particular topic, you can gain valuable insights into what
people are thinking and feeling about it. This could be useful for market
research, political campaigns, public relations, or simply understanding the social
landscape around a certain issue.
Improve your product or service: If you're a business owner or
entrepreneur, analyzing customer sentiment on Twitter can help you identify areas
where you can improve your offerings.
4.2. PROJECT APPROACH
Data Collection
• Gathered a large dataset of tweets using the Twitter API.
• Focused on tweets related to the identified problem to ensure relevance and accuracy
of sentiment analysis.
Preprocessing
• Cleaned the collected data by removing irrelevant information, such as URLs and
special characters.
• Tokenized the tweets into individual words to prepare them for analysis (a sketch of this step is shown below).
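A minimal sketch of this cleaning and tokenization step (the regular expressions are illustrative,
not the exact patterns used in the project):

import re

def clean_and_tokenize(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r'http\S+|www\.\S+', ' ', tweet)   # remove URLs
    tweet = re.sub(r'[^a-z\s]', ' ', tweet)           # remove special characters and digits
    return tweet.split()                              # split into individual words

print(clean_and_tokenize("Loving the new update!! https://t.co/xyz @dev"))
# ['loving', 'the', 'new', 'update', 'dev']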
Sentiment Analysis
• Utilized machine learning algorithms, such as Naive Bayes and Support Vector
Machines, to classify the sentiment of each tweet.
• Trained the models using labeled data to accurately predict sentiment.
CHAPTER 5
DATA EXPLORATION
5.1. IMPROVEMENTS:
Starting from the baseline, the goal is to improve the accuracy of the classifier, which is 0.77, in
order to better determine which tweets are positive and which are negative. There are several
ways of doing this, and we present only a few candidate improvements (some of which turn out
not to help). First, we could try to remove what are called stop words. Stop words usually refer
to the most common words in the English language (in our case) such as "the", "of", "to" and so
on. They do not convey any valuable information about the sentiment of a sentence, so it can be
useful to remove them from the tweets in order to keep only the words we are interested in. To
do this we use a list of 635 stop words that we found. The table below shows the most frequent
words in the data set with their counts.
From the table we can derive some interesting statistics, such as the number of times the tags
introduced in the preprocessing step appear.
Recall that ||url|| corresponds to URLs, ||target|| to Twitter usernames preceded by the symbol
“@”, ||not|| replaces negation words, and ||pos|| and ||neg|| replace positive and negative
smileys respectively. After removing the stop words and re-running the classifier, we lose 0.02
in accuracy compared to the previous result, and the number of false positives goes from
126305 to 154015. We conclude that stop words seem to be useful for our classification task,
and removing them does not represent an improvement. We could also try to stem the words in
the data set. Stemming is the process by which endings are removed from words in order to
remove things like tense or plurality. The stem form of a word may not exist in a dictionary
(unlike lemmatization). This technique allows us to unify words and reduce the dimensionality
of the dataset. It is not appropriate for all cases, but it can make it easier to connect different
tenses of a word and see whether they cover the same subject matter. It is also faster than
lemmatization (which removes inflectional endings only and returns the base or dictionary form
of a word, known as the lemma). Using NLTK, a Python library specialized in natural language
processing, we obtain the following results after stemming the words in the data set: we
actually lose 0.002 in accuracy compared to the baseline. We conclude that stemming does not
improve the classifier's accuracy and does not make any noticeable difference.
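A minimal sketch of the two experiments described above, using NLTK's built-in English
stop-word list and the Porter stemmer (the project used its own list of 635 stop words; the NLTK
list is a stand-in here):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

tokens = ['the', 'movies', 'were', 'amazing', 'loved', 'watching', 'them']
no_stops = [w for w in tokens if w not in stop_words]   # drop common function words
stemmed = [stemmer.stem(w) for w in no_stops]           # reduce words to their stems
print(no_stops)   # ['movies', 'amazing', 'loved', 'watching']
print(stemmed)    # ['movi', 'amaz', 'love', 'watch']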
CHAPTER 6
MODELLING
6.1. MODEL DEVELOPMENT
The classifier is built on Bayes' theorem, P(A|B) = P(B|A) × P(A) / P(B). Here,
P(A|B): Posterior probability which means it is the probability of hypothesis A on the
observed event B.
P(B|A): Likelihood probability which means it is the probability of the evidence given that
the probability of a hypothesis is true.
P(A): Prior Probability which means it is the probability of hypothesis before observing the
evidence.
P(B): Marginal Probability which means it is the probability of Evidence.
• The code downloads a dataset from a URL, preprocesses the text data by removing
stopwords, punctuation, and stemming, and then splits the data into training and
testing sets.
• It trains a Multinomial Naive Bayes classifier on the preprocessed training data and
evaluates its performance using accuracy, precision, recall, and F1 score metrics (a condensed
sketch of this step is shown below).
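A condensed sketch of this step, assuming preprocessed train and test DataFrames with 'text'
and 'sentiment' columns (the column names follow the listing in Appendix B; the appendix does
not show which vectorizer was used, so a bag-of-words CountVectorizer is assumed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

vectorizer = CountVectorizer()                      # bag-of-words features
X_train = vectorizer.fit_transform(train['text'])
X_test = vectorizer.transform(test['text'])

classifier = MultinomialNB()
classifier.fit(X_train, train['sentiment'])
y_pred = classifier.predict(X_test)

print("Accuracy:", accuracy_score(test['sentiment'], y_pred))
print("F1 Score:", f1_score(test['sentiment'], y_pred, average='weighted'))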
Error Analysis:
• The code identifies and prints misclassified examples (instances where the predicted
sentiment differs from the actual sentiment) from the test set; a short sketch follows below.
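A short sketch of this step, assuming the test DataFrame and predictions from the step above:

test = test.copy()
test['predicted_sentiment'] = y_pred
misclassified = test[test['sentiment'] != test['predicted_sentiment']]
print(misclassified[['text', 'sentiment', 'predicted_sentiment']].head())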
Flask Web Application:
• When the route is accessed, it returns the rendered template index.html, passing the
sentiment analysis result (sentiment_result) as context, as in the sketch below.
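A minimal sketch of such a route (sentiment_result is a placeholder here; the real application
fills it in from the model's output):

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/content/')
def index():
    sentiment_result = "Positive"   # placeholder value
    return render_template('index.html', sentiment_result=sentiment_result)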
Ngrok Integration:
The code includes functionality to start an Ngrok tunnel (start_ngrok) and to print the URL
(print_ngrok_url) after the tunnel is established.
• Install Software: Get and install libraries such as pandas, numpy, scikit-learn, nltk,
seaborn, matplotlib, and Flask on your computer.
• Create Files: Make a new Python file named "app.py" and a folder named "templates"
where you place an HTML file called "index.html."
• Copy Code: Copy the provided code into "app.py" for data preprocessing, model
training, and setting up a web application using Flask.
• Run Flask: Run your Flask application locally on your computer, which starts a web
server hosting your sentiment analysis app.
• Ngrok Tunnel: As Flask runs, it will automatically start an Ngrok tunnel, a tool that
exposes your local server to the internet.
• Access App: Use the Ngrok URL provided in your terminal or command prompt to
access your sentiment analysis app's interface. You can input text and get sentiment
analysis results displayed on the web page.
CHAPTER 7
RESULTS AND CONCLUSIONS
7.1. RESULTS
Model Performance Metrics:
Accuracy: Percentage of correctly classified instances among all instances.
Precision (weighted average): Precision measures the proportion of true positive predictions
among all positive predictions. The weighted average considers class imbalances.
Recall (weighted average): Recall measures the proportion of true positive predictions among
all actual positives. The weighted average considers class imbalances.
F1 Score (weighted average): F1 score is the harmonic mean of precision and recall. The
weighted average considers class imbalances.
Confusion Matrix:
Visual representation of the classifier's performance across different sentiment classes
(negative and positive).
Each cell shows the number of instances with a given actual sentiment (rows) and a given
predicted sentiment (columns).
Helps in identifying true positives, true negatives, false positives, and false negatives.
Error Analysis:
Identifies misclassified examples where the predicted sentiment does not match the actual
sentiment.
Shows the text, actual sentiment, and predicted sentiment for each misclassified example.
7.2. CONCLUSION
Nowadays, sentiment analysis, or opinion mining, is a hot topic in machine learning. We are
still far from detecting the sentiment of a corpus of texts very accurately because of the
complexity of the English language, and even more so if we consider other languages such as
Chinese. In this project we tried to show a basic way of classifying tweets into a positive or
negative category using Naive Bayes as a baseline, and how language models relate to Naive
Bayes and can produce better results. We could further improve our classifier by trying to
extract more features from the tweets, trying different kinds of features, tuning the parameters
of the Naïve Bayes classifier, or trying another classifier altogether.
REFERENCES
1. https://towardsdatascience.com/twitter-sentiment-analysis-classification-using-nltk-
python-fa912578614c
2. https://www.geeksforgeeks.org/naive-bayes-classifiers/
3. https://www.sciencedirect.com/science/article/pii/S1877050919305557
4. Youtube
Appendix A: Packages, Tools Used & Working Process
Python Programming language
Python is a high-level, interpreted programming language used especially for general-purpose
programming. Python features a dynamic type system and supports automatic memory
management.
It supports multiple programming paradigms, including object-oriented, functional and
procedural, and also has a large and comprehensive standard library. Python has two major
versions, Python 2 and Python 3.
This project uses the latest version, Python 3. The language uses memory management
techniques such as reference counting and a cycle-detecting garbage collector. One of its
features is late binding (dynamic name resolution), which binds method and variable names
during program execution.
Python's design supports some of the constructs used for functional programming in the Lisp
tradition. It provides functions and constructs such as filter, map, list comprehensions,
dictionaries, sets and generator expressions. The standard library includes two modules,
itertools and functools, that implement functional tools taken from languages such as
Standard ML.
Libraries
NumPy
NumPy is the basic package for scientific calculations and computations used along with
Python. NumPy was created in 2005 by Travis Oliphant. It is open source, so it can be used
freely. NumPy stands for Numerical Python, and it is used for working with arrays and
mathematical computations.
Using NumPy in Python gives functionality comparable to MATLAB: both are interpreted,
and both allow users to quickly write fast programs as long as most of the operations work on
arrays and matrices instead of scalars. NumPy is a library consisting of array objects and a
collection of routines for processing those arrays.
NumPy also provides functions for linear algebra, Fourier transforms, and matrix operations.
In a typical scenario, working with NumPy involves searching, joining, splitting and reshaping
arrays.
The syntax for importing the package is import numpy as np, which imports NumPy under
the alias np.
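A small illustrative example of the array operations described above:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)            # (2, 3)
print(a.reshape(3, 2))    # reshape into 3 rows and 2 columns
print(a.mean(axis=0))     # column-wise mean: [2.5 3.5 4.5]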
Pandas
Pandas is used whenever we work with matrix data, time-series data, or, most often, tabular
data. Pandas is an open-source library which provides high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.
It helps greatly in handling large amounts of data with the help of data structures like Series
and DataFrames, and it has built-in methods for reading and manipulating data in formats
such as CSV and HTML.
Simply put, pandas is used for data analysis and data manipulation, and our project works
mainly with DataFrame objects, where a DataFrame is a dedicated structure for
two-dimensional data consisting of rows and columns, similar to database tables and Excel
spreadsheets.
In our code we first import the pandas package under the alias pd and use pd to read the CSV
file into a DataFrame. In the subsequent steps we work on the DataFrames by manipulating
them, and we perform data cleaning by using functions on the DataFrames such as
df.isna().sum(). The whole code depends on the DataFrames obtained through pandas, so this
package plays a key role in our project.
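A small illustrative sketch of these DataFrame operations (the file name is a placeholder, not the
dataset used in the project):

import pandas as pd

df = pd.read_csv('tweets.csv')    # placeholder file name
print(df.head())                  # inspect the first rows
print(df.isna().sum())            # count missing values per column
df = df.dropna()                  # simple cleaning step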
Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension, NumPy. Matplotlib is most commonly used for visualization and data
exploration, presenting statistics clearly through different visual structures such as bar plots,
scatter plots, histograms, etc.
Matplotlib is the foundation for many visualization libraries, and it offers great flexibility with
regard to formatting and styling plots. We can freely choose how to display labels, grids,
legends, etc.
In our code we first import matplotlib.pyplot under the alias plt. This plt comes into play in the
exploratory data analysis part to analyze and summarize datasets using visual methods; we use
plt to add characteristics to figures such as the title, legends, and labels on the x and y axes,
and, to understand the data more clearly, we can also use different kinds of plots.
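A small illustrative example of a basic plot with matplotlib (the counts are made up):

import matplotlib.pyplot as plt

counts = {'Negative': 120, 'Positive': 180}   # illustrative class counts
plt.bar(list(counts.keys()), list(counts.values()))
plt.title('Tweet sentiment distribution')
plt.xlabel('Sentiment')
plt.ylabel('Number of tweets')
plt.show()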
Seaborn
Seaborn is used for drawing attractive statistical graphics with just a few lines of code. In
other words, seaborn is a data visualization library based on matplotlib and closely integrated
with pandas data structures in Python. Visualization is the central theme of Seaborn, which
helps in the exploration and understanding of data.
Plots are used for visualizing the relationship between variables. Those variables can be
numerical or categorical.
Using Seaborn, we can also plot wide varieties of plots like Distribution plots, Pie chart and
bar chart, Scatter plots, Pair plots, Heat maps.
In our code we use the seaborn library to visualize the model's results: sns.heatmap is used to
plot the confusion matrix of the classifier, which makes it easy to see how many tweets of each
actual sentiment were predicted as negative or positive.
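A small sketch of the heatmap usage described above, with an illustrative 2×2 matrix standing in
for the real confusion matrix:

import seaborn as sns
import matplotlib.pyplot as plt

matrix = [[90, 10], [15, 85]]   # illustrative counts, not project results
sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()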
Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python.
Scikit-learn is an efficient, beginner-friendly tool for predictive data analysis. It provides a
selection of tools for machine learning and statistical modeling, including classification,
regression, clustering and dimensionality reduction, via a consistent interface in Python. This
library is built upon other libraries such as NumPy, SciPy and Matplotlib.
Scikit learn is used when identifying to which category an object likely belongs, predicting
continuous values and grouping of similar objects into clusters.
In our code, sklearn plays an important role in the classification algorithm, the final result and
its performance; the accuracy of the algorithm is computed using sklearn. The modules
imported from the sklearn library are train_test_split, the Multinomial Naive Bayes classifier
(MultinomialNB), and the evaluation metrics. train_test_split splits the data into random
training and testing subsets. We used the Multinomial Naive Bayes classification algorithm to
classify the sentiment labels of the tweets. From sklearn.metrics we use accuracy_score,
precision_score, recall_score, f1_score and confusion_matrix to evaluate the model; the
confusion matrix in particular is used to assess the performance of the classification model
through measures such as precision, recall and the F1 score.
Supervised Learning algorithms − Supervised learning is one of the machine learning
approaches in which models are trained using correctly labelled training data, and on the basis
of that the models predict the output. Almost all of the popularly known supervised learning
algorithms, such as Linear Regression, Support Vector Machine (SVM), Decision Tree and
Naïve Bayes, are part of scikit-learn.
Unsupervised Learning algorithms − Unsupervised learning is another machine learning
approach, in which models are not supervised using training data. Instead, the model itself
finds the hidden patterns and insights in the given data.
Scikit-learn also has all the popular unsupervised learning algorithms, from clustering, factor
analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
Clustering − This model can be used to group unlabelled data.
Cross Validation − This process is used to check the accuracy of supervised models on unseen
data.
Dimensionality Reduction – Dimensionalities are nothing but attributes of the data. This
step helps in reducing the number of attributes in data which can be used further for tasks like
feature selection, visualization and summarization.
Ensemble methods – Ensemble means to combine. These methods combine various
predictions of multiple supervised models.
Feature extraction – This step is used to define attributes by extracting the features from the
dataset having data of any form.
Feature selection – The extracted feature set may contain many features, some of which are
not useful. Feature selection is the process of identifying the important features for the
creation of supervised models.
List of cells: a notebook contains three different types of cells: markdown (display text), code
(to execute), and output.
Appendix B: Sample Source Code with Execution
Source code:
import pandas as pd
import zipfile
import requests
import io
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
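# NOTE: the download/load step is missing from the original listing. The lines below are a
# hedged reconstruction; DATA_URL is a placeholder, not the URL used in the project, and the
# loaded DataFrame is assumed to have 'sentiment' and 'text' columns.
DATA_URL = "https://example.com/tweets.zip"
response = requests.get(DATA_URL)
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    data = pd.read_csv(archive.open(archive.namelist()[0]), encoding='latin-1')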
# Display the first few rows of the dataset to inspect its structure and column names
print(data.head())
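# The split into train and test sets is not shown in the original listing; an 80/20 split is assumed.
train, test = train_test_split(data, test_size=0.2, random_state=42)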
train = train[train.sentiment != 2]
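# The header and most of the body of the preprocessing function are cut off in the original
# listing; the lines below are a hedged reconstruction using NLTK stop words and a Porter
# stemmer (nltk.download('stopwords') may be required once).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    text = re.sub(r'http\S+|[^a-zA-Z\s]', ' ', str(text).lower())            # strip URLs and punctuation
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]   # remove stop words and stem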
    text = ' '.join(words)
    return text
# Apply preprocessing to the text data in your train and test sets
train['text'] = train['text'].apply(preprocess_text)
test['text'] = test['text'].apply(preprocess_text)
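# The vectorizer definition is not shown in the original listing; a bag-of-words CountVectorizer
# is assumed here.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()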
X_train = vectorizer.fit_transform(train['text'])
y_train = train['sentiment']
X_test = vectorizer.transform(test['text'])
y_test = test['sentiment']
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
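# The classifier is never instantiated or fitted in the original listing; this step is assumed.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)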
# Predictions
y_pred = classifier.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Blues", xticklabels=['Negative',
'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Error Analysis
test['predicted_sentiment'] = y_pred
misclassified = test[test['sentiment'] != test['predicted_sentiment']]
print("Misclassified examples:")
print(misclassified[['text', 'sentiment', 'predicted_sentiment']])
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted') * 100
recall = recall_score(y_test, y_pred, average='weighted') * 100
f1 = f1_score(y_test, y_pred, average='weighted') * 100
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Blues", xticklabels=['Negative',
'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Error Analysis
misclassified = test[test['sentiment'] != test['predicted_sentiment']]
print("Misclassified Examples:")
print(misclassified[['text', 'sentiment', 'predicted_sentiment']])
from flask import Flask, render_template, request
app = Flask(__name__)
@app.route('/content/')
def index():
    # Your sentiment analysis code here (assume 'sentiment_result' holds the data)
    return render_template('index.html', sentiment_result=sentiment_result)
import subprocess
from threading import Timer

def start_ngrok():
    ngrok_command = "ngrok http 5000"
    ngrok_proc = subprocess.Popen(ngrok_command.split(), stdout=subprocess.PIPE)
    # Print Ngrok URL after 2 seconds (adjust if necessary)
    Timer(2, print_ngrok_url, args=[ngrok_proc]).start()

def print_ngrok_url(proc):
    try:
        ngrok_url = proc.stdout.readlines()[1].strip().decode("utf-8")
        print("Ngrok URL:", ngrok_url)
    except Exception:
        print("Ngrok URL not found.")

if __name__ == '__main__':
    start_ngrok()  # Start Ngrok tunnel
    app.run()
Execution output: