A PROJECT REPORT
Submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology
in
Computer Science and Engineering
BY
POTLA SIRISHA TELAGAREDDI NAVYA
(21331A05E4) (21331A05H3)
This is to certify that the project report entitled “TWITTER SENTIMENT ANALYSIS
USING NAIVE BAYES CLASSIFIER ALGORITHM” being submitted by P
Sirisha (21331A05E4), T Navya (21331A05H3), Y Sasaank (21331A05I9), CH Shanmuk
vardhan (21335A05J8) in partial fulfillment for the award of the degree of “Bachelor of
Technology” in Computer Science and Engineering is a record of bonafide work done by
them under my supervision during the academic year 2023-2024.
External Examiner
DECLARATION
We hereby declare that the work presented in the dissertation entitled “TWITTER SENTIMENT
ANALYSIS USING NAIVE BAYES CLASSIFIER ALGORITHM” has been carried out by
us and submitted in partial fulfilment for the award of credits in Bachelor of Technology in
Computer Science and Engineering at MVGR College of Engineering (Autonomous),
affiliated to JNTUGV, Vizianagaram. The various contents incorporated in the dissertation
have not been submitted for the award of any degree of any other institution or university.
ABSTRACT
This project aims to analyze the sentiment of tweets using a Naive Bayes classifier. The project
involves collecting a dataset of tweets and preprocessing the data by removing stop words
and applying stemming. The Naive Bayes classifier is then trained on the dataset to classify the
tweets as positive, negative, or neutral. The project also aims to demonstrate the effectiveness
of sentiment analysis in understanding public opinion on Twitter and its potential applications
in various domains such as marketing, politics, and customer feedback analysis. The Natural
Language Toolkit (NLTK) is a popular Python library for natural language processing tasks,
including sentiment analysis. It provides a variety of tools for preprocessing and analyzing
text data.
CONTENTS
Page No
List of libraries Used 1
Problem statement 2
1. Introduction 3
1.1. Problem definition 3
2. Literature Survey 4
3. Theoretical Background 5
3.1. Machine learning 5
3.1.1. What is Machine Learning 5
3.1.2. Why Machine Learning 5
3.2. Machine Learning Models 5
3.2.1. Naïve Bayes 7
3.3. Confusion matrix and its metrics 7
3.3.1. Confusion matrix 7
3.3.2. Metrics- Accuracy, Precision, Recall, F1 Score, FPR 8
4. Approach Description 9
4.1. Understanding the Concept 9
4.2. Project approach 9
5. Data Exploration 10
5.1. Improvements 10
6. Modelling 12
6.1. Model Development 12
6.1.1. Naïve Bayes classifier 12
6.2. Model Evaluation 12
6.3. Implementation 13
7. Results and Conclusions 15
7.1 Results 15
7.2 Conclusions 16
References 17
Appendix A: Packages, Tools Used & Working Process 18
Python Programming Language 18
Libraries 18
NumPy 18
Pandas 19
Matplotlib 19
Seaborn 20
Sklearn 21
Appendix B: Source Code 22
List of Libraries
pandas (import pandas as pd)
zipfile (import zipfile)
requests (import requests)
io (import io)
sklearn.model_selection (from sklearn.model_selection import train_test_split)
seaborn (import seaborn as sns)
matplotlib.pyplot (import matplotlib.pyplot as plt)
nltk (import nltk)
re (import re)
flask (from flask import Flask, render_template, request)
PROBLEM STATEMENT
The problem addressed by the Twitter Sentiment Analysis using Naive Bayes Classifier
project is to develop a system that can accurately classify tweets as positive, negative, or
neutral based on their sentiment. The project aims to leverage natural language processing
techniques and machine learning algorithms to analyze the text data and extract meaningful
insights from it. The ultimate goal is to provide a tool that can help businesses and
organizations better understand their customers' opinions and preferences, and make data-
driven decisions accordingly.
CHAPTER 1
INTRODUCTION
Twitter Sentiment Analysis is the process of computationally identifying and categorizing
the opinions expressed in tweets in order to determine whether the writer’s
attitude towards a particular topic, product, etc. is positive, negative, or neutral.
What is Twitter sentiment analysis?
It's the process of using natural language processing (NLP) and machine learning (ML)
techniques to analyze the sentiment (positive, negative, or neutral) expressed in tweets.
Essentially, it allows us to interpret the emotions and opinions embedded within those 280-
character bursts of information.
Why analyze Twitter sentiment?
Twitter acts as a giant pulse check for the world, offering real-time insights into public
perception of various topics, brands, events, and figures.
1.1. PROBLEM DEFINITION
Sentiment analysis, also refers as opinion mining, is a sub machine learning task where we
want to determine which is the general sentiment of a given document. Using machine
learning techniques and natural language processing we can extract the subjective
information of a document and try to classify it according to its polarity such as positive,
neutral or negative. It is a really useful analysis since we could possibly determine the overall
opinion about a selling object, or predict stock markets for a given company like, if most
people think positive about it, possibly its stock markets will increase, and so on. Sentiment
analysis is actually far from to be solved since the language is very complex
(objectivity/subjectivity, negation, vocabulary, grammar...) but it is also why it is very
interesting to working on. In this project I choose to try to classify tweets from Twitter into
“positive” or “negative” sentiment by building a model based on probabilities. Twitter is a
microblogging website where people can share their feelings quickly and spontaneously by
sending a tweet limited by 140 characters. You can directly address a tweet to someone by
adding the target sign “@” or participate to a topic by adding an hastag “#” to your tweet.
Because of the usage of Twitter, it is a perfect source of data to determine the current overall
opinion about anything.
CHAPTER 2
LITERATURE SURVEY
Available Technologies
There are several technologies available for Twitter sentiment analysis, including natural
language processing (NLP), machine learning, and deep learning. These technologies use
various algorithms and techniques to analyze the sentiment of tweets and classify them as
positive, negative, or neutral.
Drawbacks
One of the main drawbacks of Twitter sentiment analysis is the potential for bias in the data.
This can occur when the dataset used for training the algorithm is not representative of the
population being analyzed. Additionally, the accuracy of the analysis can be affected by the
complexity of the language used in the tweets, as well as the context in which they are
posted.
Differences from Other Approaches
Our proposed approach differs from other methods of Twitter sentiment analysis in several
ways. First, we use a combination of NLP and machine learning techniques to analyze the
sentiment of tweets. This allows us to capture more nuanced aspects of the language used in
the tweets, and to identify patterns and trends that may not be apparent using other methods.
Additionally, our approach is designed to be more accurate and reliable, as it takes into
account the potential for bias in the data and uses advanced algorithms to analyze the
language used in the tweets.
CHAPTER 3
THEORETICAL BACKGROUND
3.1 MACHINE LEARNING
3.1.1 What is Machine Learning?
Machine learning is an application of AI that enables systems to learn and improve from
experience without being explicitly programmed. Machine learning focuses on developing
computer programs that can access data and use it to learn for themselves. Machine learning
can imitate intelligent human behaviour and is used to perform complex tasks in the way
humans solve problems. Machine learning models can be descriptive (using data to explain
what happened), predictive (forecasting what will happen), or prescriptive (suggesting what to do).
3.1.2 Why Machine Learning?
Machine learning involves computers learning from data provided so that they carry out
certain tasks. For more advanced tasks, it can be challenging for a human to manually create
the needed algorithms. In practice, it can turn out to be more effective to help the machine
develop its own algorithm, rather than having human programmers specify every needed step.
The discipline of machine learning employs various approaches to teach computers to
accomplish tasks where no fully satisfactory algorithm is available. In cases where vast
numbers of potential answers exist, one approach is to label some of the correct answers as
valid. This can then be used as training data for the computer to improve the algorithms it
uses to determine correct answers. The nearly limitless quantity of available data,
affordable data storage, and the growth of less expensive and more powerful processing have
propelled the growth of ML. Now many industries are developing more robust models
capable of analysing bigger and more complex data while delivering faster, more accurate
results on vast scales. ML tools enable organizations to more quickly identify profitable
opportunities and potential risks.
The practical applications of machine learning drive business results that can
dramatically affect a company’s bottom line. New techniques in the field are evolving
rapidly and have expanded the application of ML to nearly limitless possibilities. Industries
that depend on vast quantities of data, and need a system to analyse it efficiently and
accurately, have embraced ML as the best way to build models, strategize, and plan.
3.2. MACHINE LEARNING MODELS
3.2.1 Naïve Bayes Classifier
The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes'
theorem and used for solving classification problems.
The name comprises two words, Naïve and Bayes, which can be
described as:
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
Each feature individually contributes to identifying it as an apple, without depending
on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) × P(A) / P(B)
where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
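As a simple illustration with assumed numbers (not taken from the project data): suppose 40% of
tweets in a training set are positive, and the word “great” appears in 20% of positive tweets and in
5% of negative tweets. Then P(“great”) = 0.20 × 0.40 + 0.05 × 0.60 = 0.11, and
P(positive | “great”) = (0.20 × 0.40) / 0.11 ≈ 0.73, so a tweet containing “great” is more likely to
be positive.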
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification problems,
i.e., deciding which category a particular document belongs to, such as Sports, Politics,
Education, etc. The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular
word is present or not in a document. This model is also popular for document
classification tasks.
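As a small illustration of these three variants, the sketch below shows how each can be
instantiated with scikit-learn; the tiny feature matrices are made-up placeholders, not data from
this project.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 2], [0, 0, 4]])  # word counts per document
X_binary = (X_counts > 0).astype(int)                              # word present / absent
X_real = np.random.RandomState(0).randn(4, 3)                      # continuous features
y = np.array([1, 0, 1, 0])                                         # class labels

MultinomialNB().fit(X_counts, y)   # frequency-based features (document classification)
BernoulliNB().fit(X_binary, y)     # Boolean presence/absence features
GaussianNB().fit(X_real, y)        # continuous, normally distributed features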
3.3. CONFUSION MATRIX AND METRICS
3.3.1 Confusion matrix
A confusion matrix is a table that is often used to describe the performance of a classification
model on a set of test data for which the true values are known. All of the metrics below can be
calculated from its four basic counts: true positives (TP), true negatives (TN), false positives
(FP), and false negatives (FN).
3.3.2. Metrics
Accuracy
Accuracy is the most intuitive performance measure; it is simply the ratio of correctly
predicted observations to the total observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision is the ratio of correctly predicted positive observations to the total predicted
positive observations.
Precision = TP / (TP + FP)
Re-call/ Sensitivity/ True Positive Rate (TPR)
Recall is the ratio of correctly predicted positive observations to all observations in the actual
positive class.
Recall = TP / (TP + FN)
F1 Score
F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both
false positives and false negatives into account. Intuitively it is not as easy to understand as
accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class
distribution.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
False Positive Rate (FPR)
FPR tells us what proportion of the negative class got incorrectly classified by the classifier.
FPR = FP / (FP + TN)
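A minimal sketch showing how these metrics follow from the four counts, using illustrative
numbers rather than results from this project:

TP, TN, FP, FN = 50, 40, 10, 5   # illustrative counts only

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)                              # also called sensitivity or TPR
f1 = 2 * precision * recall / (precision + recall)
fpr = FP / (FP + TN)
print(accuracy, precision, recall, f1, fpr)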
CHAPTER 4
APPROACH DESCRIPTION
4.1. UNDERSTANDING THE CONCEPT
Twitter sentiment analysis is a powerful tool that can be used to gain insights into
public opinion, improve products and services, predict market trends, and monitor
brand reputation. By analyzing the sentiment of tweets, we can learn what people are
thinking and feeling about a particular topic, company, or event.
Understand public opinion: This is a broad, but powerful purpose. By analyzing the
sentiment of tweets on a particular topic, you can gain valuable insights into what
people are thinking and feeling about it. This could be useful for market
research, political campaigns, public relations, or simply understanding the social
landscape around a certain issue.
Improve your product or service: If you're a business owner or
entrepreneur, analyzing customer sentiment on Twitter can help you identify areas
where you can improve your offerings.
4.2. PROJECT APPROACH
Data Collection
• Gathered a large dataset of tweets using the Twitter API.
• Focused on tweets related to the identified problem to ensure relevance and accuracy
of sentiment analysis.
Preprocessing
• Cleaned the collected data by removing irrelevant information, such as URLs and
special characters.
• Tokenized the tweets into individual words to prepare them for analysis (a sketch of this step is shown below).
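A minimal sketch of this cleaning and tokenization step (the regular expressions are illustrative,
not the exact patterns used in the project):

import re

def clean_and_tokenize(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r'http\S+|www\.\S+', ' ', tweet)   # remove URLs
    tweet = re.sub(r'[^a-z\s]', ' ', tweet)           # remove special characters and digits
    return tweet.split()                              # split into individual words

print(clean_and_tokenize("Loving the new update!! https://t.co/xyz @dev"))
# ['loving', 'the', 'new', 'update', 'dev']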
Sentiment Analysis
• Utilized machine learning algorithms, such as Naive Bayes and Support Vector
Machines, to classify the sentiment of each tweet.
• Trained the models using labeled data to accurately predict sentiment.
CHAPTER 5
DATA EXPLORATION
5.1. IMPROVEMENTS:
Starting from the baseline, the goal is to improve the accuracy of the classifier, which is 0.77, in
order to better determine which tweets are positive and which are negative. There are several
ways of doing this, and we present only a few candidate improvements (some of which turn out
not to help). First, we could try to remove what are called stop words. Stop words usually refer
to the most common words in the English language (in our case) such as "the", "of", "to" and so
on. They do not convey any valuable information about the sentiment of a sentence, so it can be
useful to remove them from the tweets in order to keep only the words we are interested in. To
do this we use a list of 635 stop words that we found. The table below shows the most frequent
words in the data set with their counts.
From the table we can derive some interesting statistics, such as the number of times the tags
introduced in the preprocessing step appear.
Recall that ||url|| corresponds to URLs, ||target|| to Twitter usernames preceded by the symbol
“@”, ||not|| replaces negation words, and ||pos|| and ||neg|| replace positive and negative
smileys respectively. After removing the stop words and re-running the classifier, we lose 0.02
in accuracy compared to the previous result, and the number of false positives goes from
126305 to 154015. We conclude that stop words seem to be useful for our classification task,
and removing them does not represent an improvement. We could also try to stem the words in
the data set. Stemming is the process by which endings are removed from words in order to
remove things like tense or plurality. The stem form of a word may not exist in a dictionary
(unlike lemmatization). This technique allows us to unify words and reduce the dimensionality
of the dataset. It is not appropriate for all cases, but it can make it easier to connect different
tenses of a word and see whether they cover the same subject matter. It is also faster than
lemmatization (which removes inflectional endings only and returns the base or dictionary form
of a word, known as the lemma). Using NLTK, a Python library specialized in natural language
processing, we obtain the following results after stemming the words in the data set: we
actually lose 0.002 in accuracy compared to the baseline. We conclude that stemming does not
improve the classifier's accuracy and does not make any noticeable difference.
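A minimal sketch of the two experiments described above, using NLTK's built-in English
stop-word list and the Porter stemmer (the project used its own list of 635 stop words; the NLTK
list is a stand-in here):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

tokens = ['the', 'movies', 'were', 'amazing', 'loved', 'watching', 'them']
no_stops = [w for w in tokens if w not in stop_words]   # drop common function words
stemmed = [stemmer.stem(w) for w in no_stops]           # reduce words to their stems
print(no_stops)   # ['movies', 'amazing', 'loved', 'watching']
print(stemmed)    # ['movi', 'amaz', 'love', 'watch']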
CHAPTER 6
MODELLING
6.1. MODEL DEVELOPMENT
The classifier is built on Bayes' theorem, P(A|B) = P(B|A) × P(A) / P(B). Here,
P(A|B): Posterior probability which means it is the probability of hypothesis A on the
observed event B.
P(B|A): Likelihood probability which means it is the probability of the evidence given that
the probability of a hypothesis is true.
P(A): Prior Probability which means it is the probability of hypothesis before observing the
evidence.
P(B): Marginal Probability which means it is the probability of Evidence.
• The code downloads a dataset from a URL, preprocesses the text data by removing
stopwords, punctuation, and stemming, and then splits the data into training and
testing sets.
• It trains a Multinomial Naive Bayes classifier on the preprocessed training data and
evaluates its performance using accuracy, precision, recall, and F1 score metrics (a condensed
sketch of this step is shown below).
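A condensed sketch of this step, assuming preprocessed train and test DataFrames with 'text'
and 'sentiment' columns (the column names follow the listing in Appendix B; the appendix does
not show which vectorizer was used, so a bag-of-words CountVectorizer is assumed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

vectorizer = CountVectorizer()                      # bag-of-words features
X_train = vectorizer.fit_transform(train['text'])
X_test = vectorizer.transform(test['text'])

classifier = MultinomialNB()
classifier.fit(X_train, train['sentiment'])
y_pred = classifier.predict(X_test)

print("Accuracy:", accuracy_score(test['sentiment'], y_pred))
print("F1 Score:", f1_score(test['sentiment'], y_pred, average='weighted'))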
Error Analysis:
• The code identifies and prints misclassified examples (instances where the predicted
sentiment differs from the actual sentiment) from the test set; a short sketch follows below.
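A short sketch of this step, assuming the test DataFrame and predictions from the step above:

test = test.copy()
test['predicted_sentiment'] = y_pred
misclassified = test[test['sentiment'] != test['predicted_sentiment']]
print(misclassified[['text', 'sentiment', 'predicted_sentiment']].head())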
Flask Web Application:
• When the route is accessed, it returns the rendered template index.html, passing the
sentiment analysis result (sentiment_result) as context, as in the sketch below.
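A minimal sketch of such a route (sentiment_result is a placeholder here; the real application
fills it in from the model's output):

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/content/')
def index():
    sentiment_result = "Positive"   # placeholder value
    return render_template('index.html', sentiment_result=sentiment_result)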
Ngrok Integration:
The code includes functionality to start an Ngrok tunnel (start_ngrok) and to print the URL
(print_ngrok_url) after the tunnel is established.
• Install Software: Get and install libraries such as pandas, numpy, scikit-learn, nltk,
seaborn, matplotlib, and Flask on your computer.
• Create Files: Make a new Python file named "app.py" and a folder named "templates"
where you place an HTML file called "index.html."
• Copy Code: Copy the provided code into "app.py" for data preprocessing, model
training, and setting up a web application using Flask.
• Run Flask: Run your Flask application locally on your computer, which starts a web
server hosting your sentiment analysis app.
• Ngrok Tunnel: As Flask runs, it will automatically start an Ngrok tunnel, a tool that
exposes your local server to the internet.
• Access App: Use the Ngrok URL provided in your terminal or command prompt to
access your sentiment analysis app's interface. You can input text and get sentiment
analysis results displayed on the web page.
CHAPTER 7
RESULTS AND CONCLUSIONS
7.1. RESULTS
Model Performance Metrics:
Accuracy: Percentage of correctly classified instances among all instances.
Precision (weighted average): Precision measures the proportion of true positive predictions
among all positive predictions. The weighted average considers class imbalances.
Recall (weighted average): Recall measures the proportion of true positive predictions among
all actual positives. The weighted average considers class imbalances.
F1 Score (weighted average): F1 score is the harmonic mean of precision and recall. The
weighted average considers class imbalances.
Confusion Matrix:
Visual representation of the classifier's performance across different sentiment classes
(negative and positive).
Each cell shows the number of instances with a given actual sentiment (rows) and a given
predicted sentiment (columns).
Helps in identifying true positives, true negatives, false positives, and false negatives.
Error Analysis:
Identifies misclassified examples where the predicted sentiment does not match the actual
sentiment.
Shows the text, actual sentiment, and predicted sentiment for each misclassified example.
7.2. CONCLUSION
Nowadays, sentiment analysis, or opinion mining, is a hot topic in machine learning. We are
still far from detecting the sentiment of a corpus of texts very accurately because of the
complexity of the English language, and even more so if we consider other languages such as
Chinese. In this project we tried to show a basic way of classifying tweets into a positive or
negative category using Naive Bayes as a baseline, and how language models relate to Naive
Bayes and can produce better results. We could further improve our classifier by trying to
extract more features from the tweets, trying different kinds of features, tuning the parameters
of the Naïve Bayes classifier, or trying another classifier altogether.
REFERENCES
1. https://towardsdatascience.com/twitter-sentiment-analysis-classification-using-nltk-
python-fa912578614c
2. https://www.geeksforgeeks.org/naive-bayes-classifiers/
3. https://www.sciencedirect.com/science/article/pii/S1877050919305557
4. Youtube
Appendix A: Packages, Tools Used & Working Process
Python Programming language
Python is a high-level, interpreted programming language used especially for general-purpose
programming. Python features a dynamic type system and supports automatic memory
management.
It supports multiple programming paradigms, including object-oriented, functional and
procedural, and also has a large and comprehensive standard library. Python has two major
versions, Python 2 and Python 3.
This project uses the latest version, Python 3. The language uses memory management
techniques such as reference counting and a cycle-detecting garbage collector. One of its
features is late binding (dynamic name resolution), which binds method and variable names
during program execution.
Python's design supports some of the constructs used for functional programming in the Lisp
tradition. It provides functions and constructs such as filter, map, list comprehensions,
dictionaries, sets and generator expressions. The standard library includes two modules,
itertools and functools, that implement functional tools taken from languages such as
Standard ML.
Libraries
NumPy
NumPy is the basic package for scientific calculations and computations used along with
Python. NumPy was created in 2005 by Travis Oliphant. It is open source, so it can be used
freely. NumPy stands for Numerical Python, and it is used for working with arrays and
mathematical computations.
Using NumPy in Python gives functionality comparable to MATLAB: both are interpreted,
and both allow users to quickly write fast programs as long as most of the operations work on
arrays and matrices instead of scalars. NumPy is a library consisting of array objects and a
collection of routines for processing those arrays.
NumPy also provides functions for linear algebra, Fourier transforms, and matrix operations.
In a typical scenario, working with NumPy involves searching, joining, splitting and reshaping
arrays.
The syntax for importing the package is import numpy as np, which imports NumPy under
the alias np.
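A small illustrative example of the array operations described above:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)            # (2, 3)
print(a.reshape(3, 2))    # reshape into 3 rows and 2 columns
print(a.mean(axis=0))     # column-wise mean: [2.5 3.5 4.5]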
Pandas
Pandas is used whenever we work with matrix data, time-series data, or, most often, tabular
data. Pandas is an open-source library which provides high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.
It helps greatly in handling large amounts of data with the help of data structures like Series
and DataFrames, and it has built-in methods for reading and manipulating data in formats
such as CSV and HTML.
Simply put, pandas is used for data analysis and data manipulation, and our project works
mainly with DataFrame objects, where a DataFrame is a dedicated structure for
two-dimensional data consisting of rows and columns, similar to database tables and Excel
spreadsheets.
In our code we first import the pandas package under the alias pd and use pd to read the CSV
file into a DataFrame. In the subsequent steps we work on the DataFrames by manipulating
them, and we perform data cleaning by using functions on the DataFrames such as
df.isna().sum(). The whole code depends on the DataFrames obtained through pandas, so this
package plays a key role in our project.
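A small illustrative sketch of these DataFrame operations (the file name is a placeholder, not the
dataset used in the project):

import pandas as pd

df = pd.read_csv('tweets.csv')    # placeholder file name
print(df.head())                  # inspect the first rows
print(df.isna().sum())            # count missing values per column
df = df.dropna()                  # simple cleaning step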
Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension, NumPy. Matplotlib is most commonly used for visualization and data
exploration, presenting statistics clearly through different visual structures such as bar plots,
scatter plots, histograms, etc.
Matplotlib is the foundation for many visualization libraries, and it offers great flexibility with
regard to formatting and styling plots. We can freely choose how to display labels, grids,
legends, etc.
In our code we first import matplotlib.pyplot under the alias plt. This plt comes into play in the
exploratory data analysis part to analyze and summarize datasets using visual methods; we use
plt to add characteristics to figures such as the title, legends, and labels on the x and y axes,
and, to understand the data more clearly, we can also use different kinds of plots.
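A small illustrative example of a basic plot with matplotlib (the counts are made up):

import matplotlib.pyplot as plt

counts = {'Negative': 120, 'Positive': 180}   # illustrative class counts
plt.bar(list(counts.keys()), list(counts.values()))
plt.title('Tweet sentiment distribution')
plt.xlabel('Sentiment')
plt.ylabel('Number of tweets')
plt.show()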
Seaborn
Seaborn is used for drawing attractive statistical graphics with just a few lines of code. In
other words, seaborn is a data visualization library based on matplotlib and closely integrated
with pandas data structures in Python. Visualization is the central theme of Seaborn, which
helps in the exploration and understanding of data.
Plots are used for visualizing the relationship between variables. Those variables can be
numerical or categorical.
Using Seaborn, we can also plot wide varieties of plots like Distribution plots, Pie chart and
bar chart, Scatter plots, Pair plots, Heat maps.
In our code we use the seaborn library to visualize the model's results: sns.heatmap is used to
plot the confusion matrix of the classifier, which makes it easy to see how many tweets of each
actual sentiment were predicted as negative or positive.
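A small sketch of the heatmap usage described above, with an illustrative 2×2 matrix standing in
for the real confusion matrix:

import seaborn as sns
import matplotlib.pyplot as plt

matrix = [[90, 10], [15, 85]]   # illustrative counts, not project results
sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()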
Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python.
Scikit-learn is an efficient, beginner-friendly tool for predictive data analysis. It provides a
selection of tools for machine learning and statistical modeling, including classification,
regression, clustering and dimensionality reduction, via a consistent interface in Python. This
library is built upon other libraries such as NumPy, SciPy and Matplotlib.
Scikit learn is used when identifying to which category an object likely belongs, predicting
continuous values and grouping of similar objects into clusters.
In our code, sklearn plays an important role in the classification algorithm, the final result and
its performance; the accuracy of the algorithm is computed using sklearn. The modules
imported from the sklearn library are train_test_split, the Multinomial Naive Bayes classifier
(MultinomialNB), and the evaluation metrics. train_test_split splits the data into random
training and testing subsets. We used the Multinomial Naive Bayes classification algorithm to
classify the sentiment labels of the tweets. From sklearn.metrics we use accuracy_score,
precision_score, recall_score, f1_score and confusion_matrix to evaluate the model; the
confusion matrix in particular is used to assess the performance of the classification model
through measures such as precision, recall and the F1 score.
Supervised Learning algorithms − Supervised learning is one of the machine learning
approaches in which models are trained using correctly labelled training data, and on the basis
of that the models predict the output. Almost all of the popularly known supervised learning
algorithms, such as Linear Regression, Support Vector Machine (SVM), Decision Tree and
Naïve Bayes, are part of scikit-learn.
Unsupervised Learning algorithms − Unsupervised learning is another machine learning
approach, in which models are not supervised using training data. Instead, the model itself
finds the hidden patterns and insights in the given data.
Scikit-learn also has all the popular unsupervised learning algorithms, from clustering, factor
analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
Clustering − This model can be used to group unlabelled data.
Cross Validation − This process is used to check the accuracy of supervised models on unseen
data.
Dimensionality Reduction – Dimensionalities are nothing but attributes of the data. This
step helps in reducing the number of attributes in data which can be used further for tasks like
feature selection, visualization and summarization.
Ensemble methods – Ensemble means to combine. These methods combine various
predictions of multiple supervised models.
Feature extraction – This step is used to define attributes by extracting the features from the
dataset having data of any form.
Feature selection – The extracted feature set may contain many features, some of which are
not useful. Feature selection is the process of identifying the important features for the
creation of supervised models.
List of cells: a notebook contains three different types of cells: markdown (display text), code
(to execute), and output.
Appendix B: Sample Source Code with Execution
Source code:
import pandas as pd
import zipfile
import requests
import io
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
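# NOTE: the download/load step is missing from the original listing. The lines below are a
# hedged reconstruction; DATA_URL is a placeholder, not the URL used in the project, and the
# loaded DataFrame is assumed to have 'sentiment' and 'text' columns.
DATA_URL = "https://example.com/tweets.zip"
response = requests.get(DATA_URL)
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    data = pd.read_csv(archive.open(archive.namelist()[0]), encoding='latin-1')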
# Display the first few rows of the dataset to inspect its structure and column names
print(data.head())
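# The split into train and test sets is not shown in the original listing; an 80/20 split is assumed.
train, test = train_test_split(data, test_size=0.2, random_state=42)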
train = train[train.sentiment != 2]
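# The header and most of the body of the preprocessing function are cut off in the original
# listing; the lines below are a hedged reconstruction using NLTK stop words and a Porter
# stemmer (nltk.download('stopwords') may be required once).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    text = re.sub(r'http\S+|[^a-zA-Z\s]', ' ', str(text).lower())            # strip URLs and punctuation
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]   # remove stop words and stem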
    text = ' '.join(words)
    return text
# Apply preprocessing to the text data in your train and test sets
train['text'] = train['text'].apply(preprocess_text)
test['text'] = test['text'].apply(preprocess_text)
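# The vectorizer definition is not shown in the original listing; a bag-of-words CountVectorizer
# is assumed here.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()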
X_train = vectorizer.fit_transform(train['text'])
y_train = train['sentiment']
X_test = vectorizer.transform(test['text'])
y_test = test['sentiment']
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
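# The classifier is never instantiated or fitted in the original listing; this step is assumed.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)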
# Predictions
y_pred = classifier.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Blues", xticklabels=['Negative',
'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Error Analysis
test['predicted_sentiment'] = y_pred
misclassified = test[test['sentiment'] != test['predicted_sentiment']]
print("Misclassified examples:")
print(misclassified[['text', 'sentiment', 'predicted_sentiment']])
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted') * 100
recall = recall_score(y_test, y_pred, average='weighted') * 100
f1 = f1_score(y_test, y_pred, average='weighted') * 100
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Blues", xticklabels=['Negative',
'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# Error Analysis
misclassified = test[test['sentiment'] != test['predicted_sentiment']]
print("Misclassified Examples:")
print(misclassified[['text', 'sentiment', 'predicted_sentiment']])
from flask import Flask, render_template, request
app = Flask(__name__)
@app.route('/content/')
def index():
    # Your sentiment analysis code here (assume 'sentiment_result' holds the data)
    return render_template('index.html', sentiment_result=sentiment_result)
import subprocess
from threading import Timer

def start_ngrok():
    ngrok_command = "ngrok http 5000"
    ngrok_proc = subprocess.Popen(ngrok_command.split(), stdout=subprocess.PIPE)
    # Print Ngrok URL after 2 seconds (adjust if necessary)
    Timer(2, print_ngrok_url, args=[ngrok_proc]).start()

def print_ngrok_url(proc):
    try:
        ngrok_url = proc.stdout.readlines()[1].strip().decode("utf-8")
        print("Ngrok URL:", ngrok_url)
    except Exception:
        print("Ngrok URL not found.")

if __name__ == '__main__':
    start_ngrok()  # Start Ngrok tunnel
    app.run()
Execution output: