
Sentiment Analysis Project

Session: 2021-2025

Submitted By:
Mahwish Noreen (2021-CS-29)

Supervised By:
Dr. Usman Ghani

Department of Computer Science


University of Engineering and Technology, Lahore
Pakistan
Natural Language Processing

Contents
1 Abstract

2 Introduction
2.1 Problem Statement
2.2 Objectives

3 Scope of the Project

4 Research Gaps Addressed
4.1 What’s missing?
4.2 What’s new about it?

5 Metrics and Literature Review
5.1 Evaluation Metrics:
5.2 Literature Review:

6 Comparison with Existing Works

7 Methodology
7.1 Data Collection
7.2 Data Preprocessing
7.3 Feature Extraction
7.4 Model Training
7.5 Model Saving
7.6 User Interface (UI)

8 Results and Discussion

9 Conclusion

10 References

Mahwish Noreen December 2024



1 Abstract
This project focuses on developing a Sentiment Analysis model designed to
classify movie reviews as either positive or negative based solely on their
textual content. Sentiment Analysis, a key application within Natural Language
Processing (NLP), is increasingly valuable for understanding public opinion
across various domains. It enables businesses, analysts, and researchers to
gauge sentiments in consumer feedback, social media interactions, and product
reviews.
In this project, we train a Naive Bayes machine learning model using a labeled
dataset of movie reviews, achieving an accuracy of 85%. The project also
implements a user interface (UI) using the Streamlit framework, and the model
is saved using the Joblib library for efficient deployment. This comprehensive
project covers data preprocessing, feature extraction, model training,
evaluation, and deployment in a user-friendly interface.


2 Introduction
Sentiment Analysis is a fundamental task in Natural Language Processing (NLP)
that deals with identifying the sentiment or emotion expressed in textual data.
With the rapid growth of user-generated content on platforms like social media
and review sites, sentiment analysis has become essential for businesses and
analysts to understand customer opinions and feedback.
In this project, the primary dataset used is the IMDB Movie Reviews Dataset,
a widely recognized labeled dataset containing positive and negative movie
reviews. The goal is to develop an effective and accurate model that can
automatically classify movie reviews into positive or negative sentiments,
aiding in real-time analysis of public opinion.

2.1 Problem Statement


With the increasing volume of online movie reviews, understanding public
sentiment towards films has become essential for filmmakers, producers, and
marketers. Analyzing the sentiment of movie reviews can provide valuable
insights into audience perception and help in making informed decisions. The
challenge is to automatically classify the sentiment of reviews into categories
such as “positive” or “negative” using natural language processing (NLP)
techniques.
This project focuses on the development of a Sentiment Analysis model that
can classify movie reviews as either positive or negative, utilizing the IMDB
Movie Reviews Dataset. The objective is to preprocess the data, train an
appropriate machine learning model, evaluate its performance, and implement a
user interface for real-time sentiment analysis.

2.2 Objectives
The primary objectives of this project include:

• Developing a sentiment analysis model to classify reviews as positive or
negative.
• Using the Naive Bayes model for sentiment classification, known for its
simplicity and efficiency.
• Achieving an accuracy of 85% on the IMDB dataset.
• Saving the trained model using the Joblib library for future use and easy
deployment.
• Developing an interactive user interface (UI) using Streamlit for real-time
predictions, making the model accessible to users without any technical
knowledge.


3 Scope of the Project


This project focuses solely on binary classification (positive and negative
sentiments) of text-based movie reviews. While the project could be expanded
to handle multi-class sentiment analysis or other domains, the current scope is
limited to movie reviews from the IMDB dataset. The project aims to build,
evaluate, and deploy the Naive Bayes model in a user-friendly interface for
quick, real-time predictions.

4 Research Gaps Addressed


4.1 What’s missing?
Many sentiment analysis models lack robust preprocessing pipelines or fail to
provide real-time predictions in a user-friendly format. This project addresses
these gaps by establishing a comprehensive preprocessing pipeline and
incorporating a real-time prediction feature using Streamlit.

4.2 What’s new about it?


A combination of TF-IDF and Naive Bayes has been utilized to develop a
lightweight yet effective model, achieving solid results. Additionally, a
user-friendly web interface has been implemented for real-time sentiment
analysis.

5 Metrics and Literature Review


5.1 Evaluation Metrics:
The model was evaluated using the following metrics:

• Accuracy: The percentage of all predictions that are correct.
• Precision: The proportion of reviews predicted as positive that are actually
positive.
• Recall: The proportion of actual positives correctly identified.
• F1-Score: The harmonic mean of precision and recall, balancing both metrics.
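The four metrics can be computed directly from the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The following is a minimal sketch in plain Python; the function name and the six example labels are hypothetical and are not taken from the project's data or code.

```python
def classification_metrics(y_true, y_pred, positive="positive"):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for illustration only; not the project's data.
y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative", "negative"]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 2/3.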

5.2 Literature Review:


While advanced models like LSTM and BERT offer high performance, they
often require significant computational resources. In contrast, this approach
is simpler and achieves competitive results, particularly suitable for real-time
applications or smaller datasets.


6 Comparison with Existing Works


• What’s different? Existing models may deliver higher accuracy but
at the cost of substantial computational power. This model achieves
approximately 85% accuracy while being faster and more resource-efficient.
Moreover, the integration with Streamlit ensures a seamless and intuitive
user experience, an often-overlooked aspect in similar projects.

7 Methodology
The project workflow follows a structured approach, as outlined below:

7.1 Data Collection


The IMDB Movie Reviews Dataset is used for this project. It consists of 50,000
labeled movie reviews, equally divided between positive and negative sentiments.
This dataset provides the foundation for training and evaluating the model.

7.2 Data Preprocessing


Data preprocessing plays a crucial role in ensuring the quality of the input data
for machine learning models. The following steps were applied to the text data:
• Tokenization: Reviews were split into individual words or tokens.
Tokenization gives the later steps discrete units that can be filtered,
normalized, and counted.
• Stopword Removal: Common words such as “the”, “and”, and “is” were
removed, as they do not contribute significantly to sentiment classification.
• Stemming/Lemmatization: Words were reduced to their root forms to
handle variations such as “running” and “ran” being treated as “run.”
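The three steps above can be sketched in plain Python as follows. This is an illustration only, not the project's actual pipeline: the stopword list is deliberately tiny, the suffix-stripping rule is far cruder than a real stemmer, and the `IRREGULAR` lookup is a hypothetical stand-in for lemmatization of forms such as “ran”.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "it", "was", "in"}

# Hypothetical lookup for irregular forms that suffix stripping cannot handle.
IRREGULAR = {"ran": "run"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Very rough suffix stripping; not a full stemming algorithm."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            root = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if root[-1] == root[-2]:
                root = root[:-1]
            return root
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then reduce each token to a root form."""
    return [stem(tok) for tok in tokenize(text) if tok not in STOPWORDS]


tokens = preprocess("The actor was running and the plot ran slowly")
```

Both “running” and “ran” end up as “run”, which is the behaviour the last bullet describes. A rule this simple will mangle some words, so a real project would normally rely on an established stemmer or lemmatizer.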

7.3 Feature Extraction


The next step involved converting the text data into numerical representations
that can be used by machine learning models. The Term Frequency-Inverse
Document Frequency (TF-IDF) method was employed. This technique helps
identify the most important words in the reviews by considering both the
frequency of words within a document and the rarity of words across the entire
dataset. TF-IDF thus emphasizes meaningful words while downplaying commonly
used ones.
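As a rough illustration of the weighting, the sketch below computes one common textbook variant of TF-IDF: term frequency as a term's share of the document, multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The source does not say which variant or library the project used, so the formula, the function name, and the three toy documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenized reviews.
docs = [
    ["great", "film", "great", "cast"],
    ["boring", "film"],
    ["great", "story"],
]
weights = tf_idf(docs)
```

In the first document, “cast” appears in only one review and so outweighs “film”, which appears in two. With this variant a term that occurs in every document receives a weight of zero; library implementations often add smoothing to avoid that.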


7.4 Model Training


The Naive Bayes classifier was chosen for training the model due to its
simplicity, speed, and effectiveness in text classification tasks. This
probabilistic model assumes that the features (words) are independent of one
another given the class. The assumption rarely holds exactly for natural
language, but the classifier still tends to perform well on text in practice.
The model was trained on the preprocessed data and evaluated using standard
metrics such as accuracy, precision, recall, and F1-score.
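To make the independence assumption concrete, here is a minimal multinomial Naive Bayes sketch in plain Python. It is not the project's implementation: the class name, the Laplace smoothing parameter `alpha`, and the four training reviews are hypothetical, and it works on raw token counts for brevity rather than on TF-IDF weights.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over token counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in documents for term in doc}
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.term_counts[label].update(doc)
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(class_counts[c] / len(labels))
                          for c in self.classes}
        self.total_terms = {c: sum(self.term_counts[c].values())
                            for c in self.classes}
        return self

    def _log_likelihood(self, term, label):
        # Smoothed estimate of P(term | label).
        numerator = self.term_counts[label][term] + self.alpha
        denominator = self.total_terms[label] + self.alpha * len(self.vocab)
        return math.log(numerator / denominator)

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Independence assumption: per-word log-likelihoods simply add up.
            scores[c] = self.log_prior[c] + sum(
                self._log_likelihood(t, c) for t in doc if t in self.vocab
            )
        return max(scores, key=scores.get)


# Hypothetical, already-preprocessed training reviews.
train_docs = [["great", "acting"], ["loved", "great", "story"],
              ["boring", "plot"], ["terrible", "boring", "acting"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = MultinomialNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["great", "story"])
```

Words never seen in training are skipped at prediction time, and the smoothing term keeps a word that is missing from one class from driving that class's probability to zero.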

7.5 Model Saving


To make the model easy to deploy and use, the trained Naive Bayes model was
saved using the Joblib library. Joblib allows the model to be serialized and
stored, ensuring it can be reloaded without retraining every time it is used.
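A save-and-reload round trip might look like the sketch below, which uses `joblib.dump` and `joblib.load`. The file name, the placeholder `artifacts` dictionary, and the idea of storing the vocabulary next to the model are assumptions for illustration. The fallback to the standard-library `pickle` is only there so the sketch runs where Joblib is not installed.

```python
import os
import tempfile

try:
    import joblib  # third-party library named in the text
    dump, load = joblib.dump, joblib.load
except ImportError:
    # Fallback so the sketch still runs without joblib installed.
    import pickle

    def dump(obj, path):
        with open(path, "wb") as fh:
            pickle.dump(obj, fh)

    def load(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)

# Placeholder standing in for the trained classifier and fitted vocabulary.
artifacts = {
    "model": {"log_prior": {"positive": -0.69, "negative": -0.69}},
    "vocabulary": ["great", "boring"],
}

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sentiment_model.joblib")
dump(artifacts, path)      # serialize once, after training
restored = load(path)      # reload later without retraining
```

Whatever object performs feature extraction generally needs to be saved along with the classifier, since new reviews must be transformed in the same way as the training data before prediction.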

7.6 User Interface (UI)


To enhance accessibility and usability, a web-based UI was created using the
Streamlit framework. Streamlit enables rapid development of interactive
applications, allowing users to input text (movie reviews) and receive
real-time sentiment predictions from the model. The application was designed
to be intuitive and user-friendly, ensuring that non-technical users can
interact with it effectively.
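A page of this kind can be sketched in a few lines of Streamlit. The layout, widget labels, and function names below are hypothetical and do not come from the project's code. `predict_proba` is assumed to be any callable that maps review text to the probability of the positive class.

```python
def label_from_probability(prob_positive, threshold=0.5):
    """Map a positive-class probability to a display label."""
    return "positive" if prob_positive >= threshold else "negative"


def render_app(predict_proba):
    """Draw the page; `predict_proba` maps review text to P(positive)."""
    # Imported lazily so the helper above can be used without Streamlit.
    import streamlit as st

    st.title("Movie Review Sentiment")
    review = st.text_area("Enter a movie review")
    if st.button("Predict"):
        if not review.strip():
            st.warning("Please enter some text first.")
        else:
            prob = predict_proba(review)
            label = label_from_probability(prob)
            st.write(f"Sentiment: {label} ({prob:.2f})")
```

To use it, a script such as `app.py` (a hypothetical name) would load the saved model, wrap it in a function with the assumed signature, call `render_app(...)` at module level, and be launched with `streamlit run app.py`.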


8 Results and Discussion


The Naive Bayes model achieved an accuracy of 85% on the IMDB dataset,
which is competitive for a text classification task using a simple model. Below
are the results of the evaluation:
• Accuracy: 0.85
• Precision: 0.84
• Recall: 0.85
• F1-score: 0.844
The model performed well in identifying both positive and negative
sentiments, though there are areas for improvement. For example, some reviews
that were sarcastic or ambiguous were classified incorrectly due to the inherent
limitations of the Naive Bayes model. The next steps could involve
experimenting with more advanced models or incorporating additional features
to improve accuracy.

9 Conclusion
The project successfully developed a sentiment analysis model capable of
classifying movie reviews as positive or negative with an accuracy of 85%. The
model was saved using the Joblib library, making it easy to deploy and use in
future applications. Additionally, the Streamlit-based user interface offers a
simple way for users to interact with the model and get real-time predictions.
While the model’s accuracy is satisfactory, further improvements can be
made by experimenting with other machine learning models (e.g., Support Vector
Machines, Logistic Regression) or deep learning approaches (e.g., LSTMs or
Transformers). Additionally, expanding the model to handle multi-class
sentiment classification or domain-specific reviews could further enhance its
utility.

10 References
• IMDB Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data
• Relevant research papers and articles on sentiment analysis, TF-IDF, and
Naive Bayes classifier.
• Streamlit documentation: https://docs.streamlit.io/
• Joblib documentation: https://joblib.readthedocs.io/en/latest/
