Software Engineering - Project Proposal

The document outlines a project proposal for conducting sentiment analysis using machine learning techniques, aimed at leveraging social media data to improve business insights and brand reputation. It details the problem of underutilized customer feedback, the objectives of developing a sentiment analysis framework, and the methodologies for data collection, preprocessing, and model training. The significance of the project includes enhancing customer understanding, informed decision-making, and providing a competitive advantage across various industries.

Uploaded by

mwenyalightson7

CAVENDISH UNIVERSITY ZAMBIA

Faculty of Information and Communications Technology (ICT)

PROJECT TITLE:
Sentiment Analysis using Machine Learning

“Software Engineering”
ID:001-767
2024

TABLE OF CONTENTS
Abstract
CHAPTER ONE
INTRODUCTION
Problem analysis/statement
Purpose of the study/Objective(s)
Definition of (unfamiliar) terms
Scope of study
Significance of the Research/Project
CHAPTER TWO - LITERATURE REVIEW AND THEORETICAL FRAMEWORK
Overview (A short NOTE)
Literature Review
CHAPTER THREE
METHODOLOGY AND SYSTEM (RESEARCH) DESIGN
Requirements
Tools used
Requirements
Functional requirement
Non-Functional Requirements
System Architecture
System Flowchart
System Design
REFERENCES

ABSTRACT
CHAPTER ONE

INTRODUCTION
Social media allows users to create, share, and engage with content and with each other, so businesses
can use these platforms for marketing, customer engagement, brand promotion, and product
improvement. Social media has become so ubiquitous that it is now a rich repository of people’s
opinions about brands, news headlines, and feedback.
Businesses and politicians can therefore monitor their online reputation on these platforms,
address issues effectively, and build their brand by applying AI-based sentiment analysis to the
large amount of data generated on social media.

PROBLEM ANALYSIS
Although organizations implement various feedback systems, such as websites and social media platforms that
collect star ratings, comments, reviews, and opinions about a product, service, or brand, the volume of
user-generated text remains underutilized. Employees become frustrated because they see potential in the data
but lack the means to derive value from it. Further, customers become more dissatisfied at the lack of response,
or the disproportionate response, to their expressed concerns, which negatively affects the company’s brand
reputation.
Failing to understand and quantify the expressed sentiments leads
to the following:
 Difficulty in fully comprehending public opinion
 Missing or inaccurate customer insights
 Ineffective marketing strategies and product development efforts
 Reactive rather than proactive reputation management
 Misplaced policy and governance priorities
Hence the need for a means of using online customer reviews meaningfully through the application of AI.
PURPOSE OF THE STUDY/OBJECTIVES
The research aims to develop a robust sentiment analysis framework to assess and interpret
customer reviews across various industries. The specific objectives are:
 To develop a sentiment analysis model that classifies product reviews as positive,
negative, or neutral.
 To leverage text data to understand and quantify emotions, opinions, and attitudes
expressed by individuals
 To explore how sentiment analysis can be used to extract meaningful insights from
textual data
 To clean and prepare the collected text data for sentiment analysis by transforming it into
a usable format
 To apply the chosen sentiment analysis model to classify the sentiment of text data into
predefined categories (e.g., positive, neutral, negative)
 To choose an appropriate machine learning or deep learning model for sentiment analysis
based on data and project requirements
 To assess the performance of the sentiment analysis model using appropriate metrics and
validation techniques
 To analyze and visualize the sentiment in the text data

DEFINITION OF (UNFAMILIAR) TERMS:


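Sentiment Analysis - a subfield of natural language processing (NLP), also called opinion mining, that determines the emotional tone behind a body of text.
DistilBERT - a smaller, faster Transformer language model distilled from BERT.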

SCOPE OF STUDY
Data sources: the text data for training comes from customer reviews in a public online dataset, i.e.
Kaggle.
Sentiment Categories
 Binary (positive, negative).
Textual Analysis Techniques
 Employ various NLP methods, including:
o Tokenization and normalization
o Part-of-speech tagging
o Named entity recognition

The limitations and Challenges


We acknowledge potential limitations such as:
 Entities in the text, such as product categories, topics, or themes, will not be
identified.
 Nuanced expressions such as sarcasm, irony, and slang are not handled.
 The model may not perform well in another domain because language use differs.
 The available dataset may be imbalanced, with score distributions skewed positively or
negatively.
 Reviews may use domain-specific terminology.
SIGNIFICANCE OF THE PROJECT:

 Enhanced Customer Understanding


By analyzing customer reviews, businesses can gain a deeper understanding of customer
preferences, needs, and pain points. Insight into consumer behavior helps
businesses tailor their products and services to better meet customer expectations.
 Informed Decision-Making
Insights from sentiment analysis can guide product improvements and innovations based
on direct customer feedback. Businesses can also target marketing campaigns that
resonate with customer sentiments, enhancing engagement and conversion rates.
 Competitive Advantage
By monitoring sentiment trends, companies can identify gaps in the market and adjust
their offerings accordingly, gaining an edge over competitors. Early detection of negative
sentiments allows businesses to address issues proactively, maintaining a positive brand
image.
 Real-Time Feedback Loop
Automated sentiment analysis can provide real-time insights, enabling companies to react
quickly to customer concerns and feedback. Regular monitoring of customer sentiment
helps businesses adapt and evolve based on ongoing consumer feedback.
 Cost Efficiency
Automating sentiment analysis reduces the time and resources spent on manual review
assessment, allowing teams to focus on strategic initiatives. Sentiment analysis can easily
scale to handle large volumes of data from multiple sources, making it a viable solution
for businesses of all sizes.
 Data-Driven Strategy Development
Sentiment analysis provides quantifiable data that supports strategic decision-making,
reducing reliance on intuition alone. Businesses can establish metrics to measure
customer sentiment over time, helping to assess the impact of changes in products,
services, or policies.
 Cross-Industry Applications
Sentiment analysis is versatile enough to apply across various
sectors, including retail, hospitality, technology, and healthcare, making its insights
valuable to a broad audience. It can also be used in industry benchmarking by allowing
businesses to compare their sentiment scores against industry standards or competitors.
 Customer Engagement and Loyalty
Understanding and addressing customer sentiments fosters better relationships, increasing
customer loyalty and retention. Businesses can leverage sentiment insights to create
personalized interactions, enhancing customer satisfaction.
 Academic and Research Contributions
The methodologies and findings from a sentiment analysis project can contribute to
academic research in fields like consumer psychology, marketing, and artificial
intelligence. Additionally, the project can lead to the development of new or improved
natural language processing algorithms, advancing the field.

CHAPTER TWO
OVERVIEW
Sentiment analysis, or opinion mining, is a subfield of natural language processing (NLP) that focuses on
determining the emotional tone behind a body of text. With the rise of digital platforms, users generate vast
amounts of customer feedback, making sentiment analysis a vital tool for businesses that want to understand
consumer perceptions and improve their products and services. This research proposal aims to develop a
robust sentiment analysis framework that classifies customer reviews and extracts actionable insights for
the business.

LITERATURE REVIEW:
Sentiment analysis is an important task in natural language processing (NLP) and data mining. It involves
extracting and analyzing subjective information from textual data to determine the sentiment or opinion
expressed. A systematic review of the literature shows that various
techniques and approaches have been developed and tested for sentiment analysis. Commonly
used techniques include rule-based, classification-based, and deep learning-based
methods.
Brief Background of Sentiment Analysis
Initial studies focused on rule-based methods that use lexicons, or sentiment dictionaries,
containing words and their associated sentiments. These methods assign scores to words and aggregate them
to determine the overall sentiment (Hu and Liu, 2004).
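The rule-based idea can be illustrated with a minimal sketch. The lexicon below is a hypothetical toy dictionary invented for illustration, not a published sentiment lexicon, and the function names are assumptions.

```python
import re

# Hypothetical toy lexicon (word -> polarity score); not taken from any published resource.
LEXICON = {"good": 1, "great": 2, "excellent": 2, "bad": -1, "poor": -1, "terrible": -2}

def lexicon_score(text):
    """Sum the polarity scores of every lexicon word found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)

def lexicon_label(text):
    """Map the aggregate score to a sentiment label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_label("Great phone, excellent battery"))
print(lexicon_label("Terrible screen and poor support"))
```

A scorer like this ignores negation and context ("not good" still scores as positive), which is one reason later work moved to learned models.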
The second wave relied on advances in machine learning, which marked a significant
shift by applying supervised learning methods such as Naïve Bayes and SVM trained on labeled
datasets, e.g., Twitter sentiment datasets.
Today, deep learning approaches leverage neural network architectures such as LSTM
and CNN to capture complex patterns in textual data, or Transformer models (BERT and GPT)
to provide contextual embeddings of the text.
Sentiments can be positive, negative, or neutral emotions, attitudes, evaluations, or opinions
toward an entity such as a product, service, brand, event, or issue. Sentiment analysis involves
data acquisition, text cleaning, pre-processing, feature engineering, modeling,
evaluation, deployment, monitoring, and model updates.

Levels of Sentiment Analysis


There are three levels of sentiment analysis:
Document level: the task at this level is to classify whether a whole opinion document expresses
positive or negative sentiment. For example, given a product review, the system determines
whether the review expresses an overall positive or negative sentiment about the product. It
assumes that each document expresses an opinion on a single entity, e.g. a single product.
Sentence level: at this level, the document is broken down into sentences, and
each sentence is treated as a unit and analyzed on its own. The system determines
whether each sentence expresses a positive, negative, or neutral sentiment. This is closely
related to subjectivity classification.
Aspect level: the main task is to extract aspect terms for a product and then analyze customer feedback
against the extracted aspects (Ray & Chakrabarti, 2022).

Sentiment analysis has gained significant traction in recent years, with researchers exploring
various approaches to optimize its performance. A number of studies have focused on the
optimization of sentiment analysis using machine learning classifiers, highlighting the
importance of feature selection and algorithm tuning in achieving accurate sentiment
classification. For example, one study delved into the application of sentiment analysis in the e-
commerce sector, demonstrating the industry's keen interest in understanding consumer opinions
and its impact on decision-making. (- & -, 2023) Another study explored the use of advanced
machine learning techniques, such as deep learning, to further enhance the accuracy and
robustness of sentiment analysis models. Overall, the existing literature underscores the growing
importance of sentiment analysis in various domains, particularly in the e-commerce industry
where understanding customer sentiment can provide valuable insights for business strategy and
decision-making.

CHAPTER THREE
METHODOLOGY AND RESEARCH DESIGN
Developing a sentiment analysis project involves several steps, from understanding the problem
to deploying the model for real-time use. The following methodology is used to guide the entire
development process, from gathering data to model deployment.
Sentiment Analysis Pipeline
1. Problem Definition and Requirements Gathering
 Business goal: to analyze product reviews and assess product performance on the
market
 Define the scope: the sentiment analysis model addresses a binary sentiment
classification problem
 Set success criteria: the metrics for a successful model include precision, recall, and
F1-score
2. Data Collection and Preprocessing
Before performing sentiment analysis, the data must be collected and cleaned.
Steps:
Data Collection: the collected data will be used as the training dataset for the model. Its quantity and
quality have a direct bearing on the performance of the model. Data collection is essentially the
process of gathering data from various sources:
o Public datasets, e.g. Kaggle, the UCI Machine Learning Repository, government
databases
o Web scraping from websites using tools such as BeautifulSoup or Scrapy
o APIs from services like Twitter, Facebook, weather APIs, financial APIs, etc. to
gather real-time data
o Surveys and user input: directly collecting data from users through forms and
surveys
o IoT devices: data from sensors and devices in real-time applications
o Synthetic data generation
Data format: the data is in a structured format (CSV) with text (reviews) and sentiment
labels (positive, negative).
Preprocessing: raw data can be difficult or impossible to work with, especially for
sentiment analysis. All collected data needs to be prepared and cleaned prior to analysis.
Depending on how the data is collected and what it contains, this process may include:
o Handling missing values, e.g. by imputation with mean or mode values
o Removing duplicates to avoid a skewed distribution
o Removing noise: HTML tags, URLs, extra whitespace, numbers, special
characters, and punctuation
o Lowercasing: converting all text to lowercase to avoid case-sensitive
discrepancies
o Tokenization: splitting text into individual words or tokens
o Stop-word removal: removing common words such as “and” and “the” that do not
contribute to sentiment
o Stemming/Lemmatization: reducing words to their base or root form
o Stemming strips affixes (mostly suffixes) to get a basic form.
o Lemmatization maps words to their base forms using a dictionary of
existing words.
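The text-cleaning steps above can be sketched with the standard library alone. The stop-word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins; a real project would more likely use the stop-word lists and stemmers or lemmatizers from nltk or spacy.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "was", "i"}

def naive_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "ly", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Clean one review and return its list of normalized tokens."""
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                                  # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # remove numbers, punctuation, symbols
    tokens = text.split()                                # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in tokens]               # crude stemming

print(preprocess("<p>I LOVED this phone!!! Visit https://example.com</p>"))
```

Note that the crude stemmer turns "loved" into "lov", which is typical of stemming: the result need not be a dictionary word, unlike lemmatization.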
3. Feature Extraction: this process, also known as feature engineering, involves selecting
and transforming data into a numerical representation so that it can be fed into a machine
learning model.
The methods include:
o Bag of Words (BoW): represents the text as an unordered collection of words and their
frequency counts. Each unique word is treated as a feature, and the frequency of each
word in the document is recorded.
o TF-IDF (Term Frequency-Inverse Document Frequency): weighs terms based on
their significance across the dataset. It reduces the weight of words that are common
across documents and increases the weight of rare words.
o N-grams: sequences of “n” contiguous items from a given text. These items can be
words, characters, or symbols. N-grams can be used as features in machine
learning models, where each unique n-gram represents a dimension in the
feature space and its frequency can be counted.
o Part-of-speech tagging: extracting grammatical tags (nouns, verbs, adjectives)
to gain insights into the structure of the text, which can indicate sentiment.
o Word embeddings: representing words as continuous vectors using methods like
Word2Vec, GloVe, or FastText
o Transformers: using contextualized embeddings from BERT- or GPT-based models
for more accurate representation.
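The count-based representations above (Bag of Words, n-grams, TF-IDF) can be sketched in a few lines. This is an illustration under stated assumptions: it uses one common TF-IDF variant, tf = count / document length and idf = log(N / df), whereas libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Map each word to its raw frequency in one document."""
    return Counter(tokens)

def tf_idf(docs):
    """Per-document TF-IDF weights with tf = count / len(doc), idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights

# Toy tokenized reviews, invented for illustration.
docs = [["good", "phone", "good", "battery"], ["bad", "phone"], ["good", "screen"]]
print(bag_of_words(docs[0]))
print(ngrams(docs[0], 2))
print(tf_idf(docs)[1])
```

In the second review, "bad" appears in only one document and so receives a higher weight than "phone", which appears in two.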
4. Model Training/Selection
Data Splitting: before training the model, the dataset is typically split into two or
more subsets.
o Training set: used to train the model
o Validation set: used to tune hyperparameters and evaluate the model during
training (optional but recommended)
o Test set: used to evaluate the final performance of the model after training.
The test set should never be seen during training, so that it gives an honest estimate
of how well the model generalizes.
Common split ratios
 80-20: 80% for training, 20% for testing
 70-15-15: 70% for training, 15% for validation, and 15% for testing
Choosing a Model
Depending on the type of problem (regression, classification, clustering, etc.) we choose
an appropriate model
o Linear Models: e.g., Linear Regression, Logistic Regression.
o Tree-Based Models: e.g., Decision Trees, Random Forest, Gradient Boosting
Machines (GBM), XGBoost.
o Support Vector Machines (SVM).
o Neural Networks: e.g., Deep Learning models (CNN, RNN, Transformers).
o k-Nearest Neighbors (k-NN).
o Naive Bayes.
In our case, we have selected the DistilBERT model.
Model Initialization
Each machine learning model has certain settings, known as hyperparameters, which
control aspects of the model’s learning (fitting) process. Examples include:
Learning rate: controls how much the model weights change with each update
Number of trees in a Random Forest: determines how many trees the model will build
Hidden layers and units in a neural network: determine the architecture of the
network
Kernel type in an SVM: specifies the type of kernel used
The model is initialized with the default values of these hyperparameters.

Model Training
This involves feeding the model input data along with corresponding labels (in
supervised learning). The model adjusts its internal parameters (e.g., weights) based on
this data to minimize a loss function.

Loss Function: a function that quantifies the error between the model’s prediction and the
true value. The goal of training is to minimize the loss function.
Regression: MSE, MAE, RMSE (R2 is a related goodness-of-fit metric rather than a loss)
Classification: cross-entropy loss

Optimization: an optimization algorithm is used to adjust the model’s parameters to
minimize the loss function. A commonly used algorithm is Gradient Descent.

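Epochs: the model goes through the training dataset multiple times (epochs) to learn. The
model’s weights are updated after each batch of examples, so full-batch gradient descent makes
one update per epoch while mini-batch training makes several.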
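How the loss function, gradient descent, and epochs fit together can be shown on a model far smaller than DistilBERT. The sketch below trains a logistic regression classifier with full-batch gradient descent on binary cross-entropy; the two toy features, the learning rate, and the epoch count are assumptions chosen for illustration and are not the settings proposed for this project.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cross_entropy(X, y, w, b):
    """Mean binary cross-entropy loss over the dataset."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, w, b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

def train(X, y, lr=0.5, epochs=200):
    """Full-batch gradient descent: one weight update per epoch."""
    w, b = [0.0] * len(X[0]), 0.0
    history = []
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi   # gradient of the loss w.r.t. the logit
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
        history.append(cross_entropy(X, y, w, b))
    return w, b, history

# Toy features per review: [count of positive words, count of negative words]
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]
w, b, history = train(X, y)
print(round(history[0], 3), round(history[-1], 3))
```

The loss recorded after the last epoch is lower than after the first, which is the behaviour the training loop is meant to produce. Fine-tuning a Transformer follows the same pattern, with mini-batches and an optimizer such as Adam in place of plain gradient descent.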

Hyperparameter Tuning
This involves adjusting the model’s hyperparameters to improve performance.
Common approaches include Grid Search, Random Search, and Bayesian optimization.
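Grid Search, the simplest of these approaches, can be sketched as an exhaustive loop over every combination of candidate values. The scoring function here is a hypothetical stand-in for "train the model with these settings and return its validation score", and the candidate values are illustrative assumptions.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_score(learning_rate, batch_size):
    """Hypothetical placeholder for a real train-and-validate run."""
    return 1.0 - abs(learning_rate - 3e-5) * 1e4 - abs(batch_size - 16) / 100

grid = {"learning_rate": [1e-5, 3e-5, 5e-5], "batch_size": [8, 16, 32]}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (nine runs here), which is why Random Search or Bayesian optimization is often preferred when each run is an expensive fine-tuning job.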

Model Evaluation –On Validation Set


After training, the model’s performance is evaluated on a validation set or
through cross-validation.

Handling of overfitting and Underfitting


Overfitting: this is when a model performs well on the training data but poorly on the
test/validation data. This means the model has “memorized” the training data instead of
learning general patterns.
Prevention: use regularization (L1, L2), cross-validation, early stopping for neural
networks, or pruning for decision trees.
Underfitting: this is when the model fails to capture the underlying patterns in the data,
resulting in poor performance on both the training and test data.
Prevention: increase model complexity (e.g. more layers in neural networks or deeper
trees) or add more informative features.
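Early stopping, one of the overfitting remedies listed above, can be sketched as a rule applied to the validation loss recorded after each epoch. The patience value and the loss figures below are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights should be kept.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Invented validation losses: improvement stalls after the third epoch (index 2).
val_losses = [0.69, 0.52, 0.41, 0.43, 0.47, 0.55]
print(early_stopping_epoch(val_losses))
```

In this example the validation loss rises after epoch index 2 while training would continue to fit the training data, so the weights from that epoch are the ones to keep.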

5. Model Evaluation
Once the model has been trained and its hyperparameters tuned, the final
model can be evaluated on the test dataset. The test dataset should not be seen in any part of
the training process, which makes it a good indicator of how well the model will perform on
real-world, unseen data.

Evaluation Metrics:

o Accuracy: Proportion of correctly classified instances


o Precision: Proportion of positive predictions that are actually positive
o Recall: Proportion of actual positive instances that are correctly identified
o F1-Score: Harmonic mean of precision and recall

This evaluation helps understand the extent to which the model can classify sentiments
accurately.
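The four metrics can be computed directly from the counts of true/false positives and negatives. The sketch below assumes labels encoded as 1 (positive) and 0 (negative) and uses invented predictions; scikit-learn's classification_report produces the same figures for real runs.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On an imbalanced dataset, accuracy alone can look good while recall for the minority class is poor, which is why precision, recall, and F1 are reported alongside it.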

6. Prediction/Inference
Once the model is trained, it is used to make predictions on new, unseen text
data. For each input text, the model predicts a sentiment label (e.g., positive, negative).

7. Post-Processing
After obtaining the sentiment predictions, post-processing may involve converting predictions
back into a human-readable format and
aggregating results into a single score, e.g. average sentiment.
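A minimal sketch of this post-processing step follows. The label encoding (0 = negative, 1 = positive) and the -1 to +1 averaging scale are assumptions made for illustration.

```python
# Assumed label encoding for the binary classifier: 0 = negative, 1 = positive.
ID_TO_LABEL = {0: "negative", 1: "positive"}

def to_labels(pred_ids):
    """Convert numeric class ids into human-readable sentiment labels."""
    return [ID_TO_LABEL[i] for i in pred_ids]

def average_sentiment(pred_ids):
    """Mean sentiment on a -1..+1 scale (-1 = all negative, +1 = all positive)."""
    if not pred_ids:
        return 0.0
    return sum(1 if i == 1 else -1 for i in pred_ids) / len(pred_ids)

preds = [1, 1, 0, 1]  # invented model outputs for four reviews
print(to_labels(preds))
print(average_sentiment(preds))
```

An average near +1 or -1 indicates broadly consistent sentiment for a product, while a value near 0 indicates mixed reviews.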

8. Visualization and Reporting


Visualize the results, e.g. with word clouds and confusion matrices, and generate reports to
communicate the findings.

9. Deployment
If the model performs well, we deploy it into production to start making predictions on
new real-time data. This involves saving the model, setting up APIs, and integrating it into
an application, service, or larger software system for real-time use.

10. Monitoring and maintenance


Continuously monitor the model’s performance and update it with new data to maintain
accuracy.
Requirements

Software
  Programming Language     Python 3.7 or higher
  Libraries                Transformers (DistilBERT), PyTorch, scikit-learn, nltk,
                           pandas, numpy, matplotlib, spacy, re, IPython
  Operating System         Windows, Linux
Hardware
  CPU                      RAM 8GB
  GPU                      Recommended for training
Development Environment
  Google Colab             Comes with free GPU
  Jupyter Notebooks
Dataset                    Product reviews.csv

Hardware Requirements
             Windows 7 Pro or higher       Ubuntu 15.04                  Mac OSX 10.10 Intel
Processor    Intel Core i5 or equivalent   Intel Core i5 or equivalent   Dual core Intel
Memory       16GB                          16GB                          16GB
Disk Space   1.5GB of free disk space      1.5GB of free disk space      1.5GB of free disk space
References
