Aca 21 Ram
May 8, 2024
Declaration
All sentences or passages quoted in this report from other people’s work have been specifically
acknowledged by clear cross-referencing to author, work and page(s). Any illustrations that
are not the work of the author of this report have been used with the explicit permission
of the originator and are specifically acknowledged. I understand that failure to do this
amounts to plagiarism and will be considered grounds for failure in this project and the
degree examination as a whole.
Abstract
Furthermore, the report discusses the challenges and potential biases inherent in sentiment
analysis and Twitter data. The findings of this study underscore the importance and utility
of sentiment analysis in the digital age, particularly in the context of political discourse and
election predictions. This project builds on the existing body of knowledge on sentiment
analysis, providing fresh perspectives on its applications and implications, especially in
understanding and forecasting public sentiment in political contexts.
Acknowledgements
I would like to express my deep appreciation to my supervisor, Dr. Mark Hepple, for
his invaluable guidance throughout my project. Dr. Hepple’s approachable manner and
insights not only sharpened my work but also made the journey enjoyable. His assurance
and pragmatic approach to solving problems have been tremendously helpful. Thank you,
sir, for being such an outstanding mentor and for all your support—I am immensely thankful.
I am grateful for the unwavering support of my family, who have blessed
me with the opportunity to pursue my passion for Software Engineering. I would also like
to express my appreciation to the University and the Department of Computer Science for
providing an environment that fosters growth and learning. Thank you all for your invaluable
support and encouragement.
Contents
1 Introduction 1
1.1 Project Overview and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Methodology and Expected Outcomes . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Overview of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Literature Survey 4
2.1 Project Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 What is Sentiment Analysis? . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Usage of sentiment analysis . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.3 Social Media Stream- X(Twitter) . . . . . . . . . . . . . . . . . . . . . 5
2.1.4 Applications in Election Analysis . . . . . . . . . . . . . . . . . . . . . 5
2.2 Data Collection for Twitter Sentiment Analysis . . . . . . . . . . . . . . . . . 6
2.2.1 Main Format of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Twitter-Specific Challenges . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Data Preprocessing for Twitter Sentiment Analysis . . . . . . . . . . . . . . . 7
2.3.1 Cleaning and Noise Reduction . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 Stemming/Lemmatization . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Feature Identification & Extraction . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.1 Relevant Features for Election Sentiment . . . . . . . . . . . . . . . . 8
2.4.2 Text Representation Techniques (TF-IDF, Word2Vec, BoW) . . . . . 9
2.4.3 Feature Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Supervised Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.2 Naive Bayes (NB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Unsupervised Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.1 Lexicon Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.2 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.3 t-SNE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Evaluation Metrics & Model Performance . . . . . . . . . . . . . . . . . . . . 17
2.7.1 Precision, Recall, and F1 Scores . . . . . . . . . . . . . . . . . . . . . 17
Appendices 57
A 57
A.1 Vector Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.2 Distance Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.3 Hardware and Software Specifications . . . . . . . . . . . . . . . . . . . . . . 57
A.4 Selection of Tools and Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.5 Dummy Classifier Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . 59
A.6 Full Report of all results from the Code . . . . . . . . . . . . . . . . . . . . . 59
A.7 Dummy Classifier Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 59
List of Figures
5.1 Metrics Graphed For Naive Bayes and SVM (Trials 1-6) . . . . . . . . . . . . 33
5.2 Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values) . . . . . . . . . . . 34
5.3 Metrics Graphed For Naive Bayes and SVM . . . . . . . . . . . . . . . . . . . 36
5.4 Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values) . . . . . . . . . . . 37
5.5 SVM Scores of various mixing methods . . . . . . . . . . . . . . . . . . . . . . 49
List of Tables
Chapter 1
Introduction
To contextualise the significance of the project, recent studies have underscored the
transformative impact of social media on the political landscape and election outcomes, such as
"Social Media and Elections" by Fujiwara et al. (14). Evidence from that study suggests that
platforms like Twitter have played a discernible role in shaping electoral preferences,
evidenced by a reduction in the Republican vote share in both the 2016 and 2020 presidential
elections. Sentiment analysis is therefore integral to understanding public opinion. In the
context of election prediction, it serves as a valuable tool to gauge public sentiment and
forecast potential election outcomes, as also confirmed by Guha (2020) (15).
Chapter 4 covers the practical aspects of the sentiment analysis system focusing on design,
implementation, and rigorous testing of the developed models. This chapter outlines the
system architecture and describes the specific methodologies employed for preprocessing,
feature extraction, and sentiment classification. It details the application of various NLP
techniques and machine learning algorithms to process and analyse Twitter data effectively.
Additionally, it provides insights into the coding practices, the use of libraries, and the
integration of different components into a cohesive system. Testing covers unit, integration
and manual tests to ensure model accuracy and robustness, culminating in a series of
evaluations that validate the system against the project’s requirements.
Chapter 6 summarises the study’s findings, highlighting the key contributions and
achievements of the sentiment analysis project. It reflects on the challenges encountered
during the project and how they were addressed. The chapter proposes future directions for
research based on the limitations identified during the testing and evaluation phases,
including suggestions for improving model accuracy, exploring additional data sources, and
integrating more sophisticated NLP techniques.
Chapter 2
Literature Survey
This chapter offers a thorough review of Sentiment Analysis (SA), covering different methods,
tasks, and approaches, with sections structured to follow the basic framework of a sentiment
analysis pipeline. The chapter delves into machine learning algorithms, shedding light on their
applications and the role of pre-trained models, expands on the role of Twitter in this
project, and gives insights into the tools and resources used, with supporting references.
Supporting this idea, Ebrahimi et al. (2017) (12) explored the challenges of SA, emphasising
the role of Twitter in understanding public sentiment during elections. Other researchers
have focused on specific elections, such as the 2020 US presidential election, conducting
large-scale SA of Twitter data to gain insights into political sentiments, for example Alvi
et al. (2023) (4). However, that work could have been improved by using a more comprehensive
dataset of voters and by analysing candidate-related tweets with better-targeted keywords.
Integrating multiple data sources and carefully considering other relevant factors, such as
traditional polling data, campaign strategies and socioeconomic factors, would also have strengthened the study.
[Table: examples of tweet dataset features, including Time-Stamp and Likes]
The rows in the table above represent features: measurable properties or characteristics of
the dataset used as input for SA. In the context of SA, features can include the text of the
tweet, metadata such as the number of retweets and likes, and other contextual information.
These features are used to train machine learning models to recognise patterns and make
predictions about the sentiment expressed in the tweets.
2.3.2 Tokenization
Tokenization is a fundamental step in NLP: the goal is to break down text into
smaller units, typically words or phrases, called tokens. Tokenization converts
continuous text into a structured format, making it more suitable for
SA. These tokens are the building blocks of natural language. They can be words,
phrases, or even individual characters, depending on the level of granularity desired.
For example, take the sentence: “I love harry potter & the order of the phoenix.”
Tokenized: [“I”, “love”, “harry”, “potter”, “&”, “the”, “order”, “of”, “the”, “phoenix”, “.”]
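As an illustrative sketch (not necessarily the project's exact implementation), spaCy's blank English pipeline produces this kind of token list:

```python
import spacy

# A blank English pipeline gives rule-based tokenization without needing a trained model.
nlp = spacy.blank("en")

doc = nlp("I love harry potter & the order of the phoenix.")
tokens = [token.text for token in doc]
print(tokens)
# ['I', 'love', 'harry', 'potter', '&', 'the', 'order', 'of', 'the', 'phoenix', '.']
```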
2.3.3 Stemming/Lemmatization
Stemming and lemmatization are both important techniques in NLP to reduce words to
their base form. Thus far, unnecessary and unimportant information has been removed; now
focus shifts to condensing essential information for further analysis.
Stemming reduces words to their base form by removing suffixes. Although the
stem of a word may not always be a valid root word, this process allows different
variations of a word to be represented by a common stem. It is a heuristic process:
prefixes or suffixes are removed using rules that strip common affixes
without understanding the context of the word. For example, “eating” is stemmed to “eat”.
Both methods share similarities in goal, to reduce dimensionality of the data ensuring that
variations of words are treated consistently.
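A minimal sketch of both techniques using NLTK (the WordNet data must be downloaded once for the lemmatizer; the word list is illustrative):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-off download needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["eating", "eats", "eaten"]
print([stemmer.stem(w) for w in words])                   # ['eat', 'eat', 'eaten'] - heuristic suffix stripping
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # ['eat', 'eat', 'eat']  - dictionary-based base forms
```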
• Hashtags are a crucial component in SA, as they enhance engagement and help
categorise sentiments expressed in tweets. The co-occurrence of these electoral
hashtags may provide additional context, thus improving comprehension of the intrinsic
sentiment. Leveraging this, the sentiment of each tweet can be classified more accurately.
• Candidate Mentions are integral features for election sentiment analysis as well. By
identifying mentions of specific candidates, the sentiment expressed towards each candidate
becomes easier to gauge. This can provide valuable insights into the public perception of
candidates during the election period, as per Bansal et al., 2018 (7).
• Demographic Features such as the age, gender, and occupation of users are also
relevant in election SA. These features can provide insights into the sentiment of
different demographic groups, which can be crucial in understanding the diverse
perspectives and sentiments within the electorate (Alvi et al., 2023) (4).
• Tweet Metadata such as the number of retweets, likes, and replies a tweet receives is
also important for sentiment analysis. These features provide insight into the reach and
impact of a tweet, which can be indicative of the popularity and influence of the sentiments
expressed in the tweets (Alvi et al., 2023) (4).
• Bag of Words (BoW): the simplest and most commonly used technique.
It represents text as an unordered bag of words, disregarding grammar and word order
but keeping a tally of each word, essentially creating a vocabulary of all the unique
words in a dataset. Each document is then represented as a vector of numbers,
where each number is the count of a particular word in that document.
For example, given the two sentences “Fish and chips are the best” and “Ice Cream
and Fries are the best”, we can create a vocabulary of unique words from
all the sentences: [“Fish”, “and”, “chips”, “are”, “the”, “best”, “Ice”, “Cream”,
“Fries”]. Each sentence can then be represented as a vector of word counts, as sketched below.
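A minimal sketch of these count vectors with scikit-learn's CountVectorizer (note that it lowercases tokens by default, so the vocabulary appears in lower case and alphabetical order):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Fish and chips are the best", "Ice Cream and Fries are the best"]

vectorizer = CountVectorizer()              # builds the vocabulary of unique words
bow = vectorizer.fit_transform(sentences)   # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['and' 'are' 'best' 'chips' 'cream' 'fish' 'fries' 'ice' 'the']
print(bow.toarray())
# [[1 1 1 1 0 1 0 0 1]
#  [1 1 1 0 1 0 1 1 1]]
```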
\[ \mathrm{IDF}(t) = \log\frac{\text{Total number of documents in the collection}}{\text{Number of documents in the collection containing term } t} \tag{2.2} \]
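In practice a library such as scikit-learn computes TF-IDF directly; note that its default IDF is a smoothed variant of Equation 2.2, so the exact weights differ slightly from the formula above. A minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["Fish and chips are the best", "Ice Cream and Fries are the best"]

tfidf = TfidfVectorizer()           # term frequency * smoothed inverse document frequency
X = tfidf.fit_transform(tweets)     # sparse TF-IDF matrix, rows L2-normalised by default

# Terms appearing in every document (e.g. 'and', 'the') receive the lowest IDF weight.
for term, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{term:>6}  idf = {idf:.3f}")
```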
There are two primary methods for constructing vector representations in Word2Vec:
Continuous Bag of Words (CBOW) and Skip-Gram. CBOW predicts the target
word from its surrounding context, while Skip-Gram does the opposite, predicting the
surrounding words from the target word, as outlined by Onishi et al. (2020) (24).
CBOW takes the context words as input and predicts the target word. This process
enables CBOW to understand the meaning of a word by considering the words that
typically surround it. Skip-Gram, in contrast, operates by predicting the
surrounding words from a given target word, aiming to understand the context in which
a word is likely to appear.
CBOW and Skip-Gram are distinct approaches, yet both belong to Word2Vec. Together, they capture
both local and global statistics of words, producing vector representations that
comprehensively reflect the semantic and syntactic intricacies of the language.
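A minimal Gensim sketch of the two training modes, assuming a list of tokenised tweets (the corpus and hyper-parameter values here are illustrative only):

```python
from gensim.models import Word2Vec

# Each tweet is a list of tokens produced by the preprocessing stage.
tokenised_tweets = [
    ["the", "election", "results", "look", "promising"],
    ["voters", "discuss", "the", "election", "on", "twitter"],
]

cbow = Word2Vec(tokenised_tweets, vector_size=50, window=3, min_count=1, sg=0)       # sg=0 -> CBOW
skipgram = Word2Vec(tokenised_tweets, vector_size=50, window=3, min_count=1, sg=1)   # sg=1 -> Skip-Gram

print(cbow.wv["election"].shape)                      # (50,) dense vector for a word
print(skipgram.wv.most_similar("election", topn=3))   # nearest neighbours in the embedding space
```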
\[ \chi^2 = \frac{(A \times D - B \times C)^2 \times (A + B + C + D)}{(A + B)(C + D)(A + C)(B + D)} \tag{2.4} \]
Here, A, B, C, and D are the counts in each cell of the contingency table. The chi-square
statistic is used to test the hypothesis that the occurrence of a specific word is independent
of sentiment. Higher values of \(\chi^2\) suggest a stronger association between the word and
sentiment.
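A hedged sketch of chi-square feature selection with scikit-learn, applied to word counts and sentiment labels (the tiny corpus and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

tweets = ["great policies, love this candidate",
          "terrible debate, awful candidate",
          "polling stations open at seven"]
labels = ["positive", "negative", "neutral"]

vec = CountVectorizer()
counts = vec.fit_transform(tweets)

selector = SelectKBest(chi2, k=4).fit(counts, labels)       # keep the 4 terms most associated with sentiment
print(vec.get_feature_names_out()[selector.get_support()])  # the selected, sentiment-bearing terms
```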
Another method is the Least Absolute Shrinkage and Selection Operator (LASSO).
LASSO is a regularisation technique* that can be seamlessly integrated into
model building. It introduces a penalty term based on the absolute values of the feature
coefficients, encouraging sparsity in the feature space. In SA, LASSO helps identify and
retain the most impactful features while shrinking others to zero. During the training of
a SA model, LASSO simultaneously performs feature selection and model fitting*. This
ensures that the resulting model not only predicts sentiment accurately but also focuses on
the most influential features, enhancing interpretability (Muthukrishnan et al., 2016) (23).
\[ \hat{\beta} = \underset{\beta}{\arg\min}\; \frac{1}{2N}\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j| \tag{2.5} \]
• The first term, \(\frac{1}{2N}\sum_{i=1}^{N}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2\), represents the residual sum of squares,
highlighting the difference between the observed and predicted values.
• The second term, \(\lambda \sum_{j=1}^{p}|\beta_j|\), is the L1 penalty. This term encourages sparsity in the
feature space by shrinking less informative coefficients towards zero.
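A minimal sketch of LASSO over TF-IDF features, assuming tweets have been assigned numeric sentiment scores (the scores here are invented for illustration); the non-zero coefficients indicate the retained features:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso

tweets = ["love the candidate", "hate the debate", "the rally was fine",
          "awful turnout today", "great speech tonight"]
scores = np.array([0.9, -0.8, 0.1, -0.7, 0.8])      # illustrative sentiment scores in [-1, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(tweets)

lasso = Lasso(alpha=0.05).fit(X, scores)            # L1 penalty shrinks uninformative coefficients to zero
kept = [(term, round(coef, 3))
        for term, coef in zip(vec.get_feature_names_out(), lasso.coef_)
        if coef != 0]
print(kept)                                         # only the most influential terms keep a non-zero weight
```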
A regression task trains models to predict a continuous outcome (sentiment score) based
on one or more variables (Tweet features). So, the results would be a continuous value.
A classification task trains models to categorise text data into predefined sentiment
classes, allowing for automated labelling of sentiments in unseen textual content.
This project will compare the performance of all implemented models. Viewing sentiment
largely as a regression problem therefore gives a finer-grained scale of understanding than
a classification problem with only three or four categories in which to place sentiment.
Random Forests are an excellent choice due to their ability to handle high-dimensional and
sparse datasets, which are common in text analysis. They can also deal with missing values,
outliers, and imbalanced classes, which may affect the performance of other algorithms.
Moreover, Random Forests are straightforward to implement and tune, as they have few
hyper-parameters and require little preprocessing or scaling of the data, as outlined by
Bahrawi (2019) (6).
Random Forests work by selecting random samples from a given dataset and constructing a
decision tree for every sample. They then perform a vote for each predicted result, and the
prediction with the most votes is selected as the final prediction (Yu et al., 2020) (32). This
approach helps to improve the accuracy and robustness of the model. For example, a study
by Karthika et al. (18) used the Random Forest algorithm for SA of social media data and
achieved an accuracy rate of close to 85%. Bahrawi (6), using data sourced from Twitter
with a Random Forest approach, likewise achieved an accuracy of around 75%.
\[ P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)} \tag{2.6} \]
where c is a sentiment class and d is a document (here, a tweet).
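A minimal sketch of a Multinomial Naive Bayes classifier over word counts (the training tweets and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_tweets = ["love this candidate", "great policies tonight",
                "awful debate performance", "terrible campaign so far"]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_tweets, train_labels)

print(model.predict(["what a great candidate"]))        # predicted class, argmax of P(c|d)
print(model.predict_proba(["what a great candidate"]))  # posterior P(c|d) for each class
```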
A CNN has multiple layers of convolutions, interspersed with non-linear activation functions.
Each convolution layer applies multiple filters to the input, and each filter is responsible for
extracting a specific feature from the input data. A typical CNN architecture consists of
three types of layers: convolutional layers, pooling layers, and fully connected layers.
The Convolutional Layer is the core segment of a CNN. The layer’s parameters consist
of a set of learnable filters or kernels, which have a small receptive field but extend through
the full depth of the input volume. During the forward pass, each filter is convolved across
the width and height of the input volume, computing the dot product between the entries of
the filter and the input, producing a 2-dimensional activation map of that filter. The output
of the convolutional layer is a stack of these activation maps, one for each filter.
The Pooling Layer is a down-sampling layer that follows the convolutional layer. It
reduces the dimensions of each feature map while retaining the most important information.
There are several types of pooling operations, but the most common one is max pooling,
which extracts the maximum value from each segment of the feature map.
Lastly, the Fully Connected Layer. After several convolutional and pooling layers, the
high-level reasoning in the neural network happens in the fully connected layer. Neurons
in this layer have connections to all activations (outputs) in the previous layer, as seen
in regular (non-convolutional) Artificial Neural Networks. Their activations can thus be
computed as an affine transformation, with matrix multiplication followed by a bias offset
(Janke et al, 2019) (17).
• 2. Assignment: Assign each data point, based on its distance from the randomly
selected points (centroids), to the nearest centroid, forming the initial clusters.
• 3. Update: Compute the mean of all the points in each cluster and set it as the new centroid.
• 4. Repeat: Repeat steps 2 and 3, reassigning each data point to the new closest
centroid, until the centroids no longer change significantly, meaning they
have converged (a minimal sketch of these steps follows below).
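A minimal sketch of these steps using scikit-learn's KMeans over TF-IDF vectors (the corpus and number of clusters are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = ["love the candidate", "great candidate tonight",
          "awful debate", "terrible debate performance"]

X = TfidfVectorizer().fit_transform(tweets)

# n_clusters sets k; fit() runs initialisation, assignment and update steps until convergence.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)                  # cluster id assigned to each tweet
print(kmeans.cluster_centers_.shape)   # one centroid per cluster in the TF-IDF space
```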
2.6.3 t-SNE
t-Distributed Stochastic Neighbour Embedding (t-SNE) is useful in SA for visualising
the distribution of text data in a lower-dimensional space. By projecting high-dimensional
features of textual data into a two-dimensional space, t-SNE aids in understanding the
relationships between different sentiment classes and visualising complex structures in SA
datasets. The implementation steps for t-SNE include:
• Recall focuses on capturing all relevant positive or negative instances. This metric is
particularly crucial when the objective is to capture the complete spectrum of relevant
sentiments expressed in tweets. High recall scores indicate the model’s effectiveness in
minimising false negatives, ensuring that it does not overlook significant tweets.
• F1 combines precision and recall into a single metric, providing a balanced evaluation
of the model’s overall performance. Computed as the harmonic mean of precision and
recall, it considers both false positives and false negatives, ensuring that no single
aspect dominates and that precision and recall are weighted equally. The F1-score offers
a comprehensive assessment of the model’s accuracy, crucial in SA where an imbalance
between precision and recall can misrepresent the true sentiment, as supported by
Margherita et al., 2020 (22).
\[ F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{2.7} \]
The original sample is randomly partitioned into K equal-sized sub-samples or folds. Then, a
single sub-sample is retained as the validation data for testing the model, and the remaining
K-1 sub-samples are used as training data. This cross-validation process is then repeated K
times, with each of the K sub-samples used once as the validation data. The K results from
the folds can then be averaged to produce a single estimation. This estimation provides a
better measure of model performance, as it reduces the variance of the performance estimate
and allows more data to be used for training.
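A minimal sketch of K-fold cross-validation with scikit-learn (K = 5; the pipeline and toy data are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score

tweets = ["love it", "great stuff", "awful result", "terrible idea", "so good",
          "really bad", "fantastic news", "worst ever", "brilliant move", "dreadful day"]
labels = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = KFold(n_splits=5, shuffle=True, random_state=42)          # K equal-sized folds

scores = cross_val_score(model, tweets, labels, cv=cv, scoring="f1_macro")
print(scores)          # one score per held-out fold
print(scores.mean())   # averaged estimate of model performance
```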
The Silhouette Score metric is used to evaluate the quality of clusters specifically in
unsupervised learning. It provides a quantitative measure of how well each object fits within
its assigned cluster and how distinct the clusters are from each other. The Silhouette Score
ranges from -1 to 1, where a high value indicates that the object is well matched to its own
cluster and poorly matched to neighbouring clusters. If most objects have a high value, then
the clustering configuration is appropriate. If many points have a low or negative value,
then the clustering configuration may have too many or too few clusters as per Shahapure
et al, 2020 (27).
The Silhouette Score is calculated using distance metrics such as the Euclidean distance or
Manhattan distance. These distance metrics are used to calculate the average intra-cluster
distance (a) and the average inter-cluster distance (b) for each data point, which are then
used to compute the Silhouette Score. The Score can then be used to determine the
natural number of clusters within a dataset. The highest Silhouette Score indicates the
optimal number of clusters, providing valuable insights into the structure of the data and the
effectiveness of the clustering algorithm (A.2).
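A minimal sketch of using the Silhouette Score to pick the number of clusters (illustrative data; Euclidean distance is the default metric):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

tweets = ["love the candidate", "great candidate", "adore this candidate",
          "awful debate", "terrible debate", "dreadful debate tonight"]
X = TfidfVectorizer().fit_transform(tweets)

for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)   # mean of (b - a) / max(a, b) over all points
    print(k, round(score, 3))             # the highest score suggests the natural number of clusters
```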
As a recap:
• F1 Score: The harmonic mean of Precision & Recall; it finds the balance between
precision and recall.
The confusion matrix can be used to evaluate the quality of clusters. However, it’s important
to note that the confusion matrix is primarily designed for classification problems, and its
use in clustering requires some adaptation. In clustering, there is no association provided
by the clustering algorithm between the class labels and the predicted cluster labels. In
a clustering problem, the rows of the confusion matrix represent the actual labels, and
the columns represent the new cluster names (i.e., cluster-1, cluster-2, etc.). The diagonal
elements of the confusion matrix represent the instances where the actual label matches the
predicted cluster, and the off-diagonal elements represent the instances where the actual
label does not match the predicted cluster as described in (13).
Note on the State-of-the-Art in SA: This project focuses mainly on classical machine learning
models; however, sophisticated deep learning models, especially those based on neural
architectures like BERT and RoBERTa, have been outstanding in enhancing the interpretation
of sentiments within large volumes of data. Deep learning approaches excel in capturing
subtlety in data, which is key for understanding complex human emotions and opinions in
politics (Zhang et al., 2018) (34).
Chapter 3
3.2 Datasets
In this project, various models will be trained and tested with labelled and unlabelled
datasets, as outlined in Chapter 2. The design and further usage of the dataset will be
discussed in Chapter 4.
3.3 Methodology
The methodology for the project follows the sections of Chapter 2 sequentially. Here are the
steps, simplified:
1. Data collection: Tweets can be collected either by web scraping or by using publicly
available datasets; here, the latter is chosen. A golden dataset, together with statistics
from a dummy classifier*, will be used to compare the results for model performance.
Datasets will be collected from Kaggle, the California State University research website,
the ACL Anthology (SemEval) and GitHub.
enhancing data uniformity. Finally, Stopword Removal eliminates common words that
have little analytical value, effectively managing the linguistic complexities of tweets.
4. Sentiment Classification & Evaluation: All of the above steps will now be
put to use. Chapter 2, Section 2.5, explores supervised learning models that will
be implemented for their strength in sentiment analysis over high-dimensional and sparse
datasets. Section 2.6 describes unsupervised learning models that will be implemented
for their ability to uncover patterns and group data based on features. The performance
of these models will be compared using relevant metrics to determine the most effective
approach for sentiment analysis in the context of social media and politics. Key models
that will be developed are lexicon implementations, SVM, Naive Bayes and Random
Forests.
• Experiments 3 and 4: Linear models like Naive Bayes, Logistic Regression, and
Linear SVM are known for their efficiency and effectiveness in high-dimensional spaces
where relationships between features might be linearly separable. In contrast, non-linear
models like Random Forest, RBF-SVM, and Decision Trees are crucial when dealing
with the nuanced and multi-layered nature of political discourse, where the sentiment
might not be clearly expressed. Therefore, this selection of classifiers is chosen and
compared to identify the most appropriate model for analysing sentiment in political
contexts on social media platforms (experiments 3 and 4 each comprise 4 trials).
4.1 Overview
Following a modular design philosophy, the implementation was divided into distinct
modules matching the principles of the standard SA pipeline. This pipeline was chosen
for its demonstrated robustness and reliability, thus offering a stable framework for
systematically integrating experimental enhancements.
The feature extraction methods chosen above were designed with a forward-thinking approach,
prioritising the integration of diverse text representation techniques. The design and
implementation cover the complexities of combining various preprocessing steps, the
mathematics behind the calculations, scaling, and integration, accommodating a wide range
of technique combinations without sacrificing computational efficiency.
• Naive Bayes Classifiers: Selected for their proven effectiveness in text classification,
the Multinomial and Gaussian Naive Bayes classifiers are particularly adept at
processing high-dimensional data efficiently. Their performance is tweaked by
adjusting TF-IDF parameters, enhancing their ability to detect contextually significant
terms.
For more details on the tools and libraries used, decisions behind using them,
refer to appendix A.4
4.5 Testing
Systematic testing ensures that each component operates correctly and yields accurate
results. Firstly, unit tests were used for individual modules such as the DataProcessor,
FeatureExtractor, and ClassifierManager. These tests include checks for the functionality
of text preprocessing, feature extraction accuracy, and the robustness of each classifier.
For instance, the DataProcessor class is tested to verify that text normalization, stop-word
expansion, and negation handling are performed correctly. Using test cases with manually
predefined outputs, each function was validated to ensure correctness.
Integration testing then follows, where the combined operation of multiple units is tested
as a whole in the Retrieve class. This ensures that data flows correctly between processes
and that the system behaves as expected when all components interact. For example, the
output of the FeatureExtractor should seamlessly integrate into the classifiers within the
ClassifierManager without data mismatch or loss. This exemplifies end-to-end tests that
simulate the processing of raw input data through to sentiment classification, validating the
system’s ability to handle real-world data under controlled test conditions*(Refer to A.2).
Finally, regression testing is performed whenever changes are made to the code-base,
ensuring that new code does not adversely affect existing functionalities. By adhering to
these meticulous testing protocols, the project aims to deliver a robust and reliable sentiment
analysis system that consistently produces valid and precise results.
Note: The robustness and consistency of the models were further validated by conducting
multiple trials. Each was run 5 times and the results averaged, to ensure the reliability of
the results presented in Chapter 5.
Chapter 5
This chapter explores the results of the various tests done to methodically improve different
parts of the aforementioned sentiment pipeline. Each sectioned experiment will detail its
purpose and the background supporting the reasons driving the development.
5.1.2 Implementation
The first iteration of the implementation was straightforward, focusing on removing and
condensing words. However, as the project evolved through further research and discussion,
the preservation of key political, negative and/or colloquial terms was understood to be
crucial. To this end, the system was enhanced to not only preserve these terms but also
apply dynamic stop-word management, thus avoiding the loss of contextually significant
terms. Additionally, negation handling was first applied only to the negation term itself
(”Not”, ”Neither”, ”No”) and then refined to mark dependencies and modify neighbouring verbs,
adverbs and adjectives to reflect true sentiment more accurately (a simplified sketch of the first iteration is given below).
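The dependency-based refinement is not reproduced here; the sketch below shows only the simpler first iteration described above, marking tokens after a negation word until the next punctuation (the word lists and scope rule are illustrative):

```python
NEGATION_WORDS = {"not", "no", "never", "neither", "nor"}
PUNCTUATION = {".", ",", "!", "?", ";"}

def mark_negation(tokens):
    """Prefix tokens inside a negation scope with NEG_ so classifiers treat them as distinct features."""
    negated, out = False, []
    for tok in tokens:
        if tok.lower() in NEGATION_WORDS:
            negated = True
            out.append(tok)
        elif tok in PUNCTUATION:
            negated = False   # punctuation closes the negation scope
            out.append(tok)
        else:
            out.append("NEG_" + tok if negated else tok)
    return out

print(mark_negation(["I", "do", "not", "like", "this", "policy", ".", "It", "is", "fine"]))
# ['I', 'do', 'not', 'NEG_like', 'NEG_this', 'NEG_policy', '.', 'It', 'is', 'fine']
```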
Figure 5.1: Metrics Graphed For Naive Bayes and SVM (Trials 1-6)
• Trial 4 to Trial 5: Stop-word removal has now been scaled back and colloquial
expansion is enabled in Trial 5. This gave a substantial recovery in the performance
metrics, showing that expanding contractions to their full form provides more clarity
and potentially more features for the classifier to use.
• Trial 5 to Trial 6: Trial 6 shows the best performance so far, with correctly
implemented methods in place. The final addition of negation handling significantly
affects sentiment analysis, as it changes the sentiment polarity of phrases. The results
support that it was implemented correctly and is working well, showing that recognising
negations allows the models to better understand the context and sentiment of the
comments, leading to more accurate classifications.
Figure 5.2: Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values)
Moving on to focusing on the F1-Scores and the preprocessing time taken for each trial:
• Initially, from Trials 1-3, the preprocessing time fluctuated, starting at 26.11 seconds
and peaking at 29.37 seconds. The introduction and expansion of the stop-word list
in Trial 3, and other fairly intensive methods, cost on average 3 seconds more. During
these phases, the F1-Scores for both SVM and NB remain relatively similar, with NB
experiencing a slight increase from 0.516 to 0.519, then a minor drop to 0.513. SVM’s
F1-Score shows a similar trend. These modest changes suggest that the addition of
special character removal and dynamic stopwords had a marginal but not substantial
impact on the classifiers’ performance.
• In Trial 4, where the stop-word list was at level 2 (Table 5.1), the F1-Scores plummeted
to 0.393 for NB and 0.411 for SVM despite the preprocessing time decreasing slightly to
25.85 s. This indicates that aggressively removing stopwords can degrade the
classifiers’ performance, likely by eliminating contextually important words.
The six trials in Experiment 1 identified the best combination of preprocessing steps
(summarised at the end) to use for the rest of the implementation. This combination is
supported by strong metrics, as discussed earlier. Results from both NB and SVM further
suggest that this combination would give a similar positive progression for other classifiers.
5.2.2 Implementation
Feature extraction trials consist of using Bag of Words (BoW), TF-IDF, Word2Vec
and Part-of-Speech (POS) tagging (table on the next page).
• POS tagging and Word2Vec are integrated into the preprocessing steps. In the
DataProcessor class, the methods pos_tagger and train_word2vec are implemented
to tag words with their respective tags using the spaCy and Gensim libraries. The
vectors or the counts of POS tags are then potentially included in the feature set by the
combine_features method of the FeatureExtractor class (sparse matrices stacked
horizontally, as sketched below).
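A hedged sketch of how heterogeneous features can be stacked horizontally into a single sparse matrix, as the combine_features step does conceptually (the variable names and POS-count values are illustrative, not the project's exact code):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["the candidate gave a great speech", "voters were not impressed"]

tfidf_matrix = TfidfVectorizer().fit_transform(tweets)   # sparse TF-IDF features, one row per tweet

# Illustrative extra features per tweet, e.g. counts of NOUN, VERB and ADJ tags.
pos_counts = csr_matrix(np.array([[2, 1, 1],
                                  [1, 1, 1]]))

combined = hstack([tfidf_matrix, pos_counts]).tocsr()     # feature blocks placed side by side
print(combined.shape)                                     # (2, n_tfidf_terms + 3)
```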
• In Trial 1, using BoW, both NB and SVM show similar accuracy levels, with NB at
58.9% and SVM at 58.6%. This suggests that BoW provides a solid baseline, capturing
the frequency of terms effectively for both models, and is a good feature to begin with.
• Trial 4 adds Word2Vec with TF-IDF and POS-tagging, which results in a slight
dip in SVM’s performance but a larger decrease in NB’s metrics especially accuracy
dropping to 37.6%. This massive drop for NB could indicate that the dense, continuous
vector space representations from Word2Vec might be conflicting with NB’s expectation
of discrete feature representation, while SVM still manages to handle the increased
complexity of the feature space.
Figure 5.4: Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values)
Comparing the F1-Scores, we see a steady increase for SVM from 0.579 to 0.6 through Trials
1 to 4, which shows a consistent improvement in the balance between precision and recall,
due to SVM’s effectiveness in high-dimensional spaces and complex feature interactions. On
the other hand, NB’s F1-Score decreases after Trial 1, hitting a low of 0.373 in Trial 4, which
aligns with the changes in accuracy, precision, and recall discussed above.
Time fluctuates across the trials, taking the longest in Trial 1 at 66.64 seconds and the shortest
in Trial 2 at 48.98 seconds. When Word2Vec is activated in Trial 4, the time increases to 61.92
seconds, since producing word embeddings is a computationally intensive process.
This additional effort, however, did not translate to improved performance for NB, and while
SVM maintains high F1-Scores, the minimal changes suggest diminishing returns for the
added complexity in feature extraction.
Overall, the integration of more sophisticated feature extraction techniques appears to
benefit SVM, which is well-equipped to handle high-dimensional and dense feature spaces,
while NB seems to be better suited to simpler, less dense representations.
Note Before Experiments 3 & 4: The datasets used are identical, and details of
classifier categorisation for the following experiments are given in Section 3.4.1.
Dummy Classifier Evaluation
Metric | Dev Set | Test Set
Accuracy | 0.349 | 0.375
Precision | 0.116 | 0.125
Recall | 0.333 | 0.333
F1-score | 0.173 | 0.182
Despite its notable strengths, the NB classifier reveals areas for optimisation,
particularly in distinguishing positive sentiments without misclassifying them as
neutral. Although NB performs far better than the baseline, its macro-average
F1-scores, while consistently better than the dummy classifier’s, signal the need for
fine-tuning to enhance positive sentiment detection without compromising the
accurate classification of neutral or negative sentiments.
The precision-recall curve reveals that the Naive Bayes model performs best
in identifying positive sentiments, with an Average Precision (AP) of 0.70,
followed by neutral sentiments (AP=0.58) and negative sentiments (AP=0.56).
As recall increases, precision tends to decrease, reflecting the inherent trade-off
between these metrics. The learning curve complements this analysis by
showing that the model’s performance improves significantly as the training data
size increases, up to around 2000 examples. Beyond this point, the
testing and cross-validation scores plateau, suggesting that adding more data
may not substantially enhance the model’s capabilities. Notably, the small gap
between the testing and cross-validation scores indicates that the Naive Bayes
model is not over-fitting the data, which is a desirable trait. However, the
precision-recall curve highlights potential areas for improvement, particularly
in recognising negative and neutral sentiments more accurately. Combining
these insights, efforts could be directed toward feature engineering, exploring
alternative models, or getting more diverse training data to enhance the model’s
overall performance across all sentiment classes.
• Trial 3, Linear SVM: The SVM classifier performed the best so far, as evidenced by
its results. On the dev set, it achieves an accuracy of 59.5%, precision of 63.9%, and
recall of 59.7%, resulting in an F1 score of 60.2%. Comparatively, on the test set,
the accuracy slightly increases to 62.1%, with a similar precision of 64.7% and a
modest improvement in recall to 59.9%, culminating in an F1 score of 60.4%. This
improvement suggests that SVM’s robustness to variations in data, particularly
with a higher number of test examples, may have contributed to a slightly
better generalisation on unseen data. The performance suggests that while
SVM is relatively effective at differentiating between classes (especially positive
sentiments), it struggles with false positives and negatives, particularly for
the neutral and negative classes, which could be due to overlapping sentiment
features that are challenging to linearly separate.
The learning curve of the SVM shows strong performance consistency: the testing
score plateaus around 0.9, suggesting that adding more test examples does not
significantly change the classifier’s ability to generalise. This is contrasted by the
cross-validation score’s solid increase, indicating that the classifier benefits from
more training data, reducing over-fitting and enhancing its predictive accuracy
on unseen data.
• Trial 1, rbf SVM: SVM-RBF’s accuracy is 50.8% on the dev set and slightly
lower at 49% on the test set. This modest performance highlights potential
issues with either the model’s ability to generalise or perhaps its suitability to
the dataset’s characteristics. Notably, the classifier performs well in identifying
the neutral class but struggles significantly with the negative class, indicating a
potential bias towards classes with more data or more distinct feature sets. The
precision for the negative class on the development set stands at 72%, which
plummets to 53% on the test set, showing a loss in the model’s confidence when
generalised to new data. This drop could be due to the model over-fitting to
the negative examples in the training data or the variability within the negative
class that isn’t captured fully by the model. The model fails to capture most of
the actual negative sentiments, capturing only 15% of the negative class on the test
dataset and likely missing subtler cues that define negative sentiment.
The Precision-Recall curve suggests that while the classifier can maintain a
reasonable precision rate as recall increases (especially for the positive class),
the trade-off becomes stark as the curve steepens, particularly for the negative
class. This steep drop-off indicates that achieving higher recall substantially
compromises precision, a typical sign of a model struggling with class imbalance
or lacking robust features to differentiate classes effectively.
• Trial 3, Random Forest (RF): RF, unlike the other models, has precision reaching
as high as 85% on the development set, indicating its strong capability to correctly
identify negativity without many false positives. However, the recall for the
same category is considerably lower at 49%, suggesting that while the classifier
is precise, it fails to capture a significant portion of the actual negative cases.
This highlights a trade-off between precision and recall, with room for improvement
needed in recall performance. The model performs best with neutral sentiments
on the development set, achieving a precision of 46% and a recall of 80%, which
results in the highest F1-score of 59% among the three sentiment classes. RF is
thus particularly adept at identifying the broader spread of neutral sentiments,
possibly due to its ensemble nature allowing it to generalise better across more
varied but less extreme sentiment expressions.
The Learning curve of RF shows a high plateau for testing scores, remaining
around 90%, while the cross-validation score increases with more test examples.
This suggests that the model could potentially improve with more training data,
as indicated by the rising green line. The stability of the testing score (red line)
at high levels demonstrates the model’s robustness and its ability to maintain
performance despite increased complexity or data size.
The combined impact of experiments 3 and 4 shows that the choice between
linear and non-linear classifiers should be guided by the nature of the data, the
complexity of the decision boundaries, and the specific performance metrics that are
prioritised for a given NLP task. For example, some classifiers may achieve high
precision, but they often do so at the expense of recall, and vice versa (this trade-off
was particularly evident in models like decision trees and SVM with RBF kernels).
Leveraging ensemble techniques such as boosting and bagging can help improve the
performance of decision trees, and potentially other classifiers, by reducing variance
and bias. These insights facilitate a more informed model selection process.
5.5.2 Implementation
• Combine Mode: combines the training and development data into a single
dataset. It then splits this combined dataset into new training and development
sets using train_test_split (a sketch of all three modes follows this list). The test
data is kept separate; 90% of the combined dataset is used for the new training set
and 10% for the new development set.
• Mixing Mode: combines the development and test data into a single dataset.
This combined dataset is then split into new development and test sets using
train_test_split; 50% of the combined dataset is used for the new development
set and the remaining 50% for the new test set.
• Pooling Mode: combines all the data (training, development, and test) into a
single dataset. It then splits this combined dataset into new training and test
sets using train_test_split (80% of the combined dataset is used for the new
training set and 20% for the new test set).
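A hedged sketch of the three modes using pandas and scikit-learn's train_test_split (the DataFrame names and function structure are illustrative; the split ratios follow the description above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# train_df, dev_df and test_df are assumed to be DataFrames of labelled tweets.

def combine_mode(train_df, dev_df, test_df):
    merged = pd.concat([train_df, dev_df], ignore_index=True)
    new_train, new_dev = train_test_split(merged, test_size=0.10, random_state=42)
    return new_train, new_dev, test_df          # test data kept separate

def mixing_mode(train_df, dev_df, test_df):
    merged = pd.concat([dev_df, test_df], ignore_index=True)
    new_dev, new_test = train_test_split(merged, test_size=0.50, random_state=42)
    return train_df, new_dev, new_test

def pooling_mode(train_df, dev_df, test_df):
    merged = pd.concat([train_df, dev_df, test_df], ignore_index=True)
    new_train, new_test = train_test_split(merged, test_size=0.20, random_state=42)
    return new_train, new_test
```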
• Combine Method: has high recall for the negative sentiment class (0.89),
indicating that the model was highly effective in identifying negativity, but at
the cost of precision (0.39), reflecting a high rate of false positives. This is likely
due to the training set being overly representative of negative examples, which is
supported by a relatively lower performance in identifying neutral and positive
sentiments, as indicated by the precision and recall values for these classes (as per
A.6). The overall accuracy of 49.3% and the extended processing time of 3163.67
seconds further suggest inefficiencies, possibly due to model complexity or
over-fitting on the data distribution.
Preprocessing Method | Effectiveness | Use Case
Lower-casing | Creates a baseline by reducing noise and vocabulary size. | Good for establishing consistency in text data.
Removing Special Characters | Removes noise, helping classifiers focus on more meaningful content. | Effective in cleaning data to improve clarity for analysis.
Stopwords, Lvl 1 | Slightly decreases performance; stopwords need to be well-selected. | Useful when fine-tuned to preserve contextually significant terms.
Stopwords, Lvl 2 | Overly aggressive removal harms performance by stripping essential features. | Not recommended; too aggressive for sentiment analysis.
Colloquial Expansion | Expands contractions, improving clarity and feature availability for classifiers. | Highly effective in text normalisation and understanding colloquial words.
Negation Handling | Huge improvement by accurately reflecting sentiment polarity changes. | Critical for accurately capturing sentiment in phrases involving negations.
Strategy | Usefulness
Combine | Useful when aiming to maximise the training dataset but requiring a separate set for model validation during training.
Mixing | Useful when development and test data are limited, and there is a need to re-balance or create new sets from existing data.
Pooling | Useful for maximising training data availability and using a subset of the combined data for final model testing.
Lastly, more work could have been done for specific political analysis, by analysing
a number of hand-picked tweets with varying political bias and testing the optimal
model’s performance more thoroughly.
Bibliography
[1] Akdogan, A. Word embedding techniques: Word2vec and tf-idf explained, 2021.
[2] Aleksandric, A., Saha, S., and Nilizadeh, S. Twitter users’ behavioral
response to toxic replies, n.d.
[3] Algorithms, E. Text data pre-processing techniques in ml, 2023.
[4] Alvi, Ali, S. F., Ahmed, S. B., Khan, N. A., Javed, M., and Nobanee,
H. On the frontiers of twitter data and sentiment analysis in election prediction:
a review. PeerJ Computer Science 9 (August 21 2023), e1517.
[5] AminiMotlagh, Shahhoseini, H., and Fatehi, N. A reliable sentiment
analysis for classification of tweets in social networks. Social Network Analysis
and Mining 13, 1 (2023), 7.
[6] Bahrawi, N. Sentiment analysis using random forest algorithm-online social
media based. Journal of Information Technology and Its Utilization 2, 2 (2019),
29–33.
[7] Barkha, B., and Sangeet, S. Sentiment analysis using twitter data:
a comparative application of lexicon- and machine-learning-based approach.
International Journal of Engineering & Technology 7, 4 (2018), 2036–2040.
[8] Bergmeir, C., Hyndman, R., and Koo, B. A note on the
validity of cross-validation for evaluating autoregressive time series prediction.
Computational Statistics Data Analysis 120 (2018), 70–83.
[9] Breiman, L. Random forests. Machine Learning 45, 1 (2001), 5–32.
[10] Dataconomy. Data preprocessing steps and requirements, 2023.
[11] Dean, B. How many people use twitter in 2023? [new twitter stats], 2023.
[12] Ebrahimi, Yazdavar, A., and Sheth, A. Challenges of sentiment analysis for
dynamic events. IEEE Intelligent Systems 32, 5 (2017), 70–75.
[13] evidentlyai. Confusion matrix explanation.
[14] Fujiwara, T., Müller, K., and Schwarz, C. The effect of social media on
elections: Evidence from the United States. SSRN Electronic Journal (2022).
[15] Guha, P. Sentiment analysis on Twitter data regarding 2020 US elections, 2020.
[16] Haddi, E., Liu, X., and Shi, Y. A study of the sentiment analysis techniques
in the social media. Procedia Computer Science 22 (2013), 747–752.
[17] Janke, J., Castelli, M., and Popovič, A. Analysis of the proficiency
of fully connected neural networks in the process of classifying digital images:
Benchmark of different classification algorithms on high-level image features from
convolutional layers. Expert Systems with Applications 135 (2019), 12–38.
[20] Khan, A., Boudjellal, N., Zhang, H., Ahmad, A., and Khan, M.
From social media to ballot box: Leveraging location-aware sentiment analysis for
election predictions. Computers, Materials & Continua 77, 3 (2023), 3037–3055.
[21] Krouska, A., Troussas, C., and Virvou, M. The effect of preprocessing
techniques on twitter sentiment analysis. In 2016 IEEE International Symposium
on Intelligent Signal Processing and Communication Systems (IISA) (2016).
[25] Park., and Lek, S. Artificial neural networks: Multilayer perceptron for
ecological modeling. Developments in Environmental Modelling 28 (2016),
123–140.
[26] Ramos, J. Using TF-IDF to determine word relevance in document queries, 2003.
[29] Tran, H. Studying the community of trump supporters on twitter during the
2020 us presidential election via hashtags #maga and #trump2020. Journalism
and Media 2, 4 (2021), 709–731.
[30] Wang, S. How to build an email sentiment analysis bot: An nlp tutorial. Toptal
Engineering Blog (2018).
[32] Yu, Wang, L., Huang, H., and Yang, W. An improved random forest
algorithm. Journal of Physics: Conference Series 1646, 1 (2020), 012070–012070.
[34] Zhang, L., and Wang, S. Deep learning for sentiment analysis: A survey. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 4 (2018),
e1253.
[36] My thanks to COM3110; I used the starter code from Assignment 2 for this project.
Appendix A
• Python 3.11.0, with its rich library support, was the primary language used,
managed via Anaconda. Development tools included VSCode and Cursor
integrated with Git for version control, managing code changes efficiently.
• SpaCy: SpaCy is utilised for its robust parsing capabilities, efficiency, and ease of
use. Tokenization, lemmatization, and part-of-speech (POS) tagging are essential
steps in the preprocessing pipeline to structure and simplify text data for analysis.
SpaCy’s optimised algorithms enable rapid processing of large volumes of text,
crucial for the dataset’s scale.
• Scikit-learn: Chosen for its comprehensive suite of algorithms and tools for
machine learning, including text vectorization (CountVectorizer, TfidfVectorizer)
and classifiers (Naive Bayes, Random Forest). Scikit-learn is praised for
its simplicity, documentation, and community support, making it ideal for
implementing and experimenting with various traditional machine learning
models in the classification design.
• Pandas: Selected for data manipulation and analysis, Pandas provides high-level
data structures and operations for manipulating numerical tables and time series.
It is instrumental in the data preprocessing and feature engineering stages,
offering efficient handling of data-frames and ease of integration with other
libraries.
• Matplotlib & Seaborn: These libraries are chosen for data visualisation,
enabling us to plot graphs and charts for exploratory data analysis, model
evaluation, and results presentation. Their wide range of plotting options and
ease of use make them suitable for conveying complex data insights visually.