
University of Sheffield

Sentiment Detection and Tracking in


Social Media Streams

Rayees Ahamed Mohamed Fayas

Supervisor: Dr Mark Hepple

A report submitted in fulfilment of the requirements


for the degree of BEng Software Engineering
in the
Department of Computer Science

May 8, 2024
Declaration

All sentences or passages quoted in this report from other people’s work have been specifically
acknowledged by clear cross-referencing to author, work and page(s). Any illustrations that
are not the work of the author of this report have been used with the explicit permission
of the originator and are specifically acknowledged. I understand that failure to do this
amounts to plagiarism and will be considered grounds for failure in this project and the
degree examination as a whole.

Name: Rayees Ahamed Mohamed Fayas

Signature: Rayees Ahamed Mohamed Fayas

Date: May 8, 2024

Abstract

Sentiment analysis is an invaluable tool in gauging the emotion of subjective text. In


recent years, the proliferation of user-generated data on social media platforms, particularly
Twitter, has opened new vistas for understanding public sentiment on a range of subjects,
such as politics. This project specifically focuses on using Twitter data for sentiment
analysis in the context of the US general elections. It analyses a rich dataset of both
labelled and unlabelled tweets, offering a unique insight into public perception and opinion
towards political figures and key election issues. The research employs advanced natural
language processing techniques and analysis, leveraging both supervised and unsupervised
learning models to predict political bias accurately. This study delves deep into the intricacies
of sentiment analysis, including data preprocessing, feature extraction, and model evaluation.

Furthermore, the report discusses the challenges and potential biases inherent in sentiment
analysis and Twitter data. The findings of this study underscore the importance and utility
of sentiment analysis in the digital age, particularly in the context of political discourse and
election predictions. This project uses support from existing body of knowledge on sentiment
analysis, providing fresh perspectives on its applications and implications, especially in
understanding and forecasting public sentiment in political contexts.

Acknowledgements

I would like to express my deep appreciation to my supervisor, Dr. Mark Hepple, for
his invaluable guidance throughout my project. Dr. Hepple’s approachable manner and
insights not only sharpened my work but also made the journey enjoyable. His assurance
and pragmatic approach to solving problems have been tremendously helpful. Thank you,
sir, for being such an outstanding mentor and for all your support—I am immensely thankful.

I’m grateful and happy to have the unwavering support of my family, which has blessed
me with the opportunity to pursue my passion in Software Engineering. I would also like
to express my appreciation to the University and the Department of Computer Science for
providing an environment that fosters growth and learning. Thank you all for your invaluable
support and encouragement.

Contents

1 Introduction 1
1.1 Project Overview and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Methodology and Expected Outcomes . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Overview of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Literature Survey 4
2.1 Project Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 What is Sentiment Analysis? . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Usage of sentiment analysis . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.3 Social Media Stream- X(Twitter) . . . . . . . . . . . . . . . . . . . . . 5
2.1.4 Applications in Election Analysis . . . . . . . . . . . . . . . . . . . . . 5
2.2 Data Collection for Twitter Sentiment Analysis . . . . . . . . . . . . . . . . . 6
2.2.1 Main Format of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Twitter-Specific Challenges . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Data Preprocessing for Twitter Sentiment Analysis . . . . . . . . . . . . . . . 7
2.3.1 Cleaning and Noise Reduction . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 Stemming/Lemmatization . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Feature Identification & Extraction . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.1 Relevant Features for Election Sentiment . . . . . . . . . . . . . . . . 8
2.4.2 Text Representation Techniques (TF-IDF, Word2Vec, BoW) . . . . . 9
2.4.3 Feature Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Supervised Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.2 Naive Bayes (NB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Unsupervised Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.1 Lexicon Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.2 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6.3 t-SNE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Evaluation Metrics & Model Performance . . . . . . . . . . . . . . . . . . . . 17
2.7.1 Precision, Recall, and F1 Scores . . . . . . . . . . . . . . . . . . . . . 17


2.7.2 K-Fold Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


2.7.3 Silhouette Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7.4 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Requirements and analysis 21


3.1 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Gold Standard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Evaluation Metrics Usage and Analysis . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 Analysis Plan and Aims . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4 Design, Implementation & Testing 25


4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Design Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.2 Feature Extraction Design . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.3 Sentiment Classification Design . . . . . . . . . . . . . . . . . . . . . . 28
4.4 Other Design Justifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.1 Using Boolean Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.2 Dataset Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.3 Model Selection and Customisation . . . . . . . . . . . . . . . . . . . . 30
4.5 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Results and Discussion 32


5.1 Experiment 1: Impact of Preprocessing Techniques . . . . . . . . . . . . . . . 32
5.1.1 Experimentation Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.3 Results and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Experiment 2: Comparing Feature Extraction Techniques . . . . . . . . . . . 35
5.2.1 Experimentation Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2.3 Results and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3 Experiment 3: Comparison of Linear Classifiers . . . . . . . . . . . . . . . . . 38
5.3.1 Experimentation Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3.2 Results and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Experiment 4: Comparison of Non-Linear Classifiers . . . . . . . . . . . . . . 44
5.4.1 Experimentation Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4.2 Results and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.5 Experiment 5: Effects of Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 48


5.5.1 Experimentation Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5.3 Results and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6 Conclusion and Further Work 51


6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2 Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Appendices 57

A 57
A.1 Vector Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.2 Distance Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.3 Hardware and Software Specifications . . . . . . . . . . . . . . . . . . . . . . 57
A.4 Selection of Tools and Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.5 Dummy Classifier Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . 59
A.6 Full Report of all results from the Code . . . . . . . . . . . . . . . . . . . . . 59
A.7 Dummy Classifier Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
List of Figures

2.1 General sentiment analysis framework . . . . . . . . . . . . . . . . . . . . . . 5


2.2 Data-set format example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 RNN Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1 Full Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


4.2 Dataprocesser Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Preprocesser method logic flow . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4 Implemented Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.1 Metrics Graphed For Naive Bayes and SVM (Trials 1-6) . . . . . . . . . . . . 33
5.2 Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values) . . . . . . . . . . . 34
5.3 Metrics Graphed For Naive Bayes and SVM . . . . . . . . . . . . . . . . . . . 36
5.4 Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values) . . . . . . . . . . . 37
5.5 SVM Scores of various mixing methods . . . . . . . . . . . . . . . . . . . . . . 49

List of Tables

2.1 The contingency table: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1 Data-Set Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.1 Experiment 1 Implementation Steps . . . . . . . . . . . . . . . . . . . . . . . 33


5.2 Quick Summary of Experiment 1 and 2 . . . . . . . . . . . . . . . . . . . . . 37
5.3 “most frequent” Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 Quick Summary of Experiment 3 . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.5 Quick Summary of Experiment 3 . . . . . . . . . . . . . . . . . . . . . . . . . 43

6.1 Summary of Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


6.2 Summary of Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.3 Summary of Experiment 3 and 4 . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.4 Summary of Experiment 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

Chapter 1

Introduction

1.1 Project Overview and Objectives


Public opinion on social media plays an increasingly pivotal role in shaping political
landscapes, which emphasises the crucial role of sentiment analysis. This project aims
to assess and monitor public sentiment, understand favourability towards candidates and
predict political bias. Platforms like Twitter, abundant with public opinion, provide an
excellent opportunity to understand public sentiment and political perspectives. With
sentiment analysis techniques, this project focuses on analysing tweets related to the 2020
US Presidential Election. The objective is to develop and compare models for gauging
political bias and understanding sentiment analysis better, offering valuable insights to
researchers and political stakeholders. This report will elaborate on Natural Language
Processing (NLP) concepts, methodologies employed, experiments conducted, and results
obtained, shedding light on the potential benefits and limitations of this approach for
political research and decision-making.

To contextualise the significance of the project, recent studies have underscored the
transformative impact of social media on the political landscape and on election outcomes,
such as ”Social Media and Elections” by Fujiwara (2019) (14). Evidence from that study suggests
that platforms like Twitter have played a discernible role in shaping electoral preferences,
evidenced by a reduction in the Republican vote share in both the 2016 and 2020 presidential
elections. Recognising the importance of sentiment analysis is integral to understanding
public opinion. In the context of election prediction, sentiment analysis serves as a valuable
tool to gauge public sentiment and forecast potential election outcomes, as also confirmed by
Guha (2020) (15).


1.2 Methodology and Expected Outcomes


The project will start with an in-depth review of existing literature on sentiment analysis,
social media mining, and Natural Language Processing (NLP) techniques, focusing on their
political applications. A crucial step will be to acquire a comprehensive Twitter dataset,
including labelled and unlabelled tweets related to the elections. Robust preprocessing
techniques will be developed to clean and prepare the dataset. Feature extraction methods,
such as Bag of Words, TF-IDF, and WordVec, will be implemented to convert sentiment-rich
content within tweets into numerical representations. These foundational steps will pave
the way for a thorough exploration of sentiment analysis in electoral discourse. The report
will sequentially explain the framework for sentiment analysis, covering data collection,
cleaning, feature extraction, and sentiment classification. The overarching objective is to
conduct a comprehensive comparison of supervised and unsupervised models, contributing
to a nuanced understanding of public opinion and accurately predicting political bias.

1.3 Overview of the Report


Chapter 2 delves into the literature and background supporting sentiment analysis. It
discusses the challenges of colloquial language in Tweets, explores various sentiment analysis
techniques, and examines different classifiers. The chapter also covers the challenges
in sentiment analysis and compares various concepts in NLP. Further explains, feature
engineering and text preprocessing techniques, discusses evaluation methods and metrics,
briefly presenting the tools and resources used in sentiment analysis.

Chapter 3 constitutes a critical phase of the dissertation, focusing on the intricacies of


data collection, exploration strategies, and experimental design. This chapter expands by
detailing the meticulous processing of the dataset, aligning every step with the overarching
project objectives. Building upon the methods and concepts introduced in Chapter 2, it
presents a comprehensive plan for experimentation, illustrating the rationale behind each
approach chosen. The chapter provides a road-map for evaluating the significance of the results
within the broader context of sentiment detection and tracking in social media streams.
Importantly, it underscores the pivotal role of experimentation methods, elucidating their
significance in driving the project toward its ultimate goals. This chapter also lays out the
designs of the experiment to be carried out which are used to assess against the defined
requirements.

Chapter 4 covers the practical aspects of the sentiment analysis system focusing on design,
implementation, and rigorous testing of the developed models. This chapter outlines the
system architecture and describes the specific methodologies employed for preprocessing,
feature extraction, and sentiment classification. It details the application of various NLP
techniques and machine learning algorithms to process and analyse Twitter data effectively.
Additionally, it provides insights into the coding practices, the use of libraries, and the
integration of different components into a cohesive system. Testing covers unit, integration
and manual tests to ensure model accuracy and robustness, culminating in a series of
evaluations that validate the system against the project’s requirements.

Chapter 5 presents a detailed analysis of the experimental results obtained from


the implementation of various sentiment analysis models. This chapter evaluates the
effectiveness of preprocessing techniques, compares the performance of feature extraction
methods, and assesses the accuracy of different classifiers through a series of controlled
experiments. Each section discusses the implications of the findings, providing insights
into the strengths and weaknesses of each approach. The chapter ends with discussing
the practical implications of the results for predicting political sentiment, emphasising the
impact of data quality, model choice, and feature selection on the overall system performance.

Chapter 6 summarises the study’s findings, highlighting the key contributions and
achievements of the sentiment analysis project. It reflects on the challenges encountered
during the project and how they were addressed. The chapter proposes future directions for
research based on the limitations identified during the testing and evaluation phases. These
include suggestions for improving model accuracy, exploring additional data sources, and
integrating more sophisticated NLP techniques.
Chapter 2

Literature Survey
This chapter offers a thorough review of Sentiment Analysis (SA), covering different methods,
tasks, and approaches; its sections are structured to follow the basic framework of
sentiment analysis. The chapter delves into machine learning algorithms, shedding light
on their applications and the role of pre-trained models, expands on the role of Twitter in this
project, and gives insights into the tools and resources used, with supporting references.

2.1 Project Background


2.1.1 What is Sentiment Analysis?
SA, a concept interchangeable with sentiment detection and opinion mining, analyses public
opinions, sentiments, attitudes, and emotions towards various entities such as services,
organisations, individuals and their attributes. It is an approach in NLP that identifies
the emotional tone behind text and aims to automate textual analysis to extract and predict
emotions, as Yuxing Qi (2023) (33) confirms. This concept is largely applied to materials such
as reviews, survey responses, and social media for applications that range from marketing to
customer service.

2.1.2 Usage of sentiment analysis


SA in NLP employs techniques ranging from simple lexicon-based methods to advanced
machine learning and deep learning approaches. The choice of technique is influenced by
specific application requirements, such as real-time analysis needs, sentiment complexity,
and data volume. Its uses are powerful and varied: in business and marketing, it is used
for monitoring brand perception and customer feedback. In the political arena, it helps
gauge public opinion on various issues and campaigns. The healthcare sector utilises it for
analysing patient feedback and discussions around health experiences. In finance, SA of
news and social media predicts market trends and investor sentiments. These applications
underscore the broad scope and versatility of sentiment analysis in extracting meaningful
insights from text.


2.1.3 Social Media Stream- X(Twitter)


Twitter is selected for SA due to its text-focused content and its compelling attribute of having
minimal censorship. This fosters an environment where users feel empowered to express their
thoughts openly, resulting in a rich and diverse dataset, ideal for in-depth sentiment analysis.
Moreover, Twitter boasts an impressive global user base, with over 500M monthly active users
confirmed by Dean (2023)(11). This diverse community has been recognised for its significant
impact on shaping online conversations and the dissemination of information (Aleksandric,
2022)(2).

2.1.4 Applications in Election Analysis


Election analysis and prediction will benefit greatly from SA within its limits. The diversity
of Twitter users, encompassing various political affiliations, age groups, and nationalities,
contributes to a better sample for SA in political predictions. Incorporating SA into politics
with Tweets enhances the accuracy of predictive models. These models provide a detailed
understanding of public sentiments and their effect on electoral outcomes, potentially
forecasting political biases. Therefore, Twitter’s function extends beyond simple data
collection; it’s about capitalising on public opinion to forecast political bias more accurately,
providing a comprehensive and representative sample for SA.

Supporting this idea, Ebrahimi et al. (2018) (12) explored the challenges of SA, emphasising
the role of Twitter in understanding public sentiment during elections. Other researchers
have focused on specific elections, such as the 2020 US presidential election, conducting
large-scale SA of Twitter data to gain insights into political sentiments, for example Alvi
(2023) (4). However, that work could have been improved by using a more comprehensive dataset
of voters and by analysing candidate-related tweets more thoroughly with keywords. The authors
could also have integrated multiple data sources and carefully considered other relevant factors,
such as traditional polling data, campaign strategies and socioeconomic factors, in their study.

Figure 2.1: General sentiment analysis framework



2.2 Data Collection for Twitter Sentiment Analysis


2.2.1 Main Format of Data
The primary data to be collected are the tweets themselves. However, other metadata such
as the timestamp of the tweet, the data origin, the number of retweets and likes, and the
user's follower count may also provide valuable context. Most exported and downloadable Tweet
datasets share a common set of key components:

• Unique user ID
• Tweet content
• Time-Stamp
• Country

Additionally, datasets may include:

• User followers count
• Likes
• Retweet number

Figure 2.2: Data-set format example

The rows in the table above represent features – measurable properties or
characteristics of the dataset, used as input for SA. In the context of SA, features can
include the text of the tweet, metadata such as the number of retweets and likes, and
other contextual information. These features are used to train machine learning models
to recognise patterns and make predictions about the sentiment expressed in the tweets.

In the context of machine learning, a model is created by an algorithm that identifies
patterns in training data and uses these patterns to make predictions without being
explicitly programmed to do so. This project will train models to understand the sentiment
of a tweet based on the words and phrases used in it. Each tweet is a data point, and
the sentiment of the tweet is the label. The model learns from these examples and their
corresponding labels. Once trained, these models can predict the sentiment of new, unseen
tweets.

2.2.2 Twitter-Specific Challenges


The collected dataset will mainly present linguistic and representational complications. Unique
characteristics of tweets, such as sarcasm, emojis, idioms, and negations, will pose a
challenge to recognise. The informal and context-dependent nature of tweets, along with
the limited space for expression, makes it challenging to accurately interpret sentiments from
the text (Wankhade, 2022; Barkha, 2018) (31) (7). Additionally, the large vocabulary used by
users and the presence of irony and sarcasm further complicate SA from tweets, as per Wang
(2018) (30). AminiMotlagh et al. (2023) (5) also suggest that abbreviations, misspelled words,
and the frequent use of emoticons and hashtags add to the complexity of SA on Twitter.
These challenges highlight the need for robust language preprocessing and machine learning
techniques to effectively analyse sentiments from Twitter data.

2.3 Data Preprocessing for Twitter Sentiment Analysis


2.3.1 Cleaning and Noise Reduction
“Noise” refers to irrelevant or meaningless data that can interfere with SA. This can include
special characters, URLs, numbers, unnecessary white spaces, and stopwords that do
not carry much sentiment. Preprocessing techniques such as lower-casing, removing
punctuation, and eliminating stopwords are commonly used to clean and normalise the
text data for SA, as outlined by Dataconomy (10) and Enjoy Algorithms (3). Stopwords
are insignificant prepositions, pronouns, and conjunctions in text (“the”, “a”, “an”, “so”,
“what”, etc.). Additionally, techniques such as lemmatization and stemming can be
applied to reduce words to their base form, further reducing the dimensionality of the data and
improving the accuracy of sentiment analysis models, as highlighted by Haddi et al.
(2013) (16). By applying these preprocessing steps, textual data is transformed into a structured,
cleaner format suitable for sentiment classification tasks, enhancing the performance and
accuracy of the sentiment analysis model, as outlined by Dataconomy, 2023 (10).
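To make these cleaning steps concrete, the sketch below shows one possible (purely illustrative) cleaning function; it assumes NLTK's English stopword list has been downloaded, and the regular expressions and example tweet are hypothetical rather than this project's exact pipeline.

```python
import re

from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    """Basic noise reduction for a single tweet (illustrative only)."""
    text = text.lower()                                  # lower-casing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"@\w+", " ", text)                    # strip @mentions
    text = re.sub(r"[^a-z\s]", " ", text)                # strip punctuation, digits, emoji
    tokens = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("I LOVE this!! http://example.com @someone #Election2020"))
# -> "love election"
```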

2.3.2 Tokenization
Tokenization is a fundamental step in NLP– the goal is to break down text into
smaller units, typically words or phrases, called tokens. Tokenization basically
converts a continuous text into a structured format, making it more suitable for
SA. These tokens are the building blocks of natural language. They can be words,
phrases, or even individual characters, depending on the level of granularity desired.

For example, let's take the sentence: “I love harry potter & the order of the phoenix.”
Tokenized: [“I”, “love”, “harry”, “potter”, “&”, “the”, “order”, “of”, “the”, “phoenix”, “.”]
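As an illustrative sketch (assuming NLTK and its ‘punkt’ tokenizer data are installed), the same sentence can be tokenized as follows; NLTK's TweetTokenizer is also shown because it keeps Twitter-specific tokens intact, although neither tokenizer is necessarily the one used in this project.

```python
from nltk.tokenize import TweetTokenizer, word_tokenize  # requires nltk.download("punkt")

sentence = "I love harry potter & the order of the phoenix."
print(word_tokenize(sentence))
# ['I', 'love', 'harry', 'potter', '&', 'the', 'order', 'of', 'the', 'phoenix', '.']

# A Twitter-aware tokenizer keeps hashtags, mentions and emoticons as single tokens.
print(TweetTokenizer().tokenize("Loving #Election2020 :) @candidate"))
```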

2.3.3 Stemming/Lemmatization
Stemming and lemmatization are both important techniques in NLP to reduce words to
their base form. Thus far, unnecessary and unimportant information has been removed; now
focus shifts to condensing essential information for further analysis.

Stemming reduces words to their base form by removing suffixes. Although the
stem of a word may not always be a valid root word, this process allows different
variations of a word to be represented by a common stem. It is a heuristic process that
removes prefixes or suffixes from words using rules that strip common affixes
without understanding the context of the word. For example, “eating” is stemmed to “eat”.

Lemmatization reduces words to their dictionary form. Unlike stemming, lemmatization
considers the word's context and applies linguistic rules to transform it. Lemmatization
produces valid root words, present in the language's dictionary, but this requires more
linguistic knowledge and is computationally more intensive. For example, “better”
becomes “good” and “running” becomes “run”.

Both methods share the same goal: to reduce the dimensionality of the data, ensuring that
variations of words are treated consistently.
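A minimal sketch of both techniques using NLTK (assuming the WordNet data has been downloaded); the example words are illustrative only.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("eating"))                    # 'eat'
print(stemmer.stem("studies"))                   # 'studi' -- stems need not be valid words
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'  -- the part-of-speech tag guides the lemma
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
```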

2.4 Feature Identification & Extraction


2.4.1 Relevant Features for Election Sentiment
Feature identification and engineering are critical in SA. Features are the building blocks
of information, converting raw textual data into a format that our models can interpret and
analyse. Specifically, for election sentiment, it is important to extract and utilise features
that are directly relevant to understanding public sentiment during an election. Some key
aspects:

• Hashtags are a crucial component in SA, as they enhance engagement and help
categorise sentiments expressed in tweets. The co-occurrence of these electoral
hashtags may provide additional context, thus improving comprehension of the intrinsic
sentiment. Leveraging this, sentiment of each tweet can be classified more accurately.

• Candidate Mentions are integral features for election sentiment analysis as well. By
identifying and analysing the sentiment expressed towards specific candidates, public
perception of each candidate becomes easier to gauge. This could provide valuable
insights into how candidates are viewed during the election period, as per Bansal
et al., 2018 (7).

• Geographical Features can provide insights into regional variations in public
sentiment, which can be critical in understanding the electoral landscape by region. They are
particularly useful in identifying swing states or regions where public sentiment is
divided (Chaudhry et al., 2021 (20)).

• Demographic Features such as the age, gender, and occupation of users are also
relevant in election SA. These features can provide insights into the sentiment of
different demographic groups, which can be crucial in understanding the diverse
perspectives and sentiments within the electorate (Alvi et-al, 2023) (4).

• Tweet Metadata such as the number of retweets, likes, and replies a tweet receives, are
important for sentiment too. These features provide insight into the reach and impact
of a tweet, which can be indicative of the popularity and influence of the sentiments
expressed in the tweets (Alvi et-al, 2023) (4).

2.4.2 Text Representation Techniques (TF-IDF, Word2Vec, BoW)


The aforementioned features are still in a textual format. To be understood by machine
learning models, they need to be converted into numerical representations using text
representation techniques. These techniques, which include methods like Bag of Words,
TF-IDF, and Word2Vec, are designed to transform text into numerical features – essentially
taking the ‘ingredients’ identified in our textual data and converting them into a ‘recipe’ that
our machine learning model can follow. The features extracted using these techniques
serve as the input for our machine learning models, enabling them to learn from the textual
data and make accurate predictions; a brief illustrative code sketch of all three techniques
follows the list below. (Vectors will be introduced later; defined in A.1.)

• Bag of Words (BoW): is the simplest and most commonly used technique.
It represents text as a bag of words, disregarding grammar and word order
but keeping a tally of each word, essentially creating a vocabulary of all the unique
words in a data-set. It represents each document as a vector of numbers,
where each number is the count of a particular word in the document.

For example, suppose we have two sentences: “Fish and chips are the best” and “Ice Cream
and Fries are the best”. We can now create a vocabulary/list of unique words from
all the sentences: [“Fish”, “and”, “chips”, “are”, “the”, “best”, “Ice”, “Cream”,
“Fries”]. We can then proceed to represent each sentence as a vector of word counts:

Sentence 1 – [1, 1, 1, 1, 1, 1, 0, 0, 0] (Fish and chips are the best)

Sentence 2 – [0, 1, 0, 1, 1, 1, 1, 1, 1] (Ice Cream and Fries are the best).
Each dimension corresponds to a word in the vocabulary. Each value represents the
count of that word in the given sentence.

• TF-IDF (Term Frequency-Inverse Document Frequency): A better textual


representation is when importance is given to the words used. TF-IDF combines two
concepts: Term Frequency (TF) and Inverse Document Frequency (IDF). TF measures
how frequently a term occurs in a document. It’s similar to the BoW model, which
assumes that all words are equally important. IDF however, measures the importance
of a term by logarithmically scaling the inverse fraction of the documents that contain
the word. This helps to weigh down the frequent terms while scaling up the rare
ones, balancing their significance. TF and IDF values are then multiplied to give the
importance of the word in the document. Words that are common in a single document
but rare in the data-set will have a high TF-IDF score (Ramos, 2003)(26). This score
can help identify words that are important within the document:

\[
\mathrm{TF}(t, d) = \frac{\text{No. of times term } t \text{ appears in doc } d}{\text{Total number of terms in doc } d}
\tag{2.1}
\]

\[
\mathrm{IDF}(t) = \log\!\left(\frac{\text{Total no. of docs in the collection}}{\text{No. of docs in the collection containing term } t}\right)
\tag{2.2}
\]

\[
\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
\tag{2.3}
\]

where:

– t is the term or word


– d is the document
– TF(t, d) is the Term Frequency of term t in document d
– IDF(t) is the Inverse Document Frequency of term t

• Word2Vec: Word2Vec sets itself apart from previously discussed methods


by its unique ability to understand the meaning of words by capturing their
semantic and syntactic relationships. It uses a prediction-based method to
create word embeddings, which are vector representations of words. Unlike
BoW and TF-IDF, which are tally-based techniques, Word2Vec uses the
context of words to predict target words, as confirmed by Akdogan (2021) (1).

There are two primary methods for constructing vector representations in Word2Vec;
Continuous Bag of Words (CBOW) and Skip-Gram. CBOW predicts the target
word from its surrounding context, while Skip-Gram does the opposite, predicting the
surrounding words from the target word as outlined by Onishi et al. (2020) (24).

CBOW takes the context words as input and predicts the target word. This process
enables CBOW to understand the meaning of a word by considering the words that
typically surround it. Skip-Gram, in contrast, operates by predicting the
surrounding words from a given target word; it is the opposite of CBOW, aiming to
understand the context in which a word is likely to appear.

CBOW and Skip-Gram are distinct, yet both fall under Word2Vec. Together, they capture
both local and global statistics of words, producing vector representations that
comprehensively reflect the semantic and syntactic intricacies of the language.
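To make the three representations above concrete, the following minimal sketch (illustrative only, not this project's feature-extraction code) builds BoW and TF-IDF matrices with scikit-learn and trains a tiny Word2Vec model with gensim. The toy tweets and parameter values are arbitrary, scikit-learn applies a smoothed variant of the IDF formula above, and the `vector_size` argument assumes gensim 4.x.

```python
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = [
    "Fish and chips are the best",
    "Ice Cream and Fries are the best",
    "the election results are the best",
]

# Bag of Words: document-term count matrix.
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform(tweets)
print(bow_vectorizer.get_feature_names_out())
print(bow.toarray())

# TF-IDF: counts re-weighted by (a smoothed form of) inverse document frequency.
tfidf = TfidfVectorizer().fit_transform(tweets)
print(tfidf.toarray().round(2))

# Word2Vec: prediction-based embeddings; sg=0 selects CBOW, sg=1 selects Skip-Gram.
tokenised = [t.lower().split() for t in tweets]
w2v = Word2Vec(tokenised, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(w2v.wv["best"][:5])                  # first few dimensions of an embedding
print(w2v.wv.most_similar("best", topn=2))
```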

2.4.3 Feature Selection Methods


Feature Selection is a process that retains a subset of existing features, while Feature
Identification & Extraction involves creating new features or transforming existing ones.

The Chi-Squared test is a commonly used feature selection method. This statistical
method is particularly suited to SA as it assesses the independence between features
and the target sentiment class. It measures the discrepancy between the expected
and observed frequencies of feature occurrences, identifying the features that are most
informative for sentiment classification. The Chi-Squared test calculates a chi-squared
value for each feature; higher chi-squared values indicate greater relevance to
sentiment. This method helps exclude irrelevant or redundant features, thereby
streamlining the sentiment analysis model, as supported by Nurhayati et al. (2019).

\[
X^2 = \frac{(A \times D - B \times C)^2 \times (A + B + C + D)}{(A + B) \times (C + D) \times (A + C) \times (B + D)}
\tag{2.4}
\]

The chi-square statistic (X²) is calculated based on

Table 2.1: The contingency table:

Positive Sentiment Negative Sentiment


Word Present A B
Word Absent C D

Here, A, B, C, and D are the counts in each cell of the contingency table. The chi-square
statistic is used to test the hypothesis that the occurrence of a specific word is independent
of sentiment. Higher values of X² suggest a stronger association between the word and
sentiment.
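As an illustration, scikit-learn's chi2 scorer can be combined with SelectKBest to keep only the most sentiment-informative words; the toy tweets, labels and the value of k below are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

tweets = ["great candidate", "terrible debate", "great speech", "terrible policies"]
labels = [1, 0, 1, 0]                      # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(tweets)
selector = SelectKBest(chi2, k=2)          # retain the 2 highest-scoring word features
X_selected = selector.fit_transform(X, labels)

print(selector.scores_)                    # chi-squared value per word feature
print(X_selected.shape)
```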

Another method is the Least Absolute Shrinkage and Selection Operator (LASSO).
LASSO is a regularisation technique* that can be seamlessly integrated into the
model-building. It introduces a penalty term based on the absolute values of the feature
coefficients, encouraging sparsity in the feature space. In SA, LASSO helps identify and
retain the most impactful features while shrinking others to zero. During the training of
a SA model, LASSO simultaneously performs feature selection and model fitting*. This
ensures that the resulting model not only predicts sentiment accurately but also focuses on
the most influential features, enhancing interpretability. (Muthukrishnan et al, 2016) (23)

\[
\hat{\beta} = \underset{\beta}{\arg\min}\left\{ \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
\tag{2.5}
\]

• The first term, \( \frac{1}{2N} \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \big)^2 \), represents the residual sum of squares,
highlighting the difference between the observed and predicted values.

• The second term, \( \lambda \sum_{j=1}^{p} |\beta_j| \), is the L1 penalty. This term improves sparsity in the
model coefficients, reducing some coefficients to zero.

• The λ is a hyper-parameter that must be experimented with, often via cross-validation.


It plays a key role in balancing the fit of the model (decreasing the first term) against
the complexity of the model (controlled by the second term), thereby preventing
over-fitting.
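A minimal scikit-learn sketch of LASSO-based selection (illustrative only; the feature matrix, sentiment scores and alpha value are made up, with `alpha` playing the role of λ above).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy feature matrix (rows = tweets, columns = word features) and continuous sentiment scores.
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.0, 0.2],
              [0.1, 0.9, 0.1],
              [0.0, 0.8, 0.3]])
y = np.array([1.0, 1.0, -1.0, -1.0])

lasso = Lasso(alpha=0.1)     # alpha corresponds to the lambda penalty weight
lasso.fit(X, y)
print(lasso.coef_)           # uninformative features are shrunk towards (or exactly to) zero
```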

2.5 Supervised Learning Models


Supervised learning is when models are trained on a labelled dataset. Each input data point is
paired with a corresponding output label. The model learns to map the input to the correct
output by identifying patterns from these labelled examples. Once trained, the model can
make predictions on new, unseen data. The SA problem addressed with these supervised
machine learning models may fall under two types of methods: classification or regression.

A regression task trains models to predict a continuous outcome (sentiment score) based
on one or more variables (Tweet features). So, the results would be a continuous value.

A classification task trains models to categorise text data into predefined sentiment
classes, allowing for automated labelling of sentiments in unseen textual content.

This project will compare the performance of all implemented models. Therefore, viewing
this primarily as a regression problem gives a finer-grained scale of understanding than a
classification problem in which there are only three or four categories of sentiment.

2.5.1 Random Forests


The Random Forests algorithm is a supervised machine learning algorithm that operates by
constructing numerous decision trees at training time and outputting the class that is the
mean prediction (regression) of the individual trees (Breiman, 2001)(9).

Random Forests is an excellent choice due to its ability to handle high-dimensional and
sparse datasets, which are common in text analysis. It can also deal with missing values,
outliers, and imbalanced classes, which may affect the performance of other algorithms.
Moreover, Random Forests are straightforward to implement and tune, as they have few
hyper-parameters and require little preprocessing or scaling of the data, as outlined by
Bahrawi (2019) (6).

Random Forests work by selecting random samples from a given dataset and constructing a
decision tree for each sample. The algorithm then performs a vote over the predicted results, and the
prediction with the most votes is selected as the final prediction (Yu et al., 2020) (32). This
approach helps to improve the accuracy and robustness of the model. For example, a study
by Karthika et al. (18) used the Random Forest algorithm for SA of social media data and
achieved an accuracy rate of close to 85%. Bahrawi, with data sources from Twitter and
using the Random Forest algorithm, achieved an accuracy of around 75%.
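An illustrative scikit-learn sketch of a Random Forest sentiment classifier; the toy tweets, labels and hyper-parameters are hypothetical, not this project's configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

tweets = ["love this candidate", "awful debate", "great policies", "worst speech ever"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features feed an ensemble of 100 decision trees.
model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(tweets, labels)
print(model.predict(["what a great debate"]))
```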

2.5.2 Naive Bayes (NB)


The Naive Bayes algorithm is an efficient probabilistic machine learning algorithm for SA due to
its simplicity and effectiveness in handling text data. It uses word and feature frequencies to
calculate the probability of a document belonging to a given sentiment class. However, it's
important to note that NB works well with discrete values and not with continuous values.

\[
P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}
\tag{2.6}
\]
where:

• P (c|d) is the probability of class c given document d

• P (d|c) is the probability of document d given class c

• P (c) is the prior probability of class c

• P (d) is the probability of document d



Naive Bayes is implemented by calculating the probability of a document belonging to a
particular sentiment class based on the frequency of words and features in the document.
It relies on the assumption that features are independent given the class label, which simplifies
the calculations. This assumption allows the algorithm to work well with text classification
and spam filtering tasks, as per Sharma (2020) (28). Lastly, NB uses a labelled dataset for
training; it can then predict the sentiment of new, unseen documents based on the learned
probabilities.
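A minimal scikit-learn sketch of a Multinomial Naive Bayes sentiment classifier over word counts (toy data, purely illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["love this candidate", "awful debate", "great policies", "worst speech ever"]
labels = ["positive", "negative", "positive", "negative"]

# MultinomialNB works with discrete word counts, matching the note above.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)
print(model.predict(["great candidate"]))
print(model.predict_proba(["great candidate"]))   # learned class probabilities
```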

2.5.3 Neural Networks


A neural network is a computing model designed to process information similar to how our
brain would. It is a type of machine learning model that is inspired by the structure and
function of biological neural networks in our mind. They consist of interconnected nodes,
or “neurons”, which are organised into layers. There are usually three types of layers in
a neural network: The input layer, which receives the initial data; The hidden layer(s),
where the processing is done; and the output layer, which produces the final result. Each
neuron receives inputs, performs a computation on these inputs, and passes the result to
the next layer. The “learning” in a neural network happens during back-propagation. In
back-propagation, the network makes a prediction based on the input data it receives,
and then compares this prediction to the expected output. The difference between the
predicted and actual output is calculated as an error. This error is then propagated back
through the network, adjusting the weights of the connections between the neurons. This
incredible process is repeated many times, and the network “learns” by adjusting its weights
to minimise the error. However, they require large amounts of data to train effectively,
they can be computationally intensive, and their “black box” nature can make it difficult to
interpret why they make certain decisions (Park et al, 2016) (25).

Recurrent Neural Networks (RNNs) and Convolutional Neural Networks


(CNNs). RNNs are designed to work with sequential data, making them particularly
suitable for NLP tasks like sentiment analysis. They are unique with their ability to
remember previous inputs in a sequence, using that to influence the current output. They
work by maintaining a hidden state that captures information about a sequence. During the
forward pass, RNN takes an input and a hidden state, uses these to compute the new hidden
state, and generates an output. This process is repeated for each element in the sequence.
However, RNNs do suffer from the vanishing gradient problem, where the contribution
of information decays over time, making it difficult for the network to learn long-range
dependencies in the data.

Figure 2.3: RNN Representation
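As an illustrative sketch (assuming TensorFlow/Keras is available), a small RNN for binary tweet sentiment could look like the following; the vocabulary size, layer sizes and the `padded_token_ids`/`labels` inputs are hypothetical placeholders, not the architecture used in this project.

```python
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.models import Sequential

# Token IDs -> embeddings -> recurrent hidden state -> positive/negative probability.
model = Sequential([
    Embedding(input_dim=10_000, output_dim=64),   # assumed vocabulary of 10,000 token IDs
    SimpleRNN(32),                                # hidden state carried across the tweet
    Dense(1, activation="sigmoid"),               # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_token_ids, labels, epochs=5)   # hypothetical preprocessed inputs
```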

A CNN has multiple layers of convolutions, interspersed with non-linear activation functions.
Each convolution layer applies multiple filters to the input, and each filter is responsible for
extracting a specific feature from the input data. A typical CNN architecture consists of
three types of layers: convolutional layers, pooling layers, and fully connected layers.

The Convolutional Layer is the core segment of a CNN. The layer’s parameters consist
of a set of learnable filters or kernels, which have a small receptive field but extend through
the full depth of the input volume. During the forward pass, each filter is convolved across
the width and height of the input volume, computing the dot product between the entries of
the filter and the input, producing a 2-dimensional activation map of that filter. The output
of the convolutional layer is a stack of these activation maps, one for each filter.

The Pooling Layer is a down-sampling layer that follows the convolutional layer. It
reduces the dimensions of each feature map while retaining the most important information.
There are several types of pooling operations, but the most common one is max pooling,
which extracts the maximum value from each segment of the feature map.

Lastly, the Fully Connected Layer. After several convolutional and pooling layers, the
high-level reasoning in the neural network happens in the fully connected layer. Neurons
in this layer have connections to all activations (outputs) in the previous layer, as seen
in regular (non-convolutional) Artificial Neural Networks. Their activations can thus be
computed as an affine transformation, with matrix multiplication followed by a bias offset
(Janke et al, 2019) (17).

2.6 Unsupervised Learning Models


Unsupervised learning models are designed to discover hidden patterns or structures within
unlabelled data. Unlike supervised learning models, which learn from labelled data to achieve
a defined goal, unsupervised learning is more exploratory, aiming to uncover underlying
structures or relationships in the data.

2.6.1 Lexicon Approach


Lexicon-based SA in NLP is used to determine the sentiment of a text by analysing the
words it contains. It uses a sentiment lexicon or a comprehensive dictionary of words and
phrases, each tagged with associated sentiment values. These values can range from binary
to numerical scores, reflecting the intensity of sentiment. Lexicon analysis can involve either
utilising a pre-existing lexicon or developing a custom lexicon tailored to the specific needs of
the analysis. This approach is particularly valued for its transparency and interpretability,
as it allows for a direct understanding of how each word contributes to the overall sentiment
of the text.
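As a concrete illustration of the lexicon approach (assuming NLTK's VADER lexicon has been downloaded), the sketch below scores a single tweet; VADER is one possible lexicon, not necessarily the one used in this project.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

analyser = SentimentIntensityAnalyzer()
scores = analyser.polarity_scores("I absolutely love this candidate!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}; compound > 0 suggests positive
```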

Öhman (2021) (35) emphasises the effectiveness of lexicon-based methods, especially in fields
like computational social science and digital humanities, noting that while these methods have
been criticised for their lack of validation and accuracy, they prove highly useful in projects
where neither qualitative analysis nor machine learning-based approaches are feasible.

2.6.2 K-means Clustering


The K-means Clustering algorithm works by dividing a group of ’n’ data points into ’k’ subgroups
based on similarity and their mean distance from the centroid of that particular subgroup.
Here, K is the predefined number of clusters to be formed. The algorithm is implemented as
follows:


• 1. Initialisation: Select the value of K to decide the number of clusters to be formed.
Then, select random K points which will act as cluster centroids (Sharma, 2021) (28) .

• 2. Assignment: Assign each data point, based on their distance from the randomly
selected points (centroid), to the nearest centroid, which will form the predefined
clusters.

• 3. Update: Compute mean of all the points for each cluster and set the new centroid.

• 4. Repeat: Repeat step 2 and 3, which reassigns each data point to the new closest
centroid of each cluster, until the centroids do not change significantly, meaning they
have converged.

As described by Sharma (2021) (28).
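An illustrative scikit-learn sketch of K-means over TF-IDF vectors; the toy tweets and the choice of K = 2 are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["love the candidate", "great candidate", "terrible debate", "awful debate performance"]

X = TfidfVectorizer().fit_transform(tweets)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)   # K = 2 clusters
cluster_labels = kmeans.fit_predict(X)

print(cluster_labels)                   # cluster assignment per tweet
print(kmeans.cluster_centers_.shape)    # one centroid per cluster
```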



2.6.3 t-SNE
t-Distributed Stochastic Neighbour Embedding (t-SNE) is a technique useful in SA for visualising
the distribution of text data in a lower-dimensional space. By projecting high-dimensional
features of textual data into a two-dimensional space, t-SNE aids in understanding the
relationships between different sentiment classes and visualising complex structures in SA
datasets. The implementation steps for t-SNE include:

• Pairwise Similarities: The algorithm computes pairwise similarities between data
points in the high-dimensional space, using a Gaussian distribution to measure them.

• Student’s t-Distribution: These pairwise similarities are then represented using


a probability distribution in a lower-dimensional space. The use of a Student’s
t-distribution maintains relative distances between points and prevents crowding of
similar points.

• Gradient Descent optimises the locations of points in the lower-dimensional


space. The goal is to minimise the divergence between the high-dimensional and
low-dimensional pairwise similarities.
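A minimal scikit-learn sketch of t-SNE on a tiny, hypothetical TF-IDF matrix; the perplexity value is only valid because the toy corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

tweets = ["love the candidate", "great candidate", "terrible debate",
          "awful debate", "great turnout", "worst policies"]

X = TfidfVectorizer().fit_transform(tweets).toarray()
# perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(X)
print(embedding.shape)    # (n_tweets, 2) -- coordinates ready for a scatter plot
```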

2.7 Evaluation Metrics & Model Performance


The effectiveness and accuracy of the models are assessed with various evaluation metrics.
Their results give insights into how well the models are performing and help make informed
decisions about their suitability for predicting sentiment.

2.7.1 Precision, Recall, and F1 Scores


• Precision measures the ratio of correctly predicted positive instances to all instances
predicted as positive. This metric is particularly vital in SA for its role in minimising
false positives – instances where tweets are incorrectly classified as positive. High precision
scores signify the model's accuracy, emphasising its ability to avoid irrelevant or incorrect
predictions, as supported by Singh (2022). When analysing tweets, the volume of data
and the nuances of human language can easily lead to misclassification. By prioritising
precision, SA models can provide more accurate insights into public sentiment.

• Recall focuses on capturing all relevant positive or negative instances. This metric is
particularly crucial when the objective is to capture the complete spectrum of relevant
sentiments expressed in tweets. High recall scores indicate the model's effectiveness in
minimising false negatives, ensuring that it does not overlook significant tweets.

• F1 combines precision and recall into a single metric, providing a balanced evaluation
of the model's overall performance. Computed as the harmonic mean of precision and
recall, it considers both false positives and false negatives, ensuring no single aspect
dominates and making precision and recall equally important. The F1-score offers
a comprehensive assessment of the model's accuracy, which is crucial in SA, where an
imbalance between precision and recall can misrepresent the true sentiment, as supported by
Margherita et al. (2020) (22).

\[
F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\tag{2.7}
\]

where:

– precision is the proportion of true positive predictions among all positive


predictions
– recall is the proportion of true positive predictions among all actual positives
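These metrics can be computed directly with scikit-learn, as in the illustrative sketch below (the gold and predicted labels are made up).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["pos", "pos", "neg", "neg", "pos"]   # gold sentiment labels
y_pred = ["pos", "neg", "neg", "pos", "pos"]   # model predictions

print(precision_score(y_true, y_pred, pos_label="pos"))   # 2/3: one false positive
print(recall_score(y_true, y_pred, pos_label="pos"))      # 2/3: one false negative
print(f1_score(y_true, y_pred, pos_label="pos"))          # harmonic mean of the two
```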

2.7.2 K-Fold Validation


K-Fold Cross-Validation is a resampling procedure used to evaluate machine learning models
on a limited data sample. Resampling means repeatedly drawing samples from a dataset
to train and evaluate a model. K-Fold Cross-Validation has a single parameter, K, which
refers to the number of groups that a given data sample is to be split into – often referred to
as creating “folds”.

The original sample is randomly partitioned into K equal-sized sub-samples or folds. Then, a
single sub-sample is retained as the validation data for testing the model, and the remaining
K-1 sub-samples are used as training data. This cross-validation process is then repeated K
times, with each of the K sub-samples used once as the validation data. The K results from
the folds can then be averaged to produce a single estimation. This estimation provides a
better measure of model performance, as it reduces the variance of the performance estimate
and allows more data to be used for training.

K-Fold Cross-Validation is particularly useful when the dataset is limited, as it allows


for efficient use of data by cycling through different training and validation datasets.
This helps mitigate over-fitting by providing a more generalised model performance.
K-Fold Cross-Validation can be used to evaluate the performance of both supervised and
unsupervised models for sentiment analysis of Twitter data. The choice of K can be
experimented with to find the optimal balance between bias and variance. (Bergmeir et
al, 2018 (8))
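An illustrative sketch of K-Fold evaluation with scikit-learn's cross_val_score (toy tweets and labels, K = 3); for a classifier, scikit-learn stratifies the folds by default.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["love it", "hate it", "great result", "awful result", "so good", "so bad"]
labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, tweets, labels, cv=3, scoring="f1")   # K = 3 folds
print(scores, scores.mean())
```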

2.7.3 Silhouette Score


Clustering is a type of unsupervised learning method in machine learning. It involves
grouping data points into clusters by their similarity. The goal is to partition the data into
sets such that the data points in the same cluster are more similar to each other than those
in other clusters.

The Silhouette Score metric is used to evaluate the quality of clusters specifically in
unsupervised learning. It provides a quantitative measure of how well each object fits within
its assigned cluster and how distinct the clusters are from each other. The Silhouette Score
ranges from -1 to 1, where a high value indicates that the object is well matched to its own
cluster and poorly matched to neighbouring clusters. If most objects have a high value, then
the clustering configuration is appropriate. If many points have a low or negative value,
then the clustering configuration may have too many or too few clusters as per Shahapure
et al, 2020 (27).

The Silhouette Score is calculated using distance metrics such as the Euclidean distance or
Manhattan distance. These distance metrics are used to calculate the average intra-cluster
distance (a) and the average inter-cluster distance (b) for each data point, which are then
used to compute the Silhouette Score. The Score can then be used to determine the
natural number of clusters within a dataset. The highest Silhouette Score indicates the
optimal number of clusters, providing valuable insights into the structure of the data and the
effectiveness of the clustering algorithm (A.2).
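As a sketch of how the score can guide the choice of K (using synthetic points in place of vectorised tweets, purely for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points standing in for vectorised tweets.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # the highest score suggests the natural K
```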

2.7.4 Confusion Matrix


While some of the metrics derived from the Confusion Matrix, like Precision and Recall,
may overlap with the metrics that are already being used, the Confusion Matrix itself offers
a visual representation of the model's performance, making it easier to identify areas where
the model may be struggling. It is a table with four different combinations of predicted and
actual values, namely True Positives (TP), True Negatives (TN), False Positives (FP), and
False Negatives (FN).

• True Positives (TP): Model correctly predicts the positive class.

• True Negatives (TN): Model correctly predicts the negative class.

• False Positives (FP): Model incorrectly predicts the positive class.

• False Negatives (FN): Model incorrectly predicts the negative class.



As a recap:

• Accuracy: Ratio of correctly predicted observations to the total observations.
It is calculated as (TP+TN)/(TP+FP+FN+TN) [5].

• Precision: Ratio of correctly predicted positive observations to the total predicted
positive observations. It is calculated as TP/(TP+FP).

• Recall: Ratio of correctly predicted positive observations to all observations in the actual
class. It is calculated as TP/(TP+FN).

• F1 Score: The weighted average of Precision & Recall. It finds the balance between
precision and recall.

The confusion matrix can be used to evaluate the quality of clusters. However, it’s important
to note that the confusion matrix is primarily designed for classification problems, and its
use in clustering requires some adaptation. In clustering, there is no association provided
by the clustering algorithm between the class labels and the predicted cluster labels. In
a clustering problem, the rows of the confusion matrix represent the actual labels, and
the columns represent the new cluster names (i.e., cluster-1, cluster-2, etc.). The diagonal
elements of the confusion matrix represent the instances where the actual label matches the
predicted cluster, and the off-diagonal elements represent the instances where the actual
label does not match the predicted cluster as described in (13).
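
A small illustrative sketch of this adaptation (the labels and cluster ids shown are invented
placeholders) is given below; rows hold the actual labels and columns hold the cluster names.

import pandas as pd

# Gold sentiment labels and the cluster each tweet was assigned to (placeholders)
true_labels = ["negative", "neutral", "positive", "neutral", "negative"]
cluster_ids = ["cluster-1", "cluster-2", "cluster-3", "cluster-2", "cluster-2"]

# Rows: actual labels; columns: predicted cluster names
cm = pd.crosstab(pd.Series(true_labels, name="actual"),
                 pd.Series(cluster_ids, name="cluster"))
print(cm)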

Note on the State-Of-the-Art in SA: This project focuses primarily on classical machine learning
models; however, deep learning models, especially those based on neural architectures such as
BERT and RoBERTa, have been outstanding in enhancing the interpretation of sentiment within
large volumes of data. Deep learning approaches excel at capturing subtlety in data, which is key
for understanding complex human emotions and opinions in politics (Zhang et al., 2018) (34).
Chapter 3

Requirements and analysis


In previous chapters, we laid the groundwork by discussing, with reference to the literature, the
key ideas and methodologies that underpin SA. In this chapter, we analyse how these concepts
will be applied to meet the project's objectives and explore the rationale behind choosing these
processes. The aim is to understand which method yields the best model, along with the logic of
these processes and their implications for the results.

3.1 Aims and Objectives


The aim of this project is to use advanced NLP techniques in SA to accurately analyse
political tweets, identifying the most effective classifiers through a detailed exploration
of machine learning models with visual and statistical metrics. The project's objectives
include compiling a representative dataset of election-related tweets, employing a variety
of text preprocessing methods such as negation handling and colloquial expansion, and
rigorously testing both linear and non-linear classifiers across multiple experiments. These
methods are optimised through extensive trials involving different feature extraction techniques,
including TF-IDF and Word2Vec, to enhance the models' ability to differentiate between
nuanced political sentiments.
Comprehensive evaluation metrics include accuracy, precision, recall, and F1 scores, enabling
a critical assessment of how preprocessing and feature extraction adjustments impact classifier
effectiveness. The project also experiments with various dataset compositions to improve model
validity and robustness. Ultimately, the goal is to establish accurate linear and non-linear SA
models that predict political bias well.

3.2 Datasets
In this project, various models will be trained and tested with labelled and unlabelled
datasets, as outlined in Chapter 2. The design and further usage of the dataset will be
discussed in Chapter 4.


Table 3.1: Data-Set Details

Data Set Type | Purpose
Labelled, Training Data-Set | Model uses this for learning
Labelled, Development Data-Set | Model uses this for learning
Unlabelled, Testing Data-Set | Checking model performance

All of the data-sets will be processed in the same way.

3.2.1 Training data


The training dataset is a collection of tweets from the 2020 US elections and other generic
tweets from SemEval. This dataset is labelled: each tweet is associated with a sentiment
value indicating its sentiment polarity. It will be used to train the supervised learning
models, enabling them to learn and generalise patterns from these examples.

3.2.2 Gold Standard


The golden dataset, also known as test / validation data-set, is a separate collection of labelled
tweets, not used during the training process. Instead, it serves as a benchmark to evaluate the
performance of the trained models. By testing the models on this unseen data, we can assess
their ability to generalise and accurately predict sentiment, providing us with a measure of
their effectiveness and reliability. The Gold standard (only used for testing) is separate from
the dataset used for training and developing the classifiers.

3.2.3 Online resources


Feature selection using keywords and hashtags, and possibly the creation of a custom lexicon,
will be supported by online data monitoring resources, for example data from the research of
Chen et al. (2021) (29).

3.3 Methodology
The methodology for the project follows the sequence of sections in Chapter 2. Here are the
steps, simplified:

1. Data collection: Tweets can be collected either by web scraping or by using publicly
available data-sets; here, the latter is chosen. The golden dataset, together with statistics
from a dummy classifier*, will be used to compare results for model performance.
Datasets will be collected from Kaggle, the California State University research website,
the ACL Anthology (SemEval) and GitHub.

2. Cleaning/Preprocessing: The same approach explained in Section 2.3 will be
applied: Cleaning and Noise Reduction to remove irrelevant elements and special
characters from tweets; then Tokenization to break the cleaned text into individual
words; subsequently Stemming/Lemmatization to reduce words to their base forms,
enhancing data uniformity; and finally Stopword Removal to eliminate common words
that have little analytical value. Together these steps manage the linguistic complexities
of tweets effectively.

3. Feature Extraction: Section-2.4 has covered what Feature Extraction means.


Techniques like Bag of Words, TF-IDF, and Word2Vec will be implemented to convert
these textual elements into a numerical format. This process is crucial for the machine
learning models to effectively analyse the sentiment within tweets, particularly in the
context of politics.

4. Sentiment Classification & Evaluation: All of the above steps will now be
put to use. Section 2.5 explores the supervised learning models that will be implemented
for their strength in sentiment analysis on high-dimensional, sparse data-sets. Section 2.6
describes the unsupervised learning models that will be implemented for their ability to
uncover patterns and group data based on features. The performance of these models will
be compared using relevant metrics to determine the most effective approach for sentiment
analysis in the context of social media and politics. Key models that will be developed are
lexicon implementations, SVM, Naive Bayes and Random Forests. (A simplified pipeline
sketch follows below.)
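
The sketch below illustrates how steps 2-4 fit together in a single scikit-learn pipeline; it is a
simplified stand-in for the project's own DataProcessor and FeatureExtractor classes, and the
variable names (train_tweets, train_labels, test_tweets) are assumptions for the example.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Preprocessing/feature extraction followed by a linear classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LinearSVC()),
])
pipeline.fit(train_tweets, train_labels)      # lists of tweet texts and polarity labels
predictions = pipeline.predict(test_tweets)   # sentiment predictions for unseen tweets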

3.4 Evaluation Metrics Usage and Analysis


In Section 2.7, we discussed various metrics for evaluating SA models. For comparing
the performance of the models in this study, we will primarily use Precision, Recall, F1
Score, and the Confusion Matrix. These metrics have been chosen for their effectiveness
in providing a balanced and comprehensive assessment of model performance, particularly
in terms of accuracy, sensitivity, and the ability to handle both positive and negative
sentiment classifications. Results from sklearn.metrics's classification_report and
various performance metrics will be used for analysis in each experiment, and a larger report
is linked in A.6.
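
A brief illustration of how these reports might be produced (y_dev and predictions are assumed
to come from a trained classifier):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall and F1, plus macro/weighted averages
print(classification_report(y_dev, predictions, digits=3))
# Rows: actual classes; columns: predicted classes
print(confusion_matrix(y_dev, predictions))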

3.4.1 Analysis Plan and Aims


• Experiment 1 and 2: The goal of exploring different preprocessing techniques is
to establish an efficient and effective foundational model for sentiment analysis. The
choice of these specific methods is to match the review from Chapter 2 and to find
the most computationally efficient techniques without compromising accuracy. These
initial experiments (Trials 1-6 in Experiment 1) are designed to directly inform and
optimise the preprocessing pipeline.

Explaining the choice of classifiers: SVM is selected due to its robustness in handling
high-dimensional data. Its capacity to optimise the margin of separation between classes makes
it intriguing, particularly when preprocessing modifications could significantly influence feature
representation and the decision boundary. The versatility of SVM to work with both linear and
non-linear kernels also allows for an in-depth exploration of preprocessing impacts across varied
model complexities.

NB is chosen based on its probabilistic foundation, which is inherently sensitive to the frequency
and presence of features, an aspect directly manipulated through preprocessing techniques. Its
computational efficiency and strong performance with text data make it an ideal candidate for
baseline assessments and iterative experiments, enabling a clear demonstration of how
preprocessing influences assumptions about feature independence.

• Experiment 3 and 4: Linear models like Naive Bayes, Logistic Regression, and
Linear SVM are known for their efficiency and effectiveness in high-dimensional spaces
where relationships between features might be linearly separable. In contrast, non-linear
models like Random Forest, RBF-SVM, and Decision Trees are crucial when dealing
with the nuanced and multi-layered nature of political discourse, where the sentiment
might not be clearly expressed. Therefore, this selection of classifiers is chosen and
compared to identify the most appropriate model for analysing sentiment in political
contexts on social media platforms (Experiments 3 and 4, each with 4 trials).

• Experiment 5: Datasets play a vital role in facilitating learning, testing, and


developing models. By experimenting with various compositions of dataset mixes, the
project aims to find if there is a change in results and the possible reason behind
the change. Specifically, when working with social media datasets, the availability of
labelled or ”golden” data may often be limited. Understanding the effects of dataset
size and composition becomes especially crucial in such settings. This insight will help
in evaluating how different dataset characteristics influence the accuracy and reliability
of sentiment analysis models.

Please refer to A.7 for more about Dummy Classifiers


Chapter 4

Design, Implementation & Testing

4.1 Overview
Following a modular design philosophy, the implementation was divided into distinct
modules matching the principles of the standard SA pipeline. This pipeline was chosen
for its demonstrated robustness and reliability, thus offering a stable framework for
systematically integrating experimental enhancements.

A "greedy approach" is adopted for continuous improvement and optimisation at each
subsequent step. Each design, once implemented, underwent meticulous evaluation and was
selected based on its potential to deliver incremental enhancements in model performance and
accuracy. This iterative exploration and refinement was guided by selecting the optimal
configuration at every stage, to build a robust and accurate model.

For detailed hardware and software specifications, refer to appendix A.3

4.2 Design Technique


Using a Gantt chart and ideas organised through mind mapping, this project adopts an
Agile design approach, to emphasise iterative development to manage complex and evolving
requirements and solutions effectively. This approach ensures the project stays in line with
its objectives and timelines through regular updates and feedback. Key tools such as SpaCy,
for its processing speed, and TensorFlow, for its deep learning capabilities, form a robust
foundation for SA. This strategic selection and iterative optimisation are integral to a design
philosophy focused on systematic enhancement, enabling precise and in-depth analysis of
complex sentiment dynamics.


4.3 Implementation Details


The implementation has 3 core components: preprocessing, feature extraction, and
sentiment classification. For clearer design, program logic is compartmentalised into clean,
distinct classes, each responsible for a fundamental stage in the sentiment analysis process.

Figure 4.1: Full Class Diagram

4.3.1 Data Preprocessing


The DataProcessor class is a crucial component of the sentiment analysis system, specifically
handling preprocessing. The class is designed with a focus on maintaining the integrity and
context of the data, which is essential for accurate SA.

Figure 4.2: DataProcessor Attributes



Key functionalities include:

• Standardising Colloquial Expressions: Through the expand_colloquial method,
colloquial terms are mapped to their formal equivalents (e.g., "gonna" → "going to"),
promoting consistency in the dataset. This process aids in normalising the text without
diluting its meaning and expands possible negative colloquial terms otherwise lost.

• Dynamic Stop-word Management: The calculate_dynamic_threshold and
expand_stopwords methods adjust the stop-word list based on word frequency
within the corpus. This innovation allows for a more nuanced approach to filtering out
noise, balancing the need to remove irrelevant words while keeping significant terms.

• Enhancing Negation Sensitivity: mark_negation_dep enhances the understanding
of context and nuances in natural language by identifying negation tokens, like "not",
in dependency parsing. It focuses on verbs, adjectives, and adverbs, which typically
convey sentiment better. When a negation token is detected, the method analyses the
token's sub-tree to locate dependent verbs, adjectives, and adverbs, prefixing them with
"NEG_". With this, both negated and non-negated sentiments are accurately captured
(a sketch of this idea follows the list).

• preprocess_data brings everything together by dynamically calculating stop-word
thresholds and applying multiple transformation steps. It converts text to lowercase,
expands colloquialisms to formal equivalents, removes special characters, and applies
negation handling to improve analysis accuracy. Optionally, it uses the VADER tool
for immediate sentiment scoring. The code tokenizes the text, efficiently filtering out
stop-words while retaining contextually important keywords. When enabled, it also
extracts Word2Vec tokens and POS-tags, integrating these features into an updated
dataset for further analysis.
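
The following is a rough sketch, in the spirit of the mark_negation_dep method described above,
of negation marking with spaCy dependency parsing; the exact rules and the NEG_ prefix format
are assumptions for illustration.

import spacy

nlp = spacy.load("en_core_web_sm")

def mark_negation(text):
    doc = nlp(text)
    negated = set()
    for tok in doc:
        if tok.dep_ == "neg":                       # negation tokens such as "not", "n't"
            # mark sentiment-bearing words in the sub-tree of the negated head
            for t in tok.head.subtree:
                if t.pos_ in {"VERB", "ADJ", "ADV"}:
                    negated.add(t.i)
    return " ".join(("NEG_" + t.text) if t.i in negated else t.text for t in doc)

print(mark_negation("The policy is not good"))      # e.g. "The policy is not NEG_good"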

Figure 4.3: Preprocessor method logic flow



4.3.2 Feature Extraction Design


FeatureExtractor class is designed to transform processed text into structured data that
machine learning models can interpret. It incorporates various text vectorization techniques,
including TF-IDF and Word2Vec, to enrich the feature set. These methods are chosen to
capture both the frequency and contextual relevance of terms within the corpus, aligning with
the project’s design philosophy and to optimise text representation for analytical accuracy.

• Bag of Words (BoW): Implemented using CountVectorizer, this technique is the
base approach to text vectorization in SA. It effectively identifies the high-frequency
terms within the dataset, setting a performance baseline crucial for initial SA. The
simplicity of BoW makes it particularly useful for spotting trends in the data.

• TF-IDF Vectorization: Implemented using TfidfVectorizer, this method succeeds
BoW by using a measure of word importance across the corpus. It is specifically tailored
to recognise the significance of terms that are pivotal in political discussions but may
not appear frequently. The code allows for fine-tuning through parameters such
as n-gram range and term frequency thresholds, thus refining the representation of
features to better capture the nuances of sentiment expression.

• Word Embedding: Word2Vec embedding captures the semantic relationships between


words, a critical aspect when dealing with political sentiments. This technique allows for
an advanced understanding of context and sentiment beyond what traditional models
offer, addressing the project’s need to discern subtle sentiment shifts conveyed through
long/complex linguistic structures.

The chosen feature extraction methods were designed with a forward-thinking approach,
prioritising the integration of diverse text representation techniques. The design and
implementation cover the complexities of combining various preprocessing steps, the
mathematics behind the calculations, scaling, and integration, accommodating a wide range of
technique combinations without sacrificing computational efficiency.
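
As an example of the kind of configuration described above, the sketch below sets up a TF-IDF
vectorizer with an n-gram range and frequency thresholds; the parameter values shown are
illustrative assumptions, not the tuned values used in the experiments.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams, to capture short phrases
    min_df=2,              # ignore terms appearing in fewer than 2 tweets
    max_df=0.9,            # ignore terms appearing in more than 90% of tweets
    sublinear_tf=True,     # dampen the effect of very frequent terms
)
X_train = tfidf.fit_transform(train_texts)   # learn the vocabulary on training data
X_test = tfidf.transform(test_texts)         # reuse the same vocabulary on unseen data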

4.3.3 Sentiment Classification Design


The core of the classification strategy is a test of various models: Naive Bayes variants,
Random Forest, SVM, and VADER, each selected for its strengths in SA. This diverse
experimentation ensures robust and comprehensive coverage capable of interpreting
sentiment expressions within high-dimensional, sparse political text. The final design will
retain the best-performing model, but the implementation employed the following before
finalising one.

• Naive Bayes Classifiers: Selected for its proven effectiveness in text classification,
the Multinomial and Gaussian Naive Bayes classifiers are particularly adept at
processing high-dimensional data efficiently. The model’s performance is tweaked by
adjusting TF-IDF parameters enhancing its ability to detect contextually significant
terms.

• Random Forest Classifier: chosen for managing complex features without


succumbing to over-fitting. Tree parameters are fine-tuned to strike a balance between
computational efficiency and model accuracy, for practicality and effectiveness in large
datasets.

• SVM, is especially known for its effectiveness in high-dimensional spaces such as


text classification. It operates on margin maximisation, striving to find the optimal
hyperplane that distinctly separates different classes with the widest possible margin.
In this project, where nuanced textual data requires sophisticated decision boundaries,
SVM was employed with a linear (and rbf) kernel to focus on linear separability within
the feature space. The selection of a linear kernel also aimed to balance between model
complexity and computational demand, ensuring efficient training and prediction
phases without compromising on the classifier's power. SVM's capacity to handle the
intricate patterns in the dataset allows it to classify sentiments even in the presence of
sparse data, demonstrating a strong ability to capture subtle differences across textual
features, which is very useful in a political context.

• VADER is implemented for its specialised capability in handling sentiments expressed


in social media texts, leveraging its built-in lexicon and rules to directly analyse text
without the need for extensive training. This choice is particularly pertinent for
analysing short, impactful expressions prevalent in political commentary, and to use
as a baseline comparison.
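
A minimal sketch of how these candidate models could be instantiated is shown below; the
hyper-parameter values are illustrative defaults rather than the tuned settings, and the VADER
import assumes the vaderSentiment package.

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

models = {
    "multinomial_nb": MultinomialNB(),
    "gaussian_nb": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm_linear": SVC(kernel="linear"),
    "svm_rbf": SVC(kernel="rbf"),
}
vader = SentimentIntensityAnalyzer()   # lexicon/rule-based baseline, needs no training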

Figure 4.4: Implemented Pipeline



4.4 Other Design Justifications


4.4.1 Using Boolean Logic
Using boolean flags as toggles for the various preprocessing steps enhances adaptability,
ensuring the model's robustness across different datasets and contexts. The design philosophy
extends to efficiency and iterative experimentation, where boolean combinations enable the
systematic assessment of each preprocessing step's impact on model performance. This
methodical approach does more than identify and apply the most effective preprocessing
configurations: it enhances model accuracy and computational efficiency while also supporting
scalability. The modular nature of the preprocessing pipeline facilitates straightforward
extensions or modifications, keeping the system scalable and adaptable in a changing
environment.
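
A minimal sketch of this toggle idea is shown below; the function and helper names
(expand_colloquial_terms, mark_negation) are illustrative rather than the project's exact API.

import re

def preprocess(text, lowercase=True, remove_special=True,
               expand_colloquial=False, handle_negation=False):
    # Each flag switches one preprocessing step on or off, so trial
    # combinations (as in Experiment 1) can be run systematically.
    if lowercase:
        text = text.lower()
    if remove_special:
        text = re.sub(r"[^a-z0-9\s#@]", " ", text)   # keep hashtags and mentions
    if expand_colloquial:
        text = expand_colloquial_terms(text)         # hypothetical helper
    if handle_negation:
        text = mark_negation(text)                   # hypothetical helper
    return text

# Trial 2 of Experiment 1 would correspond to:
cleaned = preprocess(tweet, lowercase=True, remove_special=True)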

4.4.2 Dataset Design


The datasets are collected from the SemEval-2016 Task 4 GitHub repository and from the
research of Kawintiranon et al. (2021) (19). The datasets were mixed together and cleaned
to remove unnecessary columns.

4.4.3 Model Selection and Customisation


The rationale behind this diversified model selection is to use the distinct advantages of
each component, ensuring a holistic and nuanced SA. The customisation of each model,
be it through parameter tuning, architecture adjustments, or preprocessing techniques,
is guided by the specific challenges presented by SA. These include the need to process
vast amounts of high-dimensional data, the requirement to capture both overt and subtle
sentiment expressions, and the necessity for adapting to the varied linguistic styles of
political discourse.

For more details on the tools and libraries used, decisions behind using them,
refer to appendix A.4

4.5 Testing
Systematic testing ensures that each component operates correctly and yields accurate
results. Firstly, unit tests were used for individual modules such as the DataProcessor,
FeatureExtractor, and ClassifierManager. These tests include checks for the functionality
of text preprocessing, feature extraction accuracy, and the robustness of each classifier.
For instance, the DataProcessor class is tested to verify that text normalization, stop-word
expansion, and negation handling are performed correctly. Using test cases with manually
predefined outputs, each function was validated to ensure correctness.
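
An illustrative unit test in this style is sketched below; the DataProcessor method names follow
the description in Section 4.3.1, while the expected outputs are assumptions for the example.

import unittest

class TestDataProcessor(unittest.TestCase):
    def test_colloquial_expansion(self):
        dp = DataProcessor()
        # "gonna" should be expanded to its formal equivalent
        self.assertEqual(dp.expand_colloquial("gonna vote"), "going to vote")

    def test_negation_marking(self):
        dp = DataProcessor()
        # a negated adjective should receive the NEG_ prefix
        self.assertIn("NEG_", dp.mark_negation_dep("this is not good"))

if __name__ == "__main__":
    unittest.main()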

Integration testing then follows, where the combined operation of multiple units is tested
as a whole in the Retrieve class. This ensures that data flows correctly between processes
and that the system behaves as expected when all components interact. For example, the
output of the FeatureExtractor should seamlessly integrate into the classifiers within the
ClassifierManager without data mismatch or loss. This exemplifies end-to-end tests that
simulate the processing of raw input data through to sentiment classification, validating the
system’s ability to handle real-world data under controlled test conditions*(Refer to A.2).

Finally, regression testing is performed whenever changes are made to the code-base,
ensuring that new code does not adversely affect existing functionalities. By adhering to
these meticulous testing protocols, the project aims to deliver a robust and reliable sentiment
analysis system that consistently produces valid and precise results.

Note: The robustness and consistency of the models were further validated by conducting
multiple trials. Each was tested 5 times and the results averaged, to ensure the reliability of the
results presented in Chapter 5.
Chapter 5

Results and Discussion

This chapter explores the results of the various tests done to methodically improve different
parts of the aforementioned sentiment pipeline. Each sectioned experiment details its
purpose and the background supporting the reasons driving the development.

5.1 Experiment 1: Impact of Preprocessing Techniques


5.1.1 Experimentation Purpose
Preprocessing is the primary step for data preparation and cleaning before feature
extraction. The effects of different combinations at this step ripple through significantly to
the sentiment polarity, as supported by Krouska et al. (2016) (21). Therefore, by measuring the
impact of combinations drawn from a set of implemented methods, understanding their
performance differences helps with optimising the overall SA.

5.1.2 Implementation
The first iteration of the implementation was straightforward, focusing on removing and
condensing words. However, as the project evolved through further research and discussion,
the preservation of key political, negative and/or colloquial terms was understood to be
crucial. To this end, the system was enhanced to not only preserve these terms but also
to apply dynamic stop-word management, thus avoiding the loss of contextually significant
terms. Additionally, negation handling was first applied only to the negation terms themselves
("Not", "Neither", "No") and then refined to mark dependencies and modify neighbouring verbs,
adverbs and adjectives to reflect the true sentiment more accurately.


Table 5.1: Experiment 1 Implementation Steps

Trial 1: Just lower-casing


Trial 2: +Removing Special Characters
Trial 3: +Stopwords, Level 1
Trial 4: +Stopwords, Level 2 (Reverted to level 1 at the next step)
Trial 5: +Colloquial Expansion
Trial 6: +Negation

5.1.3 Results and Impact


Various Scores to show the impact of preprocessing: (X-Axis: Steps Taken, Y-Axis: Result)

Figure 5.1: Metrics Graphed For Naive Bayes and SVM (Trials 1-6)

• Trial 1 to Trial 2: Comparing values of accuracy, precision and recall, a


baseline was established with lower-casing words to have consistency, reducing noise
and vocabulary size. These can lead to less over-fitting and better generalisation
but sentiment conveyed with capitalised words is forgone. The addition of special
character removal in Step 2 removed more noise which can help classifiers focus on more
meaningful content. The relatively stable performance of SVM and slight improvement
in NB’s precision suggests that both classifiers benefit from cleaner data.

• Trial 2 to Trial 3: A base stop-word list with dynamic expansion is introduced in
Trial 3, which removes common, less informative words. However, the slight decrease
in performance for both classifiers suggests that the stop-word list may have removed
some words that were important for understanding the sentiment, indicating that the
stop-word list needs fine-tuning to carefully avoid discarding useful context.

• Trial 3 to Trial 4: The stop-word list is increased to level 2, which causes a significant
performance drop for both classifiers. This indicates that it is overly aggressive, stripping
the text of critical discriminative features and leading to a loss of contextual information
necessary for accurate sentiment analysis; this setting should be avoided and was reverted.

• Trial 4 to Trial 5: Stop-word removal is scaled back and colloquial expansion is
enabled in Trial 5. This gave a large recovery in the performance metrics, showing that
expanding contractions to their full form provides more clarity and potentially more
features for the classifier to use.

• Trial 5 to Trial 6: Trial 6 shows the best performance so far, with the correctly
implemented methods in place. The final addition of negation handling significantly
affects sentiment analysis, as it changes the sentiment polarity of phrases, and the results
confirm that it works correctly. Recognising negations allows the models to better
understand the context and sentiment of the comments, leading to more accurate
classifications.

Figure 5.2: Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values)

Moving on to focusing on the F1-Scores and the preprocessing time taken for each trial:

• Initially, across Trials 1-3, the preprocessing time fluctuated, starting at 26.11 seconds
and peaking at 29.37 seconds. The introduction and expansion of the stop-word list
in Trial 3 and other fairly intensive methods cost on average 3 seconds more. During
these phases, the F1-Scores for both SVM and NB remain relatively similar, with NB
experiencing a slight increase from 0.516 to 0.519, then a minor drop to 0.513; SVM's
F1-Score shows a similar trend. These modest changes suggest that the addition of
special character removal and dynamic stopwords had a marginal but not substantial
impact on the classifiers' ability to discriminate sentiment.

• In Trial 4, where the stop-word list was at level 2 (Table 5.1), F1-Scores plummeted
to 0.393 for NB and 0.411 for SVM despite the preprocessing time decreasing slightly to
25.85 s. This indicates that aggressively removing stopwords can degrade the
classifiers' performance, likely by eliminating contextually important words.

• Trials 5 and 6 show a rebound in F1-Scores as the preprocessing parameters


were adjusted to be less aggressive and more nuanced, with Trial 5 seeing a slight
improvement and Trial 6 recovering fully with scores of 0.544 for NB and 0.617
for SVM. Conversely, the preprocessing time nearly doubled to 49 seconds. This
recovery in F1-Scores alongside more extensive preprocessing suggests that the quality
and appropriateness of preprocessing steps, like expanding contractions and applying
negation handling, are critical to enhancing model performance.

The impact of these 6 trials in Experiment 1 identifies the best combination of preprocessing
steps (summarised at the end) to use for the rest of the implementation. This combination is
supported by strong metrics, as discussed earlier. Results from both NB and SVM further
suggest that this combination would give a similar positive progression for other classifiers.

5.2 Experiment 2: Comparing Feature Extraction Techniques


5.2.1 Experimentation Purpose
Feature extraction is a crucial step preceding classification. By evaluating different feature
extraction methods, we can identify the most effective techniques for extracting relevant
features from text data, thereby enhancing the performance of the classification model.

5.2.2 Implementation
The feature extraction trials comprise Bag of Words (BoW), TF-IDF, Word2Vec
and Part-of-Speech (POS) Tagging (see the table below).

• The FeatureExtractor class initialises a CountVectorizer when self.use_tfidf is set
to False. This vectorizer is set up to convert text into a frequency-based BoW
representation. The fit_vectorizer method is used to learn the vocabulary of the training
data and the transform_data method applies this vectorizer to any given dataset.

• Similarly, if self.use_tfidf is True, a TfidfVectorizer is initialised, making use of
TF-IDF instead. As with BoW, the fit_vectorizer and transform_data methods are
used to prepare and apply the TF-IDF transformation to the text data.

• POS tagging and Word2Vec are integrated into the preprocessing steps. In the
DataProcessor class, the pos_tagger and train_word2vec methods tag words with their
respective tags using the spaCy and Gensim libraries. The vectors or the counts of
POS-tags are then potentially included in the feature set by the combine_features
method of the FeatureExtractor class (sparse matrices stacked horizontally; a sketch of
this idea follows).
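
The sketch below illustrates this horizontal stacking of sparse feature matrices; the variable
names (train_texts, pos_count_dicts) are assumptions for the example.

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(train_texts)        # sparse matrix (n_docs, n_terms)

pos_vec = DictVectorizer()
# e.g. [{"NOUN": 3, "VERB": 2, "ADJ": 1}, ...] with one dict per tweet
X_pos = pos_vec.fit_transform(pos_count_dicts)    # sparse matrix (n_docs, n_tags)

X_combined = hstack([X_tfidf, X_pos])             # (n_docs, n_terms + n_tags)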

Trial 1: Bag Of Words (BoW)


Trial 2: +TF-IDF (Used instead of BoW)
Trial 3: +Pos-Tagging
Trial 4: Word2vec

5.2.3 Results and Impact


Scores of Various Feature Extractions: (X-Axis: Steps Taken, Y-Axis: Results)

Figure 5.3: Metrics Graphed For Naive Bayes and SVM

• In Trial 1, using BoW both NB and SVM show similar accuracy levels, with NB at
58.9% and SVM at 58.6%. This suggests that BoW provides a solid baseline, capturing
the frequency of terms effectively for both models and good feature to begin with.

• In Trial 2, with TF-IDF, there is an increase in SVM’s accuracy to 62.3%, while


NB remains almost unchanged at 58.8%. The precision and recall for SVM also see
a substantial increase, with precision jumping to 65.3% and recall to 60.1%. This
implies that the TF-IDF representation, which weighs term frequency against the term’s
document frequency, enhances SVM’s performance, allowing it to better discriminate
between classes by giving more importance to less frequent, more informative terms.

• In Trial 3, the addition of POS-tagging to TF-IDF marginally enhances SVM’s


accuracy and precision (to 62.1% and 64.7%, respectively), yet it has a slightly negative
impact on NB, dropping the accuracy to 55.4%. This could be due to SVM's ability
to leverage the additional syntactic information provided by POS-tagging, whereas NB's
probabilistic model assumes feature independence and is therefore affected negatively.

• Trial 4 adds Word2Vec with TF-IDF and POS-tagging, which results in a slight
dip in SVM’s performance but a larger decrease in NB’s metrics especially accuracy
dropping to 37.6%. This massive drop for NB could indicate that the dense, continuous
vector space representations from Word2Vec might be conflicting with NB’s expectation
of discrete feature representation, while SVM still manages to handle the increased
complexity of the feature space.

Figure 5.4: Time & F1-Scores (X-Axis: Steps Taken, Y-Axis: Values)

Comparing the F1-Scores, we see a steady increase for SVM from 0.579 to 0.6 through Trials
1 to 4, which shows a consistent improvement in the balance between precision and recall,
due to SVM’s effectiveness in high-dimensional spaces and complex feature interactions. On
the other hand, NB's F1-Score decreases after Trial 1, hitting a low of 0.373 in Trial 4, which
aligns with the changes in accuracy, precision, and recall discussed above.
Time fluctuates across the trials, taking the longest in Trial 1 at 66.64 seconds and the shortest
in Trial 2 at 48.98 seconds. When Word2Vec is activated, in Trial 4, time increases to 61.92
seconds, which involves a computationally intensive process to produce word embeddings.
This additional effort, however, did not translate to improved performance for NB, and while
SVM maintains high F1-Scores, the minimal changes suggest diminishing returns for the
added complexity in feature extraction.
Overall, the integration of more sophisticated feature extraction techniques appears to
benefit SVM, which is well-equipped to handle high-dimensional and dense feature spaces,
while NB seems to be better suited to simpler, less dense representations.

Feature Extraction Summary (best scores chosen):

Preprocessing selected: Lower-casing, Removing Special Characters, Stopwords (Level 1), Colloquial Expansion & Negation.
Feature extraction selected: TF-IDF and POS-Tagging.

Metric | NB | SVM
Accuracy | 55.40% | 62.10%
Precision | 60.80% | 64.70%
Recall | 54.00% | 59.90%
F1-Score | 53.50% | 60.40%

Note Before Experiments 3 & 4: The datasets used are identical, and details of the
classifier categorisation for the following experiments are in Section 3.4.1.

Implementation for Experiments 3 and 4:

The linear classifiers used are Naive Bayes, Logistic Regression and Linear-SVM.
The non-linear classifiers used are Random Forest, RBF-SVM and Decision
Tree. These models are instantiated and managed through the ClassifierManager
class. After vectorization (Experiment 2), the data is transformed into a format
suitable for model training; steps like converting the sparse output to a dense array
using .toarray() are done here too. The specific implementations, covered in Chapter
3, diverge in the model training phase due to the inherent differences in how these
classifiers handle data separation and complexity.

One of the baselines used to compare upcoming trials:


Table 5.3: "most frequent" Strategy

Metric | Dummy Classifier Evaluation on Dev Set | Dummy Classifier Evaluation on Test Set
Accuracy | 0.349 | 0.375
Precision | 0.116 | 0.125
Recall | 0.333 | 0.333
F1-score | 0.173 | 0.182

For the Dummy Classifier confusion matrix, refer to A.5.
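
For reference, a sketch of how such a "most frequent" baseline can be produced with scikit-learn
is shown below; the data variables are assumptions.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)                        # always predicts the majority class
print(classification_report(y_dev, dummy.predict(X_dev), digits=3))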

5.3 Experiment 3: Comparison of Linear Classifiers


5.3.1 Experimentation Purpose
This experiment compares different linear classifiers under uniform settings to better
understand linear separation between classes and to determine which model is best for this
use. It draws attention to the advantages and disadvantages of each classifier by evaluating
the models with a variety of performance measures. This thorough evaluation helps choose
the optimal model and judge the suitability of each classifier.

5.3.2 Results and Impact


(Exp 3) | NB (Trial 1) | Logistic Regression (Trial 2) | Linear SVM (Trial 3)
Time | 0.34 s | 1.75 s | 2008.81 s
Accuracy | 0.554 | 0.617 | 0.621
Precision | 0.608 | 0.649 | 0.647
Recall | 0.540 | 0.593 | 0.599
F1 Score | 0.535 | 0.597 | 0.604

• Trial 1, Naive Bayes (NB): In comparison to the most frequent dummy


classifier, NB classifier exhibits a huge improvement in sentiment classification
accuracy, achieving over 55% on development and test datasets, marking a
significant advancement from the baseline’s performance below 37.5%. Notably,
NB showcases a balanced understanding of negative sentiments, evident in
its precision and recall metrics surpassing 60%, which contrasts sharply with
the uniform recall of 33.3% from the dummy classifier’s simplistic approach
of predicting the most frequent class. However, while NB demonstrates high
precision in recognising positive sentiments, reaching 75%, its recall for positive
sentiments is notably lower at around 29%, suggesting a tendency towards
carefulness, potentially resulting in missed positive sentiment instances. The
F1-score for neutral sentiments peaks at 61% on the test set, underscoring NB’s
adept utilisation of extracted features, indicating its ability to discern nuanced
sentiment.

Despite its notable strengths, the NB classifier reveals areas for optimisation,
particularly in distinguishing positive sentiments without misclassifying them as
neutral. Although NB performs much better than the baseline, the macro-average
F1-scores, while consistently better than the dummy classifier, signal the need for
fine-tuning to enhance positive sentiment detection without compromising the
accurate classification of neutral or negative sentiments.

The precision-recall curve reveals that the Naive Bayes model performs best
in identifying positive sentiments, with an Average Precision (AP) of 0.70,
followed by neutral sentiments (AP=0.58) and negative sentiments (AP=0.56).
As recall increases, precision tends to decrease, reflecting the inherent trade-off
between these metrics. The learning curve complements this analysis by

showing that the model’s performance improves significantly as the training data
size increases, precisely up to around 2000 examples. Beyond this point, the
testing and cross-validation scores plateau, suggesting that adding more data
may not substantially enhance the model’s capabilities. Notably, the small gap

between the testing and cross-validation scores indicates that the Naive Bayes
model is not over-fitting the data, which is a desirable trait. However, the
precision-recall curve highlights potential areas for improvement, particularly
in recognising negative and neutral sentiments more accurately. Combining
these insights, efforts could be directed toward feature engineering, exploring
alternative models, or getting more diverse training data to enhance the model’s
overall performance across all sentiment classes.

The Confusion Matrix shows strong performance


in classifying neutral sentiments, with 1676
instances correctly classified, followed by
positive sentiments with 734 instances correctly
classified. However, the classifier struggles more
with negative sentiments, with only 631 instances
correctly classified. The negative sentiments are
misclassified as neutral (738 instances) and vice
versa (220 instances), suggesting challenges in
distinguishing between these two classes. Similarly,
there are notable misclassifications between positive
and neutral sentiments (1014 instances), indicating
potential overlap or ambiguity in the features
used to differentiate these classes.

• Trial 2, Logistic Regression (LR): LR classifier also outperforms the dummy


classifier, particularly in the development set where the accuracy increased from
34.9% to 59.9%, a significant boost of approximately 25%. This enhancement is
also evident in the precision and recall metrics, where LR achieved 64.1% precision
and 60.1% recall, compared to the dummy’s 11.6% precision and 33.3% recall,
respectively. These improvements suggest a robust model’s ability to distinguish
between classes more effectively than simple frequency-based predictions. The
test set also mirrored this enhancement, with accuracy reaching 61.7%, up by
23.7% from the dummy’s performance, alongside both precision and recall. These
results likely stem from LR’s capability to model the decision boundary between
classes more dynamically than the naive approach.

The learning curve of the Logistic Regression model demonstrates a steady increase in the
testing score as more training examples are utilised, eventually plateauing around a score of
0.7 with approximately 4,000 test examples. This plateau suggests that adding more training
data beyond this point might yield diminishing returns unless further modifications to model
complexity or feature engineering are implemented. The convergence of the cross-validation
score with the testing score indicates good model generalisation without overfitting, as both
scores stabilise at the same performance level.

The Precision-Recall curve for LR model reveals a relatively high level of


precision maintained across varying thresholds of recall, particularly for the
positive class which showed the highest average precision (AP) of 0.77. This
indicates that the model is capable of retrieving a higher fraction of relevant
instances among the retrieved instances, which is vital in scenarios where the cost
of false positives is high. The graph suggests that LR effectively balances recall
and precision, managing trade-offs well, which is critical in practical applications
where both identifying as many relevant samples as possible (high recall) and
maintaining the relevancy of these identifications (high precision) are important.

The confusion matrix of LR shows relatively strong performance in classifying neutral
sentiments, with 1567 instances correctly classified, followed by positive sentiments with
1280 instances correctly classified, suggesting a robust understanding of the features
associated with these sentiment classes.

But, the model struggles more with


negative sentiments, with only 536
instances correctly classified. Negative
sentiments are misclassified as neutral
(738 instances) and vice versa (155
instances), indicating potential challenges
in discerning nuanced negative sentiment
cues from the text data. Similarly, there
are misclassifications between positive
and negative sentiments (621 instances),
suggesting overlapping or ambiguous
features between these classes indicating
areas where the model’s performance could
be improved.

• Trial 3, Linear SVM: The SVM classifier performed the best so far, evident by
its results. On dev set, it achieves an accuracy of 59.5%, precision of 63.9%, and
recall of 59.7%, resulting in an F1 score of 60.2%. Comparatively, on the test set,
the accuracy slightly increases to 62.1%, with a similar precision of 64.7% and a
modest improvement in recall to 59.9%, culminating in an F1 score of 60.4%. This
improvement suggests that SVM’s robustness to variations in data, particularly
with a higher number of test examples, may have contributed to a slightly
better generalisation on unseen data. The performance suggests that while
SVM is relatively effective at differentiating between classes (especially positive
sentiments), it struggles with false positives and negatives, particularly for
the neutral and negative classes, which could be due to overlapping sentiment
features that are challenging to linearly separate.

The learning curve of the SVM shows strong performance consistency, where the score
plateauing around 0.9 suggests that adding more test examples does not significantly change
the classifier's ability to generalise. This is contrasted by the cross-validation score's steady
increase, indicating that the classifier benefits from more training data, reducing over-fitting
and enhancing its predictive accuracy on unseen data.

The precision-recall curve indicates that SVM performs best at classifying


positive sentiments with an average precision (AP) of 0.78, followed by negative
sentiments at 0.63 AP and neutral at 0.60 AP. This variation in performance
across classes could be attributed to the distinctiveness of features associated
with positive language that SVM can more effectively use, whereas neutral and
negative sentiments might share more ambiguous or overlapping lexical features,
complicating their classification. The learning curve demonstrates a high training
score plateauing around 0.9, contrasting with a steadily increasing cross-validation
score, which starts from 0.5 and reaches up to 0.7 with increased training
examples. This disparity suggests that SVM, while capable of fitting well to
the training data, faces challenges in achieving similar performance on validation
data, hinting at potential over-fitting despite regularisation attempts in the
code.
The confusion matrix shows increased
correct neutral sentiments predictions as
the dataset size increases, evidenced by
1531 correct classifications out of 2059
possible in the test set. But, the classifier
struggled more with negative sentiments,
often misclassifying them as neutral. This
might be due to overlapping features
between negative and neutral classes that
are not distinctly separable by SVM
in feature space. This misclassification
between these classes highlights an area for
improvement.

(Exp 4) | SVM-rbf (Trial 1) | Decision Trees (Trial 2) | Random Forest (Trial 3)
Time | 780 s | 24.97 s | 519.94 s
Accuracy | 0.490 | 0.517 | 0.586
Precision | 0.539 | 0.526 | 0.662
Recall | 0.454 | 0.501 | 0.556
F1 Score | 0.422 | 0.504 | 0.555

5.4 Experiment 4: Comparison of Non-Linear Classifiers


5.4.1 Experimentation Purpose
Experiment 4 differs from Experiment 3 by investigating how well non-linear classifiers
perform in capturing intricate interactions and patterns that are difficult to categorise
using linear techniques. This analysis highlights the drawbacks of linear models, which are
unable to represent the non-linear decision boundaries required for complex datasets.
Experiment 4 seeks to find models that perform better in environments with high
levels of complexity and feature inter-dependency by thoroughly comparing different
non-linear classifiers.

5.4.2 Results and Impact

• Trial 1, rbf SVM: SVM-RBF’s accuracy is 50.8% on the dev set and slightly
lower at 49% on the test set. This modest performance highlights potential
issues with either the model’s ability to generalise or perhaps its suitability to
the dataset’s characteristics. Notably, the classifier performs well in identifying
the neutral class but struggles significantly with the negative class, indicating a
potential bias towards classes with more data or more distinct feature sets. The
precision for the negative class on the development set stands at 72%, which
plummets to 53% on the test set, showing a loss in the model’s confidence when
generalised to new data. This drop could be due to the model over-fitting to
the negative examples in the training data or the variability within the negative
class that isn’t captured fully by the model. The model fails to capture most of
the actual negative sentiments, with only 15% recall for the negative class on the test
dataset, likely missing subtler cues that define negative sentiment.

The learning curve reveals that the performance on cross-validation improves


consistently as more data points are introduced, which is a positive sign of the
model’s learning capability. However, testing score plateaus, suggesting that
adding more data doesn’t necessarily translate to better generalisation. This
plateau could point to limitations in the feature set or the need for more complex
or different types of features to capture the nuances of sentiment effectively.

The precision-recall curve suggests that while the classifier can maintain a
reasonable precision rate as recall increases (especially for the positive class),
the trade-off becomes stark as the curve steepens, particularly for the negative
class. This steep drop-off indicates that achieving higher recall substantially
compromises precision, a typical sign of a model struggling with class imbalance
or lacking robust features to differentiate classes effectively.

The Confusion Matrix shows a high


degree of confusion between classes,
particularly between the neutral and
negative classes. This confusion might be
due to overlapping features between these
classes or insufficient model complexity to
capture the nuances distinguishing these
sentiments.

• Trial 2, Decision Tree (DT): DT has a balanced total accuracy of 50.3% on


the dev set and 51.7% on the test set. This modest level of accuracy, alongside
precision and recall averages around 52.3% and 50.2% respectively, indicates that
while the model has a reasonable grasp on the data, it struggles to achieve high
predictive power, typical of decision tree models. F1-scores average 50.6% on
the dev set and a lower 50.4% on the test set, suggesting that there is a balanced
trade-off between precision and recall. These scores also highlight the classifier’s
limitations in handling complex patterns. Also, its precision for the ’neutral’ and
’positive’ classes is relatively consistent, yet the recall rates differ significantly.
So, the model is better at identifying ’neutral’ sentiments compared to ’positive’
ones. This could be due to the characteristics of decision trees that tend to focus
on features that provide the most information gain early in the tree, potentially
skewing the model’s sensitivity towards classes that are easier to distinguish.

The learning curve indicates that DT's training performance is consistently high (red line)
and does not improve with more data. This plateau suggests that adding more training
examples does not benefit the model, due to DT's nature of fitting closely to its training
data. Conversely, the cross-validation score (green line) shows improvement as more test
examples are used, which suggests that the model could generalise better with more data,
but this improvement is still modest.
The precision-recall curve shows a substantial decline in precision as recall increases, which
is typical for classifiers dealing with imbalanced classes or overlapping class distributions.
For the 'negative' class, the area under the curve (AP=0.37) is notably lower compared to
the 'positive' class (AP=0.46), indicating that the model struggles more with correctly
identifying true negatives without increasing false positives.

The Confusion Matrix shows more


false positives and negatives for ’neutral’
sentiments, due to he overlapping nature
of ’neutral’ sentiment features with other
classes- neutral statements might share
vocabulary with positive or negative
sentiments without carrying a definitive
sentiment themselves. So, DT’s simplistic
approach to decision-making, which relies
on thresholding feature values, might not
capture the nuances well.

• Trial 3, Random Forest (RF): RF, unlike the other models, has precision reaching
as high as 85% on the development set, indicating its strong capability to correctly
identify negativity without many false positives. However, the recall for the
same category is considerably lower at 49%, suggesting that while the classifier
is precise, it fails to capture a significant portion of the actual negative cases,
highlighting a balance between precision and recall but also a need for improvement
in recall performance. The model performs best with neutral sentiments
on the development set, achieving a precision of 46% and a recall of 80%, which
results in the highest f1-score of 59% among the three sentiment classes. So
RF is particularly adept at identifying the broader spread of neutral sentiments,
possibly due to its ensemble nature allowing it to generalise better across more
varied but less extreme sentiment expressions.

The Learning curve of RF shows a high plateau for testing scores, remaining
around 90%, while the cross-validation score increases with more test examples.
This suggests that the model could potentially improve with more training data,
as indicated by the rising green line. The stability of the testing score (red line)
at high levels demonstrates the model’s robustness and its ability to maintain
performance despite increased complexity or data size.

The Precision-Recall illustrates a significant drop-off in precision as recall


increases, which is typical in classification tasks as the effort to cover more true
positive cases often results in accepting more false positives. The area under the
curve (AP) scores of 0.63 for negative, 0.61 for neutral, and 0.73 for positive
indicate a stronger performance in distinguishing positive sentiments compared
to negative and neutral, which aligns with the higher precision observed.

The Confusion Matrix shows that


while high precision indicates fewer false
positives, the model struggles with false
negatives, particularly in the negative
and positive classes. For example, a
substantial number of negative sentiments
are misclassified as neutral, reflecting the
model’s conservative stance on classifying
sentiments as negative unless the data
strongly suggests otherwise. This could be
attributed to the inherent randomness of
the forest, where diverse trees might lead
to conservative majority votes.

The combined impact of Experiments 3 and 4 shows that the choice between
linear and non-linear classifiers should be guided by the nature of the data, the
complexity of the decision boundaries, and the specific performance metrics that are
prioritised for a given NLP task. For example, while some classifiers achieve high
precision, they often do so at the expense of recall, and vice versa (this trade-off
was particularly evident in models like decision trees and SVM with RBF kernels).
Leveraging ensemble techniques such as boosting and bagging can help improve the
performance of decision trees and potentially other classifiers by reducing variance
and bias. These insights facilitate a more informed model selection process.

Final selected Model: Linear: SVM , Non-Linear: Random Forest

Other detailed model results are in the appendix A.6

5.5 Experiment 5: Effects of Dataset


5.5.1 Experimentation Purpose
The experiment investigates the impact of dataset size, composition and mixing on
classifier performance, to determine their influence on the effectiveness of the optimal
model identified in the prior analysis.

5.5.2 Implementation
• Combine Mode: combines the training and development data into a single
dataset. It then splits this combined dataset into new training and development
sets using train_test_split. The test data is kept separate; 90% of the combined
dataset is used for the new training set and 10% for the new development set.

• Mixing Mode: combines the development and test data into a single dataset.
This combined dataset is then split into new development and test sets using
train_test_split; 50% of the combined dataset is used for the new development
set and the remaining 50% for the new test set.

• Pooling Mode: combines all the data (training, development, and test) into a
single dataset. It then splits this combined dataset into new training and test
sets using train_test_split (80% of the combined dataset is used for the new
training set and 20% for the new test set). A sketch of these three strategies
follows the list.
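
A sketch of the three strategies is given below; the split ratios follow the text above, while the
DataFrame variables are assumptions for the example.

import pandas as pd
from sklearn.model_selection import train_test_split

# Combine: merge train + dev, keep the test set untouched, re-split 90/10
combined = pd.concat([train_df, dev_df])
new_train, new_dev = train_test_split(combined, test_size=0.10, random_state=42)

# Mixing: merge dev + test, re-split 50/50
dev_test = pd.concat([dev_df, test_df])
new_dev2, new_test = train_test_split(dev_test, test_size=0.50, random_state=42)

# Pooling: merge everything, re-split 80/20
pooled = pd.concat([train_df, dev_df, test_df])
pool_train, pool_test = train_test_split(pooled, test_size=0.20, random_state=42)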

5.5.3 Results and Impact


To ensure a fair measurement, the same datasets as in previous experiments were used,
together with the best classifier so far, linear SVM.

Figure 5.5: SVM Scores of various mixing methods

• Combine Method: Has high recall for the negative sentiment class (0.89),
indicating that the model was highly effective in identifying negativity, but at
the cost of precision (0.39), reflecting a high number of false positives, likely
because the training set was overly representative of negative examples. This is
supported by a relatively lower performance in identifying neutral and positive
sentiments, as indicated by the precision and recall values for these classes as per
A.6. The overall accuracy of 49.3% and the extended processing time of 3163.67
seconds further suggest inefficiencies, possibly due to model complexity or
over-fitting on the data distribution.

• Mixing Method: Demonstrated a more balanced performance across sentiment


classes with precision and recall scores around 60% for all categories. The
relatively even distribution of data between the development and test datasets
could have contributed to a more generalised model that handled varied inputs
better. The shorter SVM processing time (1127.74 seconds) compared to the
Combine method suggests a more efficient learning process, possibly because the
model was trained on a more balanced dataset which might facilitate quicker
convergence during training.

• Pooling Method: The best performing method, achieving an accuracy of 67.8%
and an F1-score of 68.3%. Merging and using all available data for training and
testing likely provided the most comprehensive representation of the feature space and class
distribution. This comprehensive exposure to the data could have enabled the
model to better generalise across unseen data, thus improving both precision and
recall uniformly across categories. Despite the longest SVM processing time of
3634.77 seconds, computational cost paid off with superior model performance,
indicating a better balance between learning complexity and predictive capability.
Chapter 6

Conclusion and Further Work


6.1 Conclusion
Table 6.1: Summary of Experiment 1

Preprocessing Method | Effectiveness | Use Case
Lower-casing | Creates a baseline by reducing noise and vocabulary size. | Good for establishing consistency in text data.
Removing Special Characters | Removes noise, helping classifiers focus on more meaningful content. | Effective in cleaning data to improve clarity for analysis.
Stopwords, Lvl 1 | Slightly decreases performance; stopwords need to be well-selected. | Useful when fine-tuned to preserve contextually significant terms.
Stopwords, Lvl 2 | Overly aggressive removal harms performance by stripping essential features. | Not recommended; too aggressive for sentiment analysis.
Colloquial Expansion | Expands contractions, improving clarity and feature availability for classifiers. | Highly effective in text normalisation and understanding colloquial words.
Negation Handling | Huge improvement by accurately reflecting sentiment polarity changes. | Critical for accurately capturing sentiment in phrases involving negations.

Feature Extraction | Best Use Case
Bag of Words (BoW) | Useful for initial models to establish a baseline performance.
TF-IDF | Enhances discrimination between classes, especially in SVM.
POS Tagging | Beneficial for models that can leverage syntactic structure, like SVM.
Word2Vec | Not suitable for NB; SVM can be used cautiously with adequate tuning.

Table 6.2: Summary of Experiment 2


Table 6.3: Summary of Experiment 3 and 4

Classifier | Strengths | Recommended Use
Naive Bayes | High precision, effective in identifying positive sentiments, efficient with lower processing time. | Suitable for datasets where features are independent and linearly separable; good for initial baseline models.
Logistic Regression | Good balance between precision and recall, robust in handling linear separability between classes. | Effective for binary and multi-class classification in datasets where relationships are approximately linear.
Linear-SVM | Best overall among linear classifiers, good generalisation on unseen data, strong at differentiating between classes. | Ideal for high-dimensional spaces where the margin of separation is crucial, such as in text classification.
Random Forest | High precision, good at handling complex and non-linear relationships, robust against over-fitting with ensemble approach. | Best for datasets with intricate interactions and high feature inter-dependencies, requiring ensemble learning.
RBF-SVM | Can capture non-linear relationships, though with moderate accuracy and precision. | Suitable for datasets that exhibit non-linear distribution but may require careful tuning to avoid over-fitting.
Decision Tree | Simple to understand and interpret, can handle both numerical and categorical data. | Good for initial exploratory analysis to identify significant features, but prone to over-fitting and bias.

Strategy | Usefulness
Combine | Useful when aiming to maximise the training dataset but requiring a separate set for model validation during training.
Mixing | Useful when development and test data are limited, and there is a need to re-balance or create new sets from existing data.
Pooling | Useful for maximising training data availability and using a subset of the combined data for final model testing.

Table 6.4: Summary of Experiment 5



6.2 Further Work


Chapter 5 has established a strong basis for evidencing how feature extraction
and preprocessing affect classifier performance. Future research, however, might
extend the reach by utilising more sophisticated machine learning algorithms and
classifiers that can better capture the subtleties and complexities of the data. For
example, deeper insights into contextual dependencies within the text could be
obtained by integrating ensemble methods like Gradient Boosting Machines (GBM)
or advanced neural network architectures like Transformer-based models (e.g., BERT,
RoBERTa). Furthermore, investigating alternative techniques, such as deep learning
methods or more heavily customised Support Vector Machines (SVM) with non-linear
kernels, may reveal patterns overlooked by more conventional models. Cross-validation
could also be extended to include k-fold and stratified methods, which would mitigate
the risk of over-fitting and provide a clearer comparison of model performance across
diverse data distributions.
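As an illustration of the stratified k-fold idea, the following self-contained sketch
cross-validates a linear SVM over a toy corpus; the texts, labels, and parameters are
placeholders rather than the project's data or settings.

```python
# Stratified 5-fold cross-validation of a TF-IDF + linear SVM pipeline on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great win tonight", "terrible debate", "polls open tomorrow",
         "awful turnout", "fantastic speech", "results announced at noon"] * 5
labels = ["positive", "negative", "neutral",
          "negative", "positive", "neutral"] * 5

model = make_pipeline(TfidfVectorizer(), LinearSVC())
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios per fold
scores = cross_val_score(model, texts, labels, cv=skf, scoring="f1_macro")
print(f"macro-F1 across folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```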

A more thorough examination of feature extraction and preprocessing combinations
could improve the current methodology. To identify the ideal configuration for each
classifier type, future research could concentrate on methodically testing different
combinations of preprocessing techniques with distinct forms of feature engineering.

Additionally, the effectiveness of sentiment analysis relies heavily on the quality
of the data used. This project used the standardised labelled dataset from Task 4
of SemEval 2016. While this dataset provides a strong base, it reflects linguistic
and sentiment norms that may not align with current trends, so future work should
also explore newer labelled datasets, possibly obtained by scraping more recent
social media content or curating custom datasets with current topical relevance.
Enhancing the dataset could involve not just expanding the volume but also improving
the diversity and representativeness of the samples, including different social media
platforms or forums. Such advancements would likely increase the accuracy and
applicability of sentiment analysis models in recognising and interpreting modern
linguistic nuances.

Lastly, more work could be done on specific political analysis, by analysing a
number of hand-picked tweets with varying political bias and testing the optimal
model's performance more thoroughly.
Bibliography

[1] Akdogan, A. Word embedding techniques: Word2vec and tf-idf explained, 2021.
[2] Aleksandric, A., Saha, S., and Nilizadeh, S. Twitter users’ behavioral
response to toxic replies, n.d.
[3] Algorithms, E. Text data pre-processing techniques in ml, 2023.
[4] Alvi, Ali, S. F., Ahmed, S. B., Khan, N. A., Javed, M., and Nobanee,
H. On the frontiers of twitter data and sentiment analysis in election prediction:
a review. PeerJ Computer Science 9 (August 21 2023), e1517.
[5] AminiMotlagh., Shahhoseini, H., and Fatehi, N. A reliable sentiment
analysis for classification of tweets in social networks. Social Network Analysis
and Mining 13, 1 (2023), 7.
[6] Bahrawi, N. Sentiment analysis using random forest algorithm-online social
media based. Journal of Information Technology and Its Utilization 2, 2 (2019),
29–33.
[7] Barkha, B., and Sangeet, S. Sentiment analysis using twitter data:
a comparative application of lexicon- and machine-learning-based approach.
International Journal of Engineering & Technology 7, 4 (2018), 2036–2040.
[8] Bergmeir, C., Hyndman, R., and Koo, B. A note on the
validity of cross-validation for evaluating autoregressive time series prediction.
Computational Statistics Data Analysis 120 (2018), 70–83.
[9] Breiman. Random forests. Machine Learning 45, 1 (2001), 5–32.
[10] Dataconomy. Data preprocessing steps and requirements, 2023.
[11] Dean, B. How many people use twitter in 2023? [new twitter stats], 2023.
[12] Ebrahimi, Yazdavar, A., and Sheth, A. Challenges of sentiment analysis for
dynamic events. IEEE Intelligent Systems 32, 5 (2017), 70–75.
[13] evidentlyai. Confusion matrix explanation.


[14] Fujiwara, T., Müller, K., and Schwarz, C. The effect of social media on
elections: Evidence from the united states. SSRN Electronic Journal (2022).

[15] Guha, P. Sentiment analysis on twitter data regarding 2020 us elections, 2020.

[16] Haddi, E., Liu, X., and Shi, Y. A study of the sentiment analysis techniques
in the social media. Procedia Computer Science 22 (2013), 747–752.

[17] Janke, J., Castelli, M., and Popovič, A. Analysis of the proficiency
of fully connected neural networks in the process of classifying digital images:
Benchmark of different classification algorithms on high-level image features from
convolutional layers. Expert Systems with Applications 135 (2019), 12–38.

[18] Karthika, P., Murugeswari, R., and Manoranjithem, R. Sentiment analysis of
social media network using random forest algorithm, 2019.

[19] Kawintiranon, K., and Singh, L. Knowledge enhanced masked language model for
stance detection.

[20] Khan, A., Boudjellal, N., Zhang, H., Ahmad, A., and Khan, M.
From social media to ballot box: Leveraging location-aware sentiment analysis for
election predictions. Computers, Materials & Continua 77, 3 (2023), 3037–3055.

[21] Krouska, A., Troussas, C., and Virvou, M. The effect of preprocessing
techniques on twitter sentiment analysis. In 2016 IEEE International Symposium
on Intelligent Signal Processing and Communication Systems (IISA) (2016).

[22] Margherita. Metrics for multi-class classification: An overview.

[23] Muthukrishnan, R., and Rohini, R. Lasso: A feature selection technique in
predictive modeling for machine learning, 2016.

[24] Onishi, and Shiina, H. Distributed representation computation using cbow
model and skip-gram model, 2020.

[25] Park., and Lek, S. Artificial neural networks: Multilayer perceptron for
ecological modeling. Developments in Environmental Modelling 28 (2016),
123–140.

[26] Ramos. Using tf-idf to determine word relevance in document queries, 2003.

[27] Shahapure, Ketan Rajshekhar, N. C. Cluster quality analysis using silhouette
score, 2020.

[28] Sharma. K means clustering simplified in python — k means algorithm, 2021.



[29] Tran, H. Studying the community of trump supporters on twitter during the
2020 us presidential election via hashtags #maga and #trump2020. Journalism
and Media 2, 4 (2021), 709–731.

[30] Wang, S. How to build an email sentiment analysis bot: An nlp tutorial. Toptal
Engineering Blog (2018).

[31] Wankhade, Rao, A., and Kulkarni, C. A survey on sentiment analysis methods,
applications, and challenges. Artificial Intelligence Review 55, 10 (October 2022),
5731–5780.

[32] Yu, Wang, L., Huang, H., and Yang, W. An improved random forest
algorithm. Journal of Physics: Conference Series 1646, 1 (2020), 012070–012070.

[33] Yuxing, and Shabrina, Z. Sentiment analysis using twitter data: a comparative
application of lexicon- and machine-learning-based approach. Social Network Analysis
and Mining 13 (2023), 31.

[34] Zhang, L., and Wang, S. Deep learning for sentiment analysis: A survey. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 4 (2018),
e1253.

[35] Öhman, E. The validity of lexicon-based sentiment analysis in interdisciplinary
research, 2021.

[36] My thanks to COM3110; I used the starter code from Assignment 2 for this project.
Appendix A

A.1 Vector Definition


A vector is a mathematical representation of a piece of text. It is a set of numbers,
where each number corresponds to the count of a specific word in the document. The
length of the vector equals the size of the vocabulary/dictionary, and each element
in the vector represents the presence or absence of a particular word.
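As a toy illustration (not drawn from the project's code), scikit-learn's
CountVectorizer produces exactly this kind of count vector:

```python
# Building count (bag-of-words) vectors for two short documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the vote was close", "the vote was not close at all"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # the shared vocabulary
print(X.toarray())                  # one count vector per document
```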

A.2 Distance Definitions


The Euclidean distance is the straight-line distance between two points in a Euclidean
space, while the Manhattan distance is the sum of the absolute differences between the
coordinates of the points.
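For example, for the points p = (1, 2) and q = (4, 6), the Euclidean distance is
sqrt((4 - 1)^2 + (6 - 2)^2) = 5, while the Manhattan distance is |4 - 1| + |6 - 2| = 7.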

A.3 Hardware and Software Specifications


• Development was conducted on a system with an Intel Core i7 processor and 16GB RAM
under a 64-bit Windows 10 environment. A virtual environment with 32GB RAM and an
Intel Core i9 processor was also used for preprocessing and testing.

• Python 3.11.0, with its rich library support, was the primary language used,
managed via Anaconda. Development tools included VSCode and Cursor
integrated with Git for version control, managing code changes efficiently.

A.4 Selection of Tools and Libraries


Each of the selected libraries brings specific strengths to the pipeline, from SpaCy's
efficient text processing to TensorFlow/Keras's advanced deep learning capabilities,
ensuring a robust, flexible, and comprehensive analysis. The support provided by
these libraries reduces computational load and improves code efficiency.


• SpaCy: SpaCy is utilised for its robust parsing capabilities, efficiency, and ease of
use. Tokenization, lemmatization, and part-of-speech (POS) tagging are essential
steps in the preprocessing pipeline to structure and simplify text data for analysis.
SpaCy’s optimised algorithms enable rapid processing of large volumes of text,
crucial for the dataset’s scale.

• NLTK (Natural Language Toolkit): The inclusion of NLTK, particularly its VADER
module, is justified by VADER's specialised sentiment analysis capabilities. NLTK's
pre-built sentiment lexicon and rule-based scoring mechanism provide a quick and
effective means to obtain sentiment polarity scores without the need for training.
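A minimal usage sketch (the example sentence is a placeholder, not taken from the
project's dataset):

```python
# Scoring a sentence with NLTK's VADER sentiment analyser.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-off lexicon download
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this candidate, but the debate was awful."))
# returns a dict with 'neg', 'neu', 'pos' and a 'compound' score in [-1, 1]
```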

• Scikit-learn: Chosen for its comprehensive suite of algorithms and tools for
machine learning, including text vectorization (CountVectorizer, TfidfVectorizer)
and classifiers (Naive Bayes, Random Forest). Scikit-learn is praised for
its simplicity, documentation, and community support, making it ideal for
implementing and experimenting with various traditional machine learning
models in the classification design.
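For instance, a baseline bag-of-words pipeline might look like the following sketch;
the hyper-parameters and variable names (train_texts, train_labels, dev_texts) are
hypothetical, not the project's exact configuration:

```python
# A simple CountVectorizer + Naive Bayes baseline pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

baseline = Pipeline([
    ("bow", CountVectorizer()),   # tweets -> sparse word-count vectors
    ("nb", MultinomialNB()),      # multinomial Naive Bayes classifier
])
# baseline.fit(train_texts, train_labels)        # hypothetical variable names
# predictions = baseline.predict(dev_texts)
```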

• Gensim: Employed for its Word2Vec implementation, Gensim allows us to


transform words into vector representations, capturing semantic meanings. This
feature is critical in understanding the contextual usage of words in the corpus,
enhancing the model’s ability to discern subtle sentiment nuances.
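A toy training example (tiny corpus and parameters chosen only for illustration;
Gensim 4.x parameter names assumed):

```python
# Training a small skip-gram Word2Vec model with Gensim.
from gensim.models import Word2Vec

tokenised_tweets = [["the", "rally", "was", "great"],
                    ["the", "debate", "was", "awful"]]
w2v = Word2Vec(sentences=tokenised_tweets, vector_size=50, window=3,
               min_count=1, sg=1)            # sg=1 selects the skip-gram variant
vector = w2v.wv["debate"]                    # 50-dimensional word embedding
neighbours = w2v.wv.most_similar("debate")   # nearest words by cosine similarity
```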

• TensorFlow/Keras: The decision to incorporate TensorFlow/Keras for developing
the LSTM models is grounded in these libraries' advanced neural network
capabilities and user-friendly interface. TensorFlow offers a comprehensive,
flexible ecosystem of tools and libraries for machine learning and deep learning,
while Keras provides a high-level neural networks API, facilitating the rapid and
efficient development of LSTM models. These models are justified by their ability
to capture long-term dependencies in text, crucial for analysing complex sentiment
structures in social media.
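A hedged sketch of such a model; the layer sizes, vocabulary size, and number of
classes are illustrative placeholders rather than the project's exact architecture:

```python
# A small Keras LSTM classifier for three sentiment classes.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=20_000, output_dim=100),  # token ids -> dense vectors
    LSTM(64),                                     # sequence -> fixed-size state
    Dense(3, activation="softmax"),               # negative / neutral / positive
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_sequences, integer_labels, epochs=5, validation_split=0.1)  # hypothetical names
```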

• Pandas: Selected for data manipulation and analysis, Pandas provides high-level
data structures and operations for manipulating numerical tables and time series.
It is instrumental in the data preprocessing and feature engineering stages,
offering efficient handling of data-frames and ease of integration with other
libraries.

• Matplotlib & Seaborn: These libraries are chosen for data visualisation,
enabling us to plot graphs and charts for exploratory data analysis, model
evaluation, and results presentation. Their wide range of plotting options and
ease of use make them suitable for conveying complex data insights visually.

A.5 Dummy Classifier Confusion Matrix


Development set (left) and test set (right):

    [0 1602 0]    [0 1452 0]
    [0 1957 0]    [0 2059 0]
    [0 2042 0]    [0 1976 0]

A.6 Full Report of all results from the Code


The document at http://tiny.cc/dissoutput links to all the results. (Any reference
made to a classification report, or to test or dev scores, can be found in the
respective experiment tabs.) (36)

A.7 Dummy Classifier Definition


A Dummy Classifier in sentiment analysis, using the "most frequent" strategy, always
predicts the most frequently occurring class in the training dataset. This means it
disregards the input features and simply returns the same class for all instances.
In other words, it does not actually learn anything from the data; it just repeats
the most common label, but it provides an important baseline score to beat.
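A toy illustration using scikit-learn's built-in implementation (the data values are
placeholders, not the project's dataset):

```python
# The "most frequent" dummy baseline always predicts the majority class.
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3]]                         # features are ignored by this strategy
y = ["neutral", "neutral", "positive", "negative"]

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(dummy.predict([[4], [5]]))                 # -> ['neutral' 'neutral']
print(dummy.score(X, y))                         # accuracy of always guessing 'neutral'
```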
