A Comparative Study of Some Selected Classifiers On An Imbalanced Dataset For Sentiment Analysis

Extracting subjective data from online user generated text documents is made quite easy with the use of sentiment analysis. For a classification task different individual algorithms are applied to a review dataset in which most classifiers produce accurate results while others produce limited and inaccurate predictions.

Volume 9, Issue 5, May – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24MAY1751

A Comparative Study of Some Selected Classifiers on an Imbalanced Dataset for Sentiment Analysis

Mohammed Ali Kawo¹; Dr. Garba Muhammad²; Dr. Danlami Gabi³ and Dr. Musa Sule Argungu⁴
¹Department of Computer Science, Federal University Gusau, Zamfara State, Nigeria
¹,²,³,⁴Department of Computer Science, Kebbi State University of Science and Technology, Aliero, Nigeria

Abstract:- Extracting subjective data from online user-generated text documents is made quite easy with the use of sentiment analysis. For a classification task, different individual algorithms are applied to a review dataset; most classifiers produce accurate results, while others produce limited and inaccurate predictions. This research evaluates various machine learning algorithms for online dataset classification, where the same set of data is used to test four different machine learning algorithms: Naive Bayes, Support Vector Machine, K-Nearest Neighbor and Decision Tree. Determining which machine learning model performs best in sentiment analysis is a constant issue. In this research, our primary goal is to identify the most effective machine learning model for sentiment analysis of English texts among the aforementioned classifiers. Their robustness is tested on an imbalanced dataset from Kaggle.com, a machine learning repository. The dataset first undergoes data preprocessing to enable analysis, and then feature extraction for the base classifiers; the experiments are carried out in Jupyter Notebook from Anaconda. The performance of each machine learning algorithm is scored using the confusion matrix, F1-score, precision and recall.

Keywords:- Machine Learning Algorithms, Sentiment Analysis, Imbalanced, Confusion Matrix.

I. INTRODUCTION

Machine learning is the concept of self-learning (George & Srividhya, 2022). It is a subset of artificial intelligence that involves training computers to learn and improve from data without being explicitly programmed. It relies heavily on algorithms and statistical models to recognize patterns and to make predictions and decisions based on the input data. Machine learning processes large amounts of data to discover insights and develop automated responses and actions, enabling computers to perform tasks and improve their performance over time (George & Srividhya, 2022).

Sentiment analysis examines how individuals express their ideas, sentiments, assessments, attitudes, and emotions in written language (Kumar et al., 2023). The inspection of views or feelings in text data is known as sentiment analysis or opinion mining. Sentiment analysis determines a person's opinion or sentiment toward a particular incident (Kawade & Oza, 2017). To perform sentiment analysis, we must provide a text or document that can be examined, together with a system or model that summarizes the opinions expressed in the text (Krishna, 2020). Customers' sentiment about a company's goods and services is determined from comments and reviews by other users; this has proven extremely helpful in practically every business and social arena (Kumar et al., 2023). Sentiment analysis involves a variety of techniques, including Natural Language Processing (NLP), Machine Learning (ML), Deep Learning (DL), ensemble methods and hybrid techniques.

Many studies have concentrated on using standard classifiers such as maximum entropy, Naive Bayes, decision tree, K-nearest neighbor and support vector machine to handle most problems (Kasthuri & Jebaseeli, 2020). However, to improve classification accuracy in sentiment analysis, a substantial and robust classifier must be obtained.

As a text classifier that can categorize text into different sentiments, sentiment analysis, also known as opinion mining, is useful for reviews of movies, products and customer services, and for opinions about any event, such as politics or societal activities (Kawade & Oza, 2017). Sentiment analysis is also useful to academics and practitioners in human-computer interaction, as well as those in other disciplines such as sociology, marketing, economics and advertising, for identifying people's opinions about any event (Bahwari, 2019). It can also be used to determine whether a particular item or service is good or bad, preferred or not preferred, and to determine the polarity of a text (positive, negative, or neutral).

Due to the recent rapid rise of social platforms, a great deal of research in the field of sentiment analysis has focused on social media. To improve businesses or find solutions to a variety of real-world issues, practitioners and researchers have been working tirelessly to investigate and analyze this huge amount of data. They have done this by utilizing the daily interactions and ever-growing user-generated material that the websites facilitate (Agustini, 2021).

IJISRT24MAY1751 www.ijisrt.com 2826



Educators must comprehend the views and feelings of their students, just as organizations must comprehend the thoughts of their clients. Sentiment analysis is also very useful in an educational setting, where teachers and students are the driving forces behind the advancement of every nation's educational infrastructure (Alade & Nwankpa, 2022). In most cases, opinion mining or sentiment analysis systems in education are created to find out what students think about education and how to improve the sector.

Sentiment analysis is the act of making assessments of people's ideas, imaginations, and personalities based on their written words, feelings, various picture types including emoticons, behavior, artwork, and other visual signs. Even though sentiment analysis is extensively used in many domains, there are still areas where its application is needed, and the best models that can perform the analysis and predictions accurately are yet to be defined.

II. LITERATURE REVIEW

(Bahwari, 2019) performed sentiment analysis of review datasets using Naïve Bayes and K-NN classifiers as two supervised methods on two datasets, film and hotel. The more training data entered, the better the accuracy obtained by the NB algorithm on the film dataset, whereas for the K-NN method the accuracy varied randomly. In (Tan et al., 2023) the authors set out to develop a sentiment analyzer that could classify the polarity of text with outstanding precision. To do this, they employed five distinct machine learning techniques, among them Logistic Regression, Bernoulli Naive Bayes, Naive Bayes, and linear support vector classification, where Naïve Bayes outperformed all other classifiers.

SVM is used to identify slogs. It was determined which models were most useful for logging web frameworks that used web indexes (Meenu, 2019). In (Meenu, 2019) the authors suggested several classification algorithms, namely Sequential Minimal Optimization (SMO), Logistic Regression, Decision Trees, Naïve Bayes, and Classification and Regression Trees, to identify phishing mails in a coordinated manner across supervised and unsupervised methods.

In (Zishumba, 2019), machine learning techniques such as Support Vector Machine, the bag-of-words model, and Naïve Bayes are used for sentiment analysis of digital texts. In (Agustini, 2021) the author applied a number of classifiers to a dataset of movie reviews to divide them into positive and negative categories. On 85,600 user reviews, Logistic Regression delivered the second-best results, with an accuracy of 99.46%.

(George & Srividhya, 2022) provide a successful method for creating precise classifiers for the Usenet2 dataset. The base classifiers used in the recommended approach are Naïve Bayes, Support Vector Machine, and Genetic Algorithm. In their work both homogeneous and heterogeneous models are constructed, and the suggested bagged ensemble techniques improved classification accuracy significantly compared to the base classifiers.

According to (Mostafa et al., 2021), Support Vector Machine, Bayesian, and Entropy classifiers were used to determine the sentiment polarity of tweets, yielding positive, negative and neutral tweets. These three distinct supervised machine learning methods for classifying Twitter material according to phrases were applied to trained datasets in three different ways. However, to obtain precise and trustworthy predictions, many classifiers are combined using ensemble approaches.

(Kumar et al., 2023) conduct sentiment analysis on the Twitter140 dataset using Decision Tree, Logistic Regression, and Support Vector Machine. Within the biased techniques, these algorithms are very common. One of the struggles in machine learning sentiment analysis is acquiring a large amount of data for better classification (Lazrig & Humpherys, 2022).

III. BASE CLASSIFIERS FOR SENTIMENT ANALYSIS

Machine learning is predicted to be among the most innovative cutting-edge technologies of the twenty-first century (Jordan & Mitchell, 2020). Although the future cannot be predicted, society must start considering ways to optimize its advantages. To gain more insight for our research, current and advanced reviews in machine learning were explored in order to establish more facts about the widely used machine learning algorithms to be used in our research, which are:

• Naive Bayes:
The Naive Bayes model can handle large amounts of data and is robust against complicated classification methods (George & Srividhya, 2022). Naïve Bayes theory is explained by the following equation: P(H|E) = (P(E|H) * P(H)) / P(E), where P(H|E) is the posterior probability of the hypothesis given that the evidence is true, P(E|H) is the likelihood of the evidence given that the hypothesis is true, P(H) is the prior probability of the hypothesis, and P(E) is the prior probability that the evidence is true. The main advantages of this classifier are that it predicts the correct class for a newly produced instance and is simple to use (Patel, 2017).
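Bayes' rule as quoted in the Naive Bayes description can be made concrete with a small worked example. The sketch below is illustrative only: the hypothesis (a review is positive), the evidence (the review contains a particular word) and all three probabilities are hypothetical numbers, not values from the dataset used in this study. P(E) is expanded with the law of total probability.

```python
def posterior(prior_h, likelihood_e_given_h, likelihood_e_given_not_h):
    """Return P(H|E) = P(E|H) * P(H) / P(E).

    P(E) is expanded as P(E|H) * P(H) + P(E|not H) * P(not H).
    """
    evidence = (likelihood_e_given_h * prior_h
                + likelihood_e_given_not_h * (1.0 - prior_h))
    return likelihood_e_given_h * prior_h / evidence


# Hypothetical inputs (assumptions for illustration):
#   P(H)       = 0.2  -> 20% of reviews are positive
#   P(E|H)     = 0.6  -> 60% of positive reviews contain the word
#   P(E|not H) = 0.1  -> 10% of other reviews contain the word
p = posterior(prior_h=0.2, likelihood_e_given_h=0.6, likelihood_e_given_not_h=0.1)
print(f"P(H|E) = {p:.2f}")
```

With these assumed inputs, seeing the evidence raises the probability of the hypothesis from the prior of 0.2 to a posterior of 0.6. A Naive Bayes classifier repeats this update for every feature, treating the features as conditionally independent.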


Fig 1 Naive Bayes (Gaussian Distribution)

• Support Vector Machine:
For nonlinear regression and classification tasks, support vector machines (SVMs) are essentially binary classifiers that work well at categorizing both linear and nonlinear data (Patel, 2017). SVMs handle overfitting problems that occur in high-dimensional environments thanks to their global optimization basis, and they are helpful in a variety of applications (Liakos et al., 2018). The process represents each data point as a point in an n-dimensional space, where n is the total number of features, and each feature's value is a coordinate. Finding the hyperplane that can be used to separate the classes is the next stage in the classification process (George & Srividhya, 2022).

Fig 2 Support Vector Machine

• Decision Tree:
Both regression and classification may be done using the decision tree, owing to its tree-like structure. Using decision trees, one can create a training model that can be used to predict the class or value of the target variable. The application of decision trees is very advantageous in many fields, such as databases, taxonomy and identification, machine diagnosis, switching theory, pattern recognition, decision table programming, and algorithm analysis (Moret, 2019). The diagrammatic representation of a decision tree is illustrated in Figure 3 below.

Fig 3 Decision Tree Diagram
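To make the tree-building idea concrete, the following minimal sketch chooses a single split for one numeric feature, which is the step a decision tree repeats at every node. It assumes Gini impurity as the split criterion; the paper does not state which criterion was used, and the feature name `ages` and the labels are hypothetical toy data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'.

    Returns (None, gini(labels)) when no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: customer age and whether the customer is interested (1) or not (0).
ages = [22, 25, 31, 38, 44, 52]
interested = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(ages, interested)
print(threshold, impurity)
```

On this toy data the rule `age <= 31` separates the two classes completely, so the weighted impurity after the split is zero. A full decision tree applies the same search recursively to each resulting subset and over every feature.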


• K-Nearest Neighbors:
The KNN technique, first proposed by Cover and Hart in 1967, is one of the simplest machine learning algorithms and a theoretically valid method. The idea behind KNN is simple and straightforward: given a sample, if most of its K closest neighbors (i.e., the most similar samples) in the feature space belong to a class, then the sample also belongs to that class. The classification outcome of the sample is directly affected by the choice of K (Feng et al., 2023). Figure 4 below displays how K-Nearest Neighbor took values of a different sample.

Fig 4 K-Nearest Neighbor

IV. DATASET

Datasets are essential in data analysis and machine learning. A vigorous decision-making process for organizations and the entire nation is greatly enhanced by the meaningful utilization of data. To carry out the proposed research, an imbalanced dataset of a life and health insurance company was chosen from Kaggle.com, a machine learning repository. The dataset is in CSV format and comprises 267,507 records with fourteen input features.

Machine learning practitioners working on binary classification and sentiment analysis tasks frequently find imbalanced datasets to be a barrier in detection tasks such as fraud, spam, diseases and hardware faults (George & Srividhya, 2022). An imbalanced dataset comprises two distinct kinds of observations, one belonging to the majority class and the other to the minority class. Figure 5 below shows the schema of an imbalanced dataset.

Fig 5 An Imbalanced Dataset

When the data are evenly distributed across the classes, the majority of conventional classification techniques perform proficiently in terms of total classification accuracy. However, when categorizing an imbalanced dataset that contains only some examples from the class of interest, these classifiers are unable to do as well (Thesis, 2023).

V. METHODOLOGY

This section is the most essential segment of our research; it describes the techniques and algorithms applied to the dataset to obtain the desired results. In our research, the task of the aforementioned classifiers is to predict, using the provided input features of the imbalanced dataset, whether insured customers will be willing to sign up for the vehicle insurance newly provided by the company. All four base classifiers are used individually and their results are compared to ascertain which classifier has the highest accuracy level. Figure 6 below illustrates our methodological workflow.


Fig 6 Proposed Workflow Diagram
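The concern raised in Section IV, that total accuracy can look proficient on an imbalanced dataset while the class of interest is missed, can be illustrated with a short sketch. The 850/150 class split below is a hypothetical example and is not the class ratio of the insurance dataset, which the paper does not report. The "classifier" is a trivial baseline that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Hypothetical imbalanced labels: 850 majority (0) and 150 minority (1) samples.
y_true = [0] * 850 + [1] * 150
# Baseline that ignores the input features and always predicts the majority class.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(baseline_accuracy, baseline_recall)
```

Under these assumptions the baseline reaches 85 percent accuracy while finding none of the minority-class samples, which is one reason accuracy is usually read alongside precision, recall and F1-score on imbalanced data.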

• Data Preprocessing
Once data has been collected, data preparation, mostly known as preprocessing, is an essential step in sentiment analysis. It cleans, organizes, and scrubs raw data into a format that machine learning models can use for training and evaluation. Preprocessing, also known as text filtering (Arya et al., 2019), is the process of removing noisy, unreliable and partial data through the use of tokenization, stemming, and vectorization techniques, all of which are essential sub-steps in the process.

Scrubbing the data is a crucial first step in preprocessing for sentiment analysis. Scrubbing is the technical process of refining the dataset to increase its utility. It requires data that is redundant, incomplete, incorrectly formatted, or irrelevant to be edited and occasionally removed (Theobald, 2017).

• Separating Training and Testing Data Set
In machine learning, we typically divide the original dataset into two subsets, the training set and the testing set. We then fit the model on the training data in order to make predictions for the test set. Because the dataset is imbalanced, we employed both stratified sampling and k-fold cross-validation with k equal to 5 to divide the original dataset into training and testing sets. This gives each split a similar, representative share of samples from each class, which improves the quality and efficiency of the models and enables a smooth comparison among them.

• Training Model
In our research, we employed four popular and distinct models, K-Nearest Neighbors (KNN), Decision Trees (DT), Support Vector Machines (SVM) and Naïve Bayes (NB), to classify the dataset in order to ascertain which classifier performs best in terms of accuracy, precision, recall and F1-score.

We encountered a problem while using the SVM classifier: because of the enormous dataset, it took approximately two and a half hours to classify roughly three hundred thousand records with just fourteen features. Consequently, we deployed the DSVM classifier, known as the dual support vector machine. The DSVM classifies large datasets promptly, with less time consumption.

• Testing the Model
Performance evaluation is a critical component of every research study, since it is essential to examine the behavior of the system. A confusion matrix is used in machine learning to evaluate the effectiveness of a classification model (Zishumba, 2019). Where the true values are known, it compares the test results against them in tabular form. The performance of the models in this research is also evaluated using the confusion matrix.
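The confusion matrix and the scores derived from it can be sketched in a few lines. The example below is a minimal illustration for the binary case; the label vectors are hypothetical and the helper names `confusion_matrix` and `scores` are chosen here for illustration rather than taken from the authors' notebook.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical test labels: 3 positive and 7 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_matrix(y_true, y_pred)
acc, prec, rec, f1 = scores(*counts)
print(counts, acc, prec, rec, f1)
```

Here two of the three positive samples are found and one negative sample is flagged wrongly, so accuracy is 0.8 while precision, recall and F1 are each two thirds. The gap between accuracy and the other three scores widens as the classes become more imbalanced.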


VI. RESULT AND DISCUSSION

On the imbalanced Kaggle dataset of the life and health insurance company, we implemented four classifiers, Naïve Bayes, Support Vector Machine, K-Nearest Neighbor and Decision Tree, each with the same dataset split into a training set and a testing set. Jupyter Notebook from Anaconda was used to run the experiment that serves as the basis for the findings presented here. Various performance metrics, including classification accuracy, are used to compare these classifiers; the precision, recall, and F1-score values obtained from each classifier also serve as an additional means of assessing the classifiers.

After the training phase on each of these classifiers (NB, KNN, SVM, and DT), the test data is applied to measure classification performance. Table 1 below displays the confusion-matrix-based results of all the classifiers under the stratified sampling technique, which was used on the dataset in order to have equal representation of all the classes. Figure 7 below shows a bar chart of the accuracy results of all four classifiers.

Table 1 Performance Evaluation based on Confusion Matrix

Classifiers          Accuracy   Precision   Recall   F1-score
Naïve Bayes          77%        41%         84%      55%
K-Nearest Neighbor   80%        31%         16%      21%
SVM                  84%        48%         10%      16%
Decision Tree        81%        43%         45%      44%

Table 1 above shows the scores of all four classifiers. Support Vector Machine, with an accuracy of 84 percent, outperforms all other classifiers on accuracy even though its recall and F1-score are low, while Decision Tree comes next with 81 percent accuracy and moderate recall and F1-score.
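The F1-scores in Table 1 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below uses only the rounded percentages printed in the table, so small differences from the reported F1 values are expected from rounding; the dictionary layout is an illustrative choice.

```python
# (precision, recall, reported F1) as printed in Table 1, converted to fractions.
table1 = {
    "Naive Bayes": (0.41, 0.84, 0.55),
    "K-Nearest Neighbor": (0.31, 0.16, 0.21),
    "SVM": (0.48, 0.10, 0.16),
    "Decision Tree": (0.43, 0.45, 0.44),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_score(p, r) for name, (p, r, _) in table1.items()}
ranking = sorted(recomputed, key=recomputed.get, reverse=True)

for name in ranking:
    reported = table1[name][2]
    print(f"{name}: recomputed F1 = {recomputed[name]:.3f}, reported = {reported:.2f}")
```

Each recomputed value falls within one percentage point of the reported F1-score. Ordering the classifiers by F1 rather than accuracy puts Naive Bayes first and SVM last, the reverse of the accuracy ordering, which reflects the difference between overall accuracy and minority-class performance on this imbalanced dataset.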

Fig 7 Bar Chart Accuracy Result for the Four Classifiers

Table 2 Performance Comparisons of Four Classifiers based on K-fold cross Validation

Classifiers     Accuracy K1   Accuracy K2   Accuracy K3   Accuracy K4   Accuracy K5
Naïve Bayes     0.76955       0.77583       0.77150       0.77039       0.77184
KNN             0.80439       0.80821       0.80617       0.80524       0.80613
SVM             0.83555       0.83713       0.83565       0.83674       0.83586
Decision Tree   0.81260       0.81791       0.81325       0.81303       0.81345

Table 2 above shows that Support Vector Machine has the highest accuracy in every fold, about 84 percent, followed by the Decision Tree classifier at roughly 81 to 82 percent.
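The per-fold accuracies in Table 2 can be summarized by their mean and spread, which is a common way to report k-fold cross-validation results. The sketch below copies the values from the table and uses the population standard deviation over the five folds; that choice of spread measure is an assumption made here, not something the paper specifies.

```python
from statistics import mean, pstdev

# Fold accuracies K1..K5 as printed in Table 2.
folds = {
    "Naive Bayes": [0.76955, 0.77583, 0.77150, 0.77039, 0.77184],
    "KNN": [0.80439, 0.80821, 0.80617, 0.80524, 0.80613],
    "SVM": [0.83555, 0.83713, 0.83565, 0.83674, 0.83586],
    "Decision Tree": [0.81260, 0.81791, 0.81325, 0.81303, 0.81345],
}

# Mean accuracy and spread across the five folds for each classifier.
summary = {name: (mean(acc), pstdev(acc)) for name, acc in folds.items()}
best = max(summary, key=lambda name: summary[name][0])

for name, (avg, spread) in summary.items():
    print(f"{name}: mean = {avg:.4f}, std = {spread:.4f}")
print("highest mean accuracy:", best)
```

The mean accuracies are about 0.772 for Naive Bayes, 0.806 for KNN, 0.836 for SVM and 0.814 for Decision Tree, and the spread across folds is small for all four, so the accuracy ordering is stable across the splits.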


VII. CONCLUSION AND FUTURE WORKS

In summary, insurance companies, product companies, industries of all kinds, institutions and health practitioners can use machine learning methods to analyze sentiments. The life and health insurance company dataset was used to predict whether customers are willing to apply for vehicle insurance with that same company. The results in Table 1 show low precision scores, which indicates a low outcome of customers who are willing to take up vehicle insurance with the same company. Even though our target is to compare the classifiers, we still have to predict the outcome of the dataset. Frequent changes in vocabulary and cultural diversity pose a great challenge in the field of sentiment analysis. Every culture has its own way of expressing emotions, be it happiness or sadness. Contextual sensitivity is another factor that contributes to the challenges of sentiment analysis, since grammar continues to evolve every day. The research carried out shows that support vector machine has the highest classification accuracy compared to Naïve Bayes, k-nearest neighbor and decision tree. This indicates that, in terms of accuracy, support vector machine still performs well even on imbalanced data, since it has the strength of handling complex regression or classification problems. In the future, deep learning should be compared with some of the base machine learning classifiers, such as support vector machine, in predicting sentiment, so as to standardize a model for sentiment analysis.

REFERENCES

[1]. Agustini, T. (2021). Sentiment Analysis on Social Media using Machine Learning-Based Approach. June, 544437.
[2]. Arya, P., Bhagat, A., & Nair, R. (2019). Improved Performance of Machine Learning Algorithms via Ensemble Learning Methods of Sentiment Analysis. 10(2), 110–116.
[3]. Bahwari. (2019). Sentiment Analysis Using Random Forest Algorithm - Online Social Media Based. Journal Of Information Technology AND ITS UTILIZATION, 2(2), 29–33. https://www.researchgate.net/publication/338548518_SENTIMENT_ANALYSIS_USING_RANDOM_FOREST_ALGORITHM_ONLINE_SOCIAL_MEDIA_BASED
[4]. Feng, W., Gou, J., Fan, Z., & Chen, X. (2023). An ensemble machine learning approach for classification tasks using feature generation. Connection Science, 35(1). https://doi.org/10.1080/09540091.2023.2231168
[5]. George, S., & Srividhya, V. (2022). Performance Evaluation of Sentiment Analysis on Balanced and Imbalanced Dataset Using Ensemble Approach. Indian Journal of Science and Technology, 15(17), 790–797. https://doi.org/10.17485/ijst/v15i17.2339
[6]. Ghosh, S., Hazra, A., & Raj, A. (2020). A Comparative Study of Different Classification Techniques for Sentiment Analysis. International Journal of Synthetic Emotions, 11(1), 49–57. https://doi.org/10.4018/ijse.20200101.oa
[7]. Jawale, S. (2019). Sentiment Analysis using Ensemble Learning. May.
[8]. Jordan, M. I., & Mitchell, T. M. (2020). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260. https://doi.org/10.1126/science.aaa8415
[9]. Kawade, D. R., & Oza, D. K. S. (2017). Sentiment Analysis: Machine Learning Approach. International Journal of Engineering and Technology, 9(3), 2183–2186. https://doi.org/10.21817/ijet/2017/v9i3/1709030151
[10]. Kumar, S., Kaur, N., Kavita, & Joshi, A. (2023). Tweet sentiment analysis using logistic regression. July, 332–336. https://doi.org/10.1049/icp.2023.1801
[11]. Lazrig, I., & Humpherys, S. L. (2022). Using Machine Learning Sentiment Analysis to Evaluate Learning Impact. Information Systems Education Journal (ISEDJ), 20(1), 20. https://isedj.org/; https://iscap.info
[12]. Liakos, K. G., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018). Machine learning in agriculture: A review. Sensors (Switzerland), 18(8), 1–29. https://doi.org/10.3390/s18082674
[13]. Meenu, S. G. (2019). 154. Sunila. International Journal of Electronics Engineering (ISSN: 0973-7383), Volume 11, Issue 1, 965–970.
[14]. Mostafa, G., Ahmed, I., & Junayed, M. S. (2021). Investigation of Different Machine Learning Algorithms to Determine Human Sentiment Using Twitter Data. International Journal of Information Technology and Computer Science, 13(2), 38–48. https://doi.org/10.5815/ijitcs.2021.02.04
[15]. Patel, R. (2017). Sentiment Analysis on Twitter Data Using Machine Learning. A thesis submitted in partial fulfillment of the requirements for the degree of MSc Computational Sciences, The Faculty of Graduate Studies.
[16]. Tan, K. L., Lee, C. P., & Lim, K. M. (2023). A Survey of Sentiment Analysis: Approaches, Datasets, and Future Research. Applied Sciences (Switzerland), 13(7). https://doi.org/10.3390/app13074550
[17]. Theobald, O. (2017). Machine Learning For Absolute Beginners.
[18]. Zishumba, K. (2019). Sentiment Analysis Based on Social Media Data. Journal of Information and Telecommunication, 1–48. http://repository.aust.edu.ng/xmlui/bitstream/handle/123456789/4901/Kudzai Zishumba.pdf?sequence=1&isAllowed=y
