2023 International Conference on Ambient Intelligence, Knowledge Informatics and Industrial Electronics (AIKIIE)

Fake Reviews Detection Based on Sentiment Analysis using ML Classifiers
1st Rajesh N, 2nd Ramachandra A C, 3rd Chaurasia Vaibhav
Nitte Meenakshi Institute of Technology, Bengaluru, India
[email protected]

4th Ayush Tomar, 5th Hemant Kumawat, 6th Anurag Prasad
Nitte Meenakshi Institute of Technology, Bengaluru, India

7th Ramprasad Poojary
Manipal Academy of Higher Education Dubai Campus, Dubai, UAE

979-8-3503-1646-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/AIKIIE60097.2023.10390289

Abstract—Fake reviews carry substantial significance for both consumers and business owners, and detecting them early to limit their negative influence has become crucial with the increased popularity of online shopping. In this paper we propose a machine learning approach that analyses both the text and the behaviour. A dataset of reviews was collected from various sites and labelled as Real or Fake. We then used machine learning approaches such as the Support Vector Machine, Logistic Regression and Multinomial Naive Bayes to train and evaluate our model. Results show that our method provides accuracy on par with existing ones. Businesses and review platforms can use our method to detect and remove fake reviews, which enhances trust in online reviews. The developed system achieves an accuracy rate of 85% in identifying fake reviews, demonstrating its effectiveness in improving the reliability of online reviews.

Keywords—Fake review detection, Natural language processing, Machine learning, Support vector machine, Logistic regression, Multinomial naïve Bayes, Tf-idf vectorizer, Count vectorizer.

I. INTRODUCTION

Today, customers rely heavily on reviews to make decisions about products and services. For example, before purchasing a product online, customers read reviews written by other people to get an idea of the users' hands-on experience. If the reviews are positive, they are more likely to purchase that product. Unfortunately, some reviews can be deceptive, which is why detecting fake reviews has become an active research area. Fake reviews can mislead customers and negatively impact businesses.

According to a survey, about 93% of consumers read reviews online before buying a product and 84% of people trust the reviews, but the rapid growth of online shopping has escalated the problem of fake reviews. One reason for writing fake reviews from multiple accounts is to gain a business advantage by defaming a competitor. Initially, rule-based methods were used to detect fake reviews, but their effectiveness was not on par; hence, analysing the review text as well as the behaviour of the reviewer is a must [1].

In this research paper, we used machine learning classifiers to detect non-genuine reviews according to their content and other features. We tested the proposed classifiers on real reviews taken from an e-commerce website (Amazon) and also applied feature engineering to capture various behaviours of the reviewers. We compared the classifiers with and without these extracted features using three different models, namely Support Vector Machine, Multinomial Naïve Bayes and Logistic Regression, along with two different feature extraction techniques. The findings indicate that the count vectorizer yielded better model performance than the TF-IDF vectorizer. Additionally, the LR classifier demonstrated the highest accuracy rate at 85% and a recall rate of 92%. The LR classifier is finally deployed for the detection of fake reviews as a web app on the Streamlit platform [2].

Section 2 of this paper reviews prior studies that have explored the same problem. The remaining sections are structured as follows: in Section 3, we describe our proposed methodology to detect fake reviews in detail. The experimental results and analysis are presented in Section 4. In Section 5, we present our conclusion and discuss potential directions for future research.

II. RELATED WORK

Several studies conducted in recent years are discussed in this section. These studies have explored different techniques and methods to detect fake reviews, such as various ML algorithms, statistical analysis, and NLP (Natural Language Processing). Here is an overview of some of the existing literature on fake review detection. In a paper titled "Identifying Fake Reviews and Reviewers by Exploiting Temporal Information in Online Reviews," the authors propose a method that uses temporal information, such as the time and frequency of reviews, to detect fake reviews and reviewers. They tested their method on a real-world dataset from Amazon and achieved an accuracy of up to 95% [3].

In another paper titled "Identifying Deceptive Reviews Using Time Series Analysis," the authors propose a time-series based method to detect fake reviews that aims to capture the temporal dynamics of reviews. The authors evaluated their method on a dataset of Yelp reviews and achieved an accuracy of up to 80% [4]. The paper "Exploring Supervised and Semi-Supervised Machine Learning Approaches for Fake


Authorized licensed use limited to: Alliance College of Engineering and Design Bangalore. Downloaded on February 11,2024 at 15:28:37 UTC from IEEE Xplore. Restrictions apply.
Review Detection" offers a comparative analysis of supervised and semi-supervised machine learning methods in the context of fake review detection. The authors evaluated their method on a dataset from Yelp and TripAdvisor and achieved an accuracy of up to 86% [5].

A study titled "Fake Reviews: A Survey on Detection Methods" reviews the current state of the art in fake review detection methods. The authors provide an overview of different approaches, such as linguistic, behavioural, and temporal-based methods, and evaluate their effectiveness [6]. Lee and Landgrebe proposed a feature extraction method based on decision boundaries in their paper [7]. The authors described an iterative approach to finding the decision boundary between two classes and selecting the features that are most discriminative along this boundary. Their method was demonstrated to be effective in various classification problems, including remote sensing and medical image analysis.

Hu et al. proposed a hybrid approach for fake review detection in mobile app stores that combines rule-based methods with machine learning techniques. The rule-based component identifies suspicious reviews based on textual features, while the machine learning component uses feature extraction methods and classification algorithms to identify fake reviews based on their semantic and syntactic features. The authors evaluated their approach on a large dataset of mobile app reviews and demonstrated its effectiveness in detecting different types of fake reviews [8].

Divadkar and his colleagues performed a comprehensive survey of ambiguous news detection techniques in their paper [9]. Their review provides a detailed overview of the latest advancements in this field. The authors review and compare deep learning, machine learning, and ensemble paradigms and evaluate the effectiveness of these approaches in detecting ambiguous news articles. By offering a comprehensive overview of these techniques, the paper contributes to the advancement of ambiguous news detection methods. In our paper, we present a novel machine learning technique to detect fake reviews. We explore the use of various classifiers to determine which of them performs best among existing techniques.

III. PROPOSED SYSTEM

To develop a system for detecting fake reviews, the proposed methodology uses Amazon's verified-purchase label to train three classifiers through supervised learning on a labelled dataset from Amazon. The three classifiers used are MNB, SVM, and LR, which were chosen based on their effectiveness in classification tasks. The schematic diagram of the proposed methodology is shown in Fig. 1, depicting the sequence of steps involved in identifying fake reviews through the developed machine learning approach. The diagram provides an overview of the research process, serving as a visual aid to understand the study's methodology.

Fig. 1. Schematic Representation of the ML-Based Approach for Detecting Fake Reviews

A. Dataset

Choosing the right dataset is crucial for machine learning algorithms to function properly. The dataset used for this study was downloaded from Kaggle and contains reviews from the Amazon website along with Amazon's verified-purchase labels [10].

The dataset consisted of 32 columns that contained an abundance of information. After analysing the dataset, it was determined that the focus was primarily on the product and the reviews rather than the reviewers themselves. As a result, the project adopted a linguistic approach to fake review detection instead of a behavioural one. The remaining 30 columns provided useful context for understanding the reviews and products in greater detail, which aided in the analysis of the results. Although the excess columns were eventually dropped, their content was thoroughly examined through graphs and analysis throughout the project. Fig. 2 shows the overview of the dataset in graphical form.

Fig. 2. Overview of the dataset

B. Data Pre-processing

Data pre-processing refers to the critical step of cleaning, transforming, and preparing raw data into a format suitable
for analysis in data analysis and machine learning. Its importance lies in ensuring that the quality of the input data is high, as it ultimately affects the accuracy and effectiveness of the final results. Raw data often contain inconsistencies, missing values, noise, outliers, and irrelevant features that need to be addressed before any analysis can take place. Data pre-processing involves various techniques such as data cleaning, feature selection, feature scaling, normalization, and dimensionality reduction to improve the quality of the data while retaining the properties essential for good results during analysis. A well-pre-processed dataset reduces the risk of inaccurate results and increases the efficiency and effectiveness of the analysis process. This study applies several pre-processing techniques to the Amazon dataset, converting it into a format appropriate for computational tasks. Through these methods, the input data undergoes several transformations, such as cleaning, normalization, and feature selection, to remove inconsistencies, reduce noise, and improve the data's quality. The processed dataset can then be used for further analysis or as input to machine learning models for various applications. The following steps were taken to achieve this goal:

1) Data cleaning: This step involves the identification and rectification of errors, inconsistencies, and inaccuracies present in a dataset to improve its quality for analysis. It involves several steps such as handling missing values, removing duplicates, correcting data format and type, removing irrelevant data, and resolving inconsistencies between different data sources. Data cleaning is an essential step in data pre-processing as it ensures that the data is accurate, consistent, and ready for analysis.

2) Text pre-processing: This is the process of cleansing and preparing unstructured text data for further analysis and processing. One common approach to text pre-processing is lowercasing, which involves converting all the characters in the text to lowercase. Another step is removing punctuation such as commas and periods to reduce noise in the text data. Additionally, spelling can be corrected using techniques such as autocorrection or dictionary lookup to improve accuracy. Applying pre-processing steps to text data can significantly enhance its quality and consistency, making it suitable for natural language processing tasks like sentiment analysis and topic modeling. By performing these steps, text data can be transformed into a more structured format, facilitating its use in various machine learning and analytical applications.

3) Tokenization: This is the process of breaking down text into chunks or units known as tokens. These tokens can be words, phrases, or even individual characters. Tokenization plays a crucial role in natural language processing (NLP) and is used to analyse and understand text data. It reduces the complexity of the text data, making it easier to process and analyse. There are various techniques for tokenization, such as white-space tokenization, rule-based tokenization, and statistical tokenization. In white-space tokenization, text is split based on spaces, tabs, and line breaks. In rule-based tokenization, specific rules are defined to split the text. In statistical tokenization, machine learning algorithms are used to automatically determine the boundaries of tokens.

4) Stemming: Stemming is a natural language processing technique used to reduce words to their base or root form, which helps in text analysis and retrieval. It involves removing prefixes and suffixes from words to obtain the base form or stem. This helps in reducing the size of the vocabulary and hence improving the efficiency of text analysis algorithms. As an example, the stem of the words "running," "runner," and "runs" is "run." Stemming is often used in search engines, information retrieval systems, and spam detection systems to improve their accuracy and efficiency.

5) Removing Stopwords: Stopwords are commonly occurring words which do not add much meaning to the text, such as "a", "an", "the", "is" etc. The process eliminates these common words from a text document to reduce noise and improve the performance of natural language processing (NLP) algorithms. This is because stop words carry little useful information for the analysis or classification of a text document. By removing them, the remaining words in the document become more meaningful and can help identify the key topics, sentiments, or patterns in the text. The removal of stop words is a common technique in text pre-processing for NLP tasks, which include sentiment analysis, topic modeling, and text classification.

C. Feature Extraction

Feature extraction is an essential step in machine learning, where the relevant information is extracted from raw data and converted into a set of meaningful features, which is then used to build predictive models. In the context of text data, feature extraction involves representing text as numerical features that can be used for analysis. Common techniques for feature extraction in natural language processing include bag-of-words representation, TF-IDF, and word embeddings. The extracted features can then be used as inputs for machine learning algorithms to perform tasks such as classification, clustering, and sentiment analysis. The goal of feature extraction is to transform the raw data into a format that can be easily analyzed by machine learning models, while still preserving the important information contained in the original data. In this project, two feature extraction techniques were used to convert the pre-processed text data into a numerical representation suitable for machine learning algorithms. These techniques were TF-IDF and Count Vectorizer:

1) TF-IDF Vectorizer: TF-IDF is a statistical measure commonly used in NLP to determine the relevance of a word in a document, taking into account both its frequency in that document and the frequency of the term across the entire corpus. In simpler words, it measures how frequently a word appears in a document and adjusts this according to how many other documents also contain the word. This is useful because some words may appear frequently across all documents in the collection and may not provide much information to differentiate between documents. By using the TF-IDF weighting scheme, we can give more importance to words that are specific to a particular document and less importance to words that are common across all documents.
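As a rough illustration of the TF-IDF weighting described above, the following minimal Python sketch computes tf × log(N/df) for a tiny tokenized corpus. The function name `tf_idf`, the sample reviews, and the unsmoothed formula are assumptions made for illustration; the paper does not state which TF-IDF variant or library was used, and library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the plain form tf * log(N / df); this variant is an assumption,
    not necessarily the one used in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency is the share of the document's tokens
            # that are this term.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    "great product fast delivery".split(),
    "great price great product".split(),
    "terrible product".split(),
]
scores = tf_idf(docs)
```

In this toy corpus "product" occurs in every document, so its weight is zero everywhere, while "terrible" occurs in only one document and receives the largest weight there, which mirrors the behaviour described in the paragraph above.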

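The text pre-processing steps 2) to 5) of Section III-B can be sketched in a few lines of Python. This is only a toy pipeline under stated assumptions: the stopword list, the suffix rules in `toy_stem`, and the function names are hypothetical, and a real system would more likely rely on an established stemmer such as Porter's algorithm.

```python
import string

# Tiny illustrative stopword list (assumption, not the list used in the paper).
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to"}

# Toy suffix rules for illustration only; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation, tokenize, remove stopwords, stem."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # White-space tokenization: split on spaces, tabs, and line breaks.
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


tokens = preprocess("The delivery is FAST, and the packaging looked great!")
```

Note that the toy rules produce crude stems such as "packag"; this is typical of suffix stripping, where the stem only needs to be consistent across word forms rather than a dictionary word.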
2) Count Vectorizer: Count Vectorizer is a powerful feature extraction method used in NLP that converts a set of text documents into a matrix of word occurrences. In the matrix, each row corresponds to a document and each column corresponds to a unique word in the corpus. The cell value is the number of times the word appears in the respective document. It is a simple and effective way to convert text data into a format suitable for machine learning algorithms. This technique disregards the order and context of the words in the document and concentrates solely on the frequency of occurrence of each word in the document.

D. Evaluating Models

To evaluate the effectiveness of our proposed method for detecting fake reviews, we utilized the following machine learning algorithms:

1) Multinomial Naive Bayes: MNB is a simple but effective algorithm used for text classification. It relies on Bayes' theorem, which calculates the probability of an event based on prior knowledge of conditions that could be related to the event. MNB is specifically designed for classification tasks where the features are discrete, such as word counts [11]. MNB works by calculating the conditional probability of a certain feature (word) occurring in a specific class (e.g., positive or negative sentiment) based on the training data. During the classification phase, the algorithm uses these probabilities to predict the most likely class for a given text sample. MNB is widely used in spam filtering, sentiment analysis, and other text classification tasks.

2) Logistic Regression: It is a popular statistical approach used in machine learning for predicting binary outcomes, such as the presence or absence of a particular event. It models the relationship between the dependent variable and the independent variables by fitting a sigmoid function to the data, which allows for the prediction of a binary outcome [12-13]. It is a parametric model that uses the logistic function to model the probability of the dependent variable. The logistic function is an S-shaped curve that maps any input value to a value between 0 and 1, representing the probability of the output being true. The model is trained by adjusting the weights of the independent variables to reduce the error between the predicted and the actual outcomes. It is a widely used algorithm in various fields, including healthcare, finance, and marketing.

3) Support Vector Machine (SVM): SVM is a versatile supervised machine learning algorithm that is applied to various tasks. It is useful when dealing with complex datasets which have non-linear boundaries. SVM is primarily designed to address classification and regression problems [14-15]. By maximizing the margin between different classes, SVM seeks to establish an optimal boundary that can separate them effectively. In this regard, SVM works by transforming the data into a higher-dimensional space, where it can identify the hyperplane that best separates the classes. The points closest to the hyperplane, known as support vectors, play a crucial role in this process [12].

IV. RESULTS AND ANALYSIS

Here we showcase the outcomes of experiments conducted using our novel methodology to identify fraudulent reviews, demonstrating the effectiveness of our approach. The analysis is conducted on a dataset obtained from Kaggle, which is authentic. The performance of three different ML algorithms, MNB, LR and SVM, is evaluated and compared based on their accuracy.

Our proposed system was tested on the Amazon dataset, which consists of 2501 reviews, a number we considered sufficient to train the models. These reviews are labelled as fake or real by Amazon's verified-purchase label. There are 32 columns in total; however, only two are used for the final analysis. Fig. 3 shows the summary of the dataset.

Fig. 3. Summary of the dataset

Three classifiers are used for the detection of fake reviews: MNB, SVM and Logistic Regression. Their confusion matrices are shown in Fig. 4(a), 4(b) and 4(c), respectively.
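The accuracy, precision, recall, and F1 score reported for each classifier can all be derived from the four cells of a binary confusion matrix. The sketch below shows the arithmetic in Python; the function name and the counts are hypothetical and are not taken from Fig. 4.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives, with "fake" treated as the positive class
    (an assumption for this illustration).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts chosen only to illustrate the formulas.
metrics = binary_metrics(tp=90, fp=20, fn=10, tn=80)
```

Because F1 is the harmonic mean of precision and recall, a classifier with high recall but mediocre precision can still reach an F1 score close to its accuracy, which is the kind of pattern worth checking when comparing models.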

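To make the combination of count features and Multinomial Naive Bayes from Section III more concrete, here is a small pure-Python sketch with Laplace smoothing. The function names, the smoothing parameter `alpha`, and the four toy reviews are assumptions for illustration only; they are not the authors' implementation or data.

```python
import math


def count_vectorize(docs):
    """Build a word-to-column index and a document-term count matrix."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc:
            matrix[i][index[w]] += 1
    return index, matrix


def train_mnb(matrix, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods per class."""
    n_features = len(matrix[0])
    model = {}
    for label in sorted(set(labels)):
        rows = [row for row, y in zip(matrix, labels) if y == label]
        # Total count of each word over all documents of this class.
        feature_totals = [sum(col) for col in zip(*rows)]
        denom = sum(feature_totals) + alpha * n_features
        model[label] = (
            math.log(len(rows) / len(matrix)),
            [math.log((c + alpha) / denom) for c in feature_totals],
        )
    return model


def predict_mnb(model, index, doc):
    """Return the class with the highest log posterior for a tokenized doc."""
    def score(label):
        log_prior, log_like = model[label]
        # Words outside the training vocabulary are ignored.
        return log_prior + sum(log_like[index[w]] for w in doc if w in index)
    return max(model, key=score)


# Hypothetical, hand-labelled toy reviews.
train_docs = [
    "best product ever buy now".split(),
    "amazing amazing best deal".split(),
    "battery lasts two days".split(),
    "screen cracked after a week".split(),
]
train_labels = ["fake", "fake", "real", "real"]

index, matrix = count_vectorize(train_docs)
model = train_mnb(matrix, train_labels)
prediction = predict_mnb(model, index, "best deal ever".split())
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps words unseen in one class from forcing that class's probability to zero.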
Fig. 4. (a) Confusion Matrix for MNB, (b) Confusion Matrix for SVM, (c) Confusion Matrix for LR

Table I summarizes the results of all three classifiers when used with the two vectorizers.

TABLE I. SUMMARY OF RECALL, PRECISION, F1 SCORE AND ACCURACY (VALUES IN %).

Model   Vectorizer   Accuracy   Precision   Recall   F1 Score
MNB     Count        80         81          77       79
MNB     Tf-idf       81         85          74       79
SVM     Count        84         79          91       85
SVM     Tf-idf       84         82          84       83
LR      Count        85         80          92       85
LR      Tf-idf       82         81          81       81

The table presents the performance results of the different models with each vectorizer. The models used are Multinomial Naive Bayes, Support Vector Machine and Logistic Regression. The vectorizers used are the Count Vectorizer and the Tf-idf Vectorizer. The evaluation metrics used to measure the performance of each model are Accuracy, Precision, Recall, and F1 Score.

The results show that MNB using Count Vectorizer had an accuracy of 80%, while MNB using Tf-idf Vectorizer had a slightly higher accuracy of 81%. SVM had an accuracy of 84% with both the Count Vectorizer and the Tf-idf Vectorizer. LR using Count Vectorizer had the highest accuracy of 85%, while LR using Tf-idf Vectorizer had an accuracy of 82%.

The precision and recall scores varied across models and vectorizers, but accuracy and the F1 Score, which balances the trade-off between precision and recall, were used as the overall evaluation metrics. Overall, the best F1 Score and accuracy were achieved by LR using Count Vectorizer, with a score of 85% on each.

V. CONCLUSION AND FUTURE WORK

After thorough experimentation and analysis, we have presented a novel machine learning-based solution to identify and flag fake reviews. Our approach offers a reliable and efficient method for detecting deceptive reviews in large datasets, thus providing valuable insights to online businesses and consumers alike. The dataset used was collected from Kaggle and is based on Amazon reviews labelled as fake or real. We used machine learning algorithms such as SVM, LR, and MNB for training and evaluating the performance of our model, and also compared the performance of the classifiers with and without extracted features using two different vectorization techniques, the count vectorizer and the TF-IDF vectorizer. The evaluation of the models reveals that the LR classifier demonstrated the highest accuracy rate of 85% and a recall rate of 92%, indicating its potential as an effective method for detecting fake reviews. The model is deployed as a web app on the Streamlit platform for detecting fake reviews.

In our future work, we intend to explore the application of deep learning techniques to detect fake reviews. We will also evaluate the effectiveness of our proposed methodology on a larger dataset that encompasses various domains. Furthermore, we plan to investigate the use of natural language processing (NLP) techniques to extract features from review texts and improve the performance of our methodology. We expect that ongoing research projects will continue to investigate this pervasive issue and develop reliable solutions for real-time implementation.

REFERENCES

[1] Tadelis, “The economics of reputation and feedback system in e-commerce marketplaces,” IEEE Internet Computing, vol. 20, pp. 12-19, 2016.
[2] M. Kathiravan, S. J. Parvez, R. Dheepthi, R. Jayanthi and S. Gowsalya, “Analysis and Detection of Fake Profile Over Social Media using,” 5th International Conference on Smart, no. 10.1109/ICSSIT55814.2023.10061020, pp. 1164-1169, 2023.
[3] Arjun Mukherjee, V Venkataraman, Bing Liu and N. Glance, “Identifying Fake,” IEEE Transactions on Knowledge and Data Engineering, pp. 664-677, 2016.
[4] Manjanaik N, Parameshachari BD, Hanumanthappa SN, Banu R. Intra Frame Coding In Advanced Video Coding Standard (H. 264) to Obtain Consistent PSNR and Reduce Bit Rate for Diagonal Down Left Mode Using Gaussian Pulse. In IOP Conference Series: Materials Science and Engineering 2017 Aug 1 (Vol. 225, No. 1, p. 012209). IOP Publishing.
[5] M. Rabiul and R. Hassan, “Supervised and Semi-Supervised Machine Learning Approaches for Fake Review Detection,” International Conference on Computer Applications & Information, no. 10.1109/ICCAIS48691.2020.9313163, pp. 1-7, 2020.
[6] S. Ottaviano, “Fake Reviews: A Survey on,” IEEE Access, vol. 8, no. 10.1109/ACCESS.2020.2994878, pp. 103221-103243, 2020.
[7] D. Landgrebe and C. Lee, “Feature extraction based on decision,” IEEE Transactions on Pattern Analysis & Machine Intelligence, 1993.
[8] Y. Hu and X. Jing, “Fake Review Detection in Mobile App,” IEEE Access, vol. 6, no. 10.1109/ACCESS.2018.2800620, pp. 6171-6181, 2020.
[9] Jagannathan P, Gurumoorthy S, Stateczny A, Divakarachar PB, Sengupta J. Collision-aware routing using multi-objective seagull optimization algorithm for WSN-based IoT. Sensors. 2021 Dec 20;21(24):8496.
[10] Y. Yao, B. Peng, D. Lu and R. Jin, “Detecting Online Review manipulation with Review Graph Convolutional Networks,” IEEE/ACM International Conference on Advances in Social Networks

Analysis and Mining (ASONAM), no.
10.1109/ACCESS51619.2021.9563292, pp. 136-140, 2019.
[11] V. Mesyura and M. Granik, “Fake news detection using naive Bayes,”
IEEE First Ukraine Conference on Electrical and Computer
Engineering (UKRCON), pp. 900-903, 2017.
[12] S. P and R. G, “Comparative Analysis of Machine Learning
Approaches for the Early Diagnosis of Keratoconus.,” Distributed
Computing and Optimization Techniques, Springer, vol. 903, no.
10.1007/978-981-19-2281-7_23, 2022.
[13] R. Poojary, R. Raina and A. K. Mondal, “Comparative Study of Model
Optimization Techniques in Fine-Tuned CNN Models,” International
Conference on Electrical and Computing Technologies and
Applications (ICECTA), no. 10.1109/ICECTA48151.2019.8959681,
pp. 1-4, 2019.
[14] A. Shete, H. Soni, Z. Sajnani and A. Shete, “Fake News Detection
Using Natural Language Processing and Logistic Regression,” 2nd
International Conference on Advances in Computing, Communication,
Embedded and Secure Systems (ACCESS), 2021.
[15] O. H. Al-Shehabat, A. Alsmadi and Y. A. Abu-Samaha, “Fake news
detection using support vector machine and deep learning with
linguistic features,” IEEE 17th International Symposium on
Biomedical Imaging - ISBI, no. 10.1109/ISBI45749.2020.9098386.,
pp. 97-100, 2020.

