
Proceedings of the Sixth International Conference on Inventive Computation Technologies [ICICT 2021]

IEEE Xplore Part Number: CFP21F70-ART; ISBN: 978-1-7281-8501-9

Opinion Mining based Fake Product Review Monitoring and Removal System

Syed Mohammed Anas
Dept. of CSE, M S Ramaiah University of Applied Sciences, Bangalore, India
[email protected]

Santoshi Kumari
Dept. of CSE, M S Ramaiah University of Applied Sciences, Bangalore, India
[email protected]

Abstract—Fake review detection and its elimination from a given dataset using different Natural Language Processing (NLP) techniques is important in several aspects. In this article, a fake review dataset is used to train two different Machine Learning (ML) models to predict how genuine the reviews in a given dataset are. The rate of fake reviews in the E-commerce industry and even on other platforms is increasing, while buyers depend on product reviews for items found online on different websites and applications and trust a company's products before making a purchase. So the fake review problem must be addressed, so that large E-commerce industries such as Flipkart, Amazon, etc. can rectify this issue and eliminate fake reviewers and spammers, preventing users from losing trust in online shopping platforms. This model can be used by websites and applications with a few thousand users, where it can predict the authenticity of a review, based on which the website owners can take necessary action. The model is developed using Naïve Bayes and random forest methods. By applying these models, one can know the number of spam reviews on a website or application instantly. To counter such spammers, a sophisticated model is required, one which needs to be trained on millions of reviews. In this work the "Amazon Yelp dataset" is used to train the models; a very small subset is used for training on a small scale, and the approach can be scaled for higher accuracy and flexibility.

Keywords—opinion mining, sentiment analysis, text mining

I. INTRODUCTION

Online review posting has grown at a fast rate, and people buy almost everything online and have it delivered to their doorsteps. Since people cannot physically inspect a product when buying online, they depend heavily, wantedly or unwantedly, on the reviews of other buyers, so reviews must be made as truthful as possible so that buyers are not cheated by fake reviewers or spammers time and again. The problem is simple yet tiring to accomplish: reading through every review to mark it as fake or ambiguous must be done systematically to get to the root of the problem. It can be addressed by training an ML model that works on the review section to flag a particular review as genuine or spam. Interestingly, spammers who never used the product can be caught this way. A spam review, or the use of different customer ids, can falsely inflate a product's rating. Such reviews can be filtered by checking for words like "awesome", "so good", "fantastic", etc., which can be flagged, since spammers tend to hype the product or try to emulate genuine reviews by using the same words again and again to make an impact on the buyer. Hence spam filtering requires huge data to train on and, to be effective, added domain knowledge: users may write sarcastic sentences to show their dissent towards a product, and sometimes the product is good but the delivery or the packing is not, which affects the review classification. Here, an NLP technique is used to identify such reviews instead of misclassifying them as negative reviews, as in sentiment analysis; removing unwanted or outdated product reviews also involves data pre-processing.

The focus of this research is to create an environment in the online E-commerce industry where consumers build trust in a platform: the products they purchase are genuine, the feedback posted on these websites/applications is true and is checked regularly by the company, and the number of users increases day by day. Companies like Twitter, WhatsApp, and Facebook already use sentiment analysis to check fake news and harmful/derogatory posts and to ban such users/organizations from using their platforms. In parallel, E-commerce (Flipkart, Amazon), hotel booking (Trivago), logistics and tourism (TripAdvisor), job search (LinkedIn, Glassdoor), food (Swiggy, Zomato), etc. use algorithms to tackle fake reviews and spammers who deceive consumers into buying below-average products/services. Users also need to be alerted to spammers, for example through a "not verified profile" tag, so that they need not worry about such false users. Manual labelling of reviews is practically time-consuming and less effective; for example, Mukherjee et al. manually labelled 2431 reviews over 8 weeks, so automated review labelling, as proposed by Saumya et al., is needed to save time and energy. Some industries pay money for fake reviews of their products and services, in which case it is not possible to label a review as spam or not spam; Amazon's "Yelp" dataset has 30% to 40% spam reviews. Feature selection is an important aspect of selecting and training these models. In this work, two models are developed and compared to justify their performance on this "Amazon Yelp" dataset and their relevance for deployment in real-time software. The Random Forests (RF) model performed well compared to the Naïve Bayes algorithm by a large margin in the fake review data analysis. The fake review detection problem is addressed fairly, and a fair insight is given into its legality and need. The purpose is to select a suitable algorithm to fulfill the task of fake review detection and its elimination.


The rest of the article is arranged as follows: Section II describes the related work, Section III explains the methods and dataset, and Section IV depicts the model outline and working. Section V demonstrates the results and analysis. Finally, Section VI draws the conclusion of the research work along with its future scope.
II. RELATED WORK

Previous analyses were done on views expressed through text, blogs, reviews, feedback, etc. as opinions by users, which are unique to compute and study to obtain relevant information; that is nothing but sentiment analysis. Existing research used a two-step approach with an SVM classifier for classifying tweets [1]; others used emoticons, smileys, and hashtags to classify labels into multiple sentiments [2]. Another researcher used an SVM classifier for training data using emoticons [3].

2.1 Existing systems:

2.1.1 Lexicon-based methods: based on counting the number of positive and negative words in a sentence, e.g., for Twitter (a minimal sketch follows this list).

2.1.2 Rule-based methods: based on syntactic rules, e.g., "[query] is pos-adj".

2.1.3 Machine-learning-based methods: based on a classifier built on training data, e.g., Twitter sentiment [14].
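A minimal sketch of the lexicon-based idea in 2.1.1, using tiny hypothetical word lists (real systems use curated lexicons such as those shipped with NLTK):

```python
# Tiny hypothetical lexicons for illustration; real systems use curated lists.
POSITIVE = {"good", "great", "awesome", "fantastic"}
NEGATIVE = {"bad", "poor", "terrible", "awful"}

def lexicon_score(sentence: str) -> int:
    """Count positive minus negative words; the sign gives the sentiment."""
    words = sentence.lower().replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(lexicon_score("awesome product, so good"))      # > 0 -> positive
print(lexicon_score("terrible quality, awful pack"))  # < 0 -> negative
```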
For the fake review detection problem, researchers have developed several techniques. New models like the ICF++ model use an honesty value; that research increased accuracy by 49% [7]. A VADER- and polarity-based approach was used to flag reviews into true, false, and suspected categories, assigning polarities of +1, -1, and 0 to identify, classify, and eliminate fake reviews [4]. A review of the other methods and techniques used by researchers over the past decade is a great collection of the vast literature compiled on sentiment analysis and gives in-depth information on methods related to fake review detection [5]. Spam review detection based on the comments made on reviews helps in sensing review reliability and truthfulness; that method achieved an F1 score of 91% [8]. The detection of spam among singleton reviews using temporal patterns correlated with singleton spam reviews was also followed [9]. The KL divergence algorithm, essentially used to differentiate fake reviews from original ones due to its asymmetric property, suffers from the issue of pseudo fake reviews [10]. Sequences of reviews were used to filter spam, applying feature extraction up to six times to classify fake and genuine reviews [6]. Another researcher proposed spam review detection using a new time series prediction concept that uses pattern recognition to find the suspicious time intervals in which spam reviews were posted [11]. Other authors utilized activeness, context similarity, and the behavior of review ratings to compute a spam score, and explored deep neural networks, including recurrent and convolutional networks, to study model behavior in detecting spam opinions; there, raw text is converted to vectors that are in turn used as features to locate spam reviews [12].
III. METHODS

1) Dataset
The dataset used is the "Amazon academic review" dataset, which contains reviews, useful votes, ratings, user ids, and many other attributes; the useful parameters are retrieved for feature engineering. The dataset contains thousands of original and fake reviews mixed together, so the accuracy of the model implemented on it can be assessed easily. The Yelp dataset released for the academic challenge contains information for 11,537 businesses. It has 8,282 check-in sets, 43,873 users, and 229,907 reviews for these businesses (www.yelp.com/dataset). The dataset is challenging since it contains a large set of varied reviews and parameters for training any algorithm.
2) Pre-processing
Pre-processing is the first step in analyzing any dataset. It includes removing unnecessary attributes, punctuation, stop words, missing values, redundant words, etc. to clean the dataset for training purposes. This ensures proper training of the model; a minimal cleaning sketch follows.
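A minimal sketch of this cleaning step, assuming NLTK is available and that the raw reviews sit in a hypothetical pandas column named review_text:

```python
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the English stop word list once
STOP_WORDS = set(stopwords.words("english"))

def clean_review(text: str) -> str:
    """Lowercase, strip punctuation, and drop English stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

# 'review_text' is a hypothetical column name used for illustration.
df = pd.DataFrame({"review_text": ["This product is SO good!!!", "It is not bad at all."]})
df["clean_text"] = df["review_text"].map(clean_review)
print(df["clean_text"].tolist())
```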
3) Feature Engineering
This step involves all the methods used to remove unwanted information from the dataset; it is also called data cleaning. It is necessary for finding the gaps and the relationships between the different attributes (columns) and using them to draw valid conclusions. Libraries from the NLTK package are used to construct a bag-of-words corpus; term frequency, tokenizer, and stop word functions are applied. Unwanted words like "is", "then", "to", "why", etc., which are not required in this context and add no value to feature engineering, are grouped under the English stop word list and removed. Term frequency counts the number of times a particular word has occurred; a word a spammer uses again and again can thus help identify the spammer. A bag-of-words sketch follows.
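The paper builds its bag of words with NLTK utilities; as an illustrative stand-in, scikit-learn's CountVectorizer yields the same kind of term-frequency features:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned review texts.
corpus = [
    "awesome product so good fantastic",
    "awesome awesome awesome product",
    "delivery late packaging damaged product okay",
]

# Bag of words: each review becomes a vector of per-word counts,
# so heavily repeated hype words show up as large term frequencies.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```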
4) Sampling of data
Since a huge number of reviews are used, the dataset is subjected to sampling before even being fed to the classifier. Sampling lowers the load on the classifier by loading the data in chunks. Here, different labels are used to mark reviews as authentic or fake; after labelling, the two columns are concatenated and the data frame is returned, as sketched below.
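A minimal sketch of this labelling-and-sampling step, with two hypothetical frames standing in for the genuine and fake review pools:

```python
import pandas as pd

# Hypothetical pools of reviews used only for illustration.
genuine = pd.DataFrame({"text": ["great battery life", "fits as described"]})
fake = pd.DataFrame({"text": ["awesome awesome buy now", "so good fantastic"]})

genuine["label"] = 0  # 0 = authentic
fake["label"] = 1     # 1 = fake/spam

# Concatenate the two labelled frames, then draw a sample so the
# classifier is fed a lighter chunk of the full dataset.
df = pd.concat([genuine, fake], ignore_index=True)
sample = df.sample(frac=0.5, random_state=42)
print(sample)
```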


IV. MODEL OUTLINE AND WORKING

1) Naïve Bayes algorithm: A Naïve Bayes algorithm was utilized to assemble a binary classification model that predicts whether a review's sentiment is positive or negative. A Naïve Bayes classifier assumes that the value of a particular feature is independent of the value of any other feature, given the class variable. It uses the training data to compute the likelihood of each outcome based on the features. One significant trait of the Naïve Bayes algorithm is that it makes assumptions about the data: it assumes that all the features in the dataset are independent and equally important. Equations (1), (2), and (3) below are the standard form of any Naïve Bayes problem and are used to compute the probabilities, in the range (0, 1), needed for prediction. Here P denotes a probability, a, b, x_i, and y are the values whose probability is calculated, σ is the standard deviation, and μ is the mean of the attributes [13].

$P(a \mid b) = \dfrac{P(b \mid a)\,P(a)}{P(b)}$  (1)

$P(y \mid x_1, \dots, x_n) = \dfrac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$  (2)

$P(x_i \mid y) = \dfrac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\dfrac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$  (3)
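A minimal sketch of such a classifier on term-frequency features; scikit-learn's MultinomialNB is used here as an illustrative stand-in, with hypothetical toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled reviews for illustration: 1 = fake/spam, 0 = genuine.
texts = [
    "awesome awesome so good fantastic buy now",
    "so good so good best product ever fantastic",
    "battery lasted two days shorter than advertised",
    "fits well but the packaging arrived damaged",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # bag-of-words count features

# Naive Bayes applies Bayes' rule (eqs. 1-2), treating features as independent.
clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["fantastic fantastic so good"])))
```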

2) Random forest classifier: It is a supervised learning algorithm used to train and test machine learning models. The "forest" is an ensemble of decision trees trained with the "bagging" method: decision trees are combined to increase the performance and learning of the model and obtain good overall results. It basically merges multiple decision trees to amplify the performance of the random forest and get a more accurate prediction [13].
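A companion sketch with scikit-learn's RandomForestClassifier, reusing the toy data from the Naïve Bayes example:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "awesome awesome so good fantastic buy now",
    "so good so good best product ever fantastic",
    "battery lasted two days shorter than advertised",
    "fits well but the packaging arrived damaged",
]
labels = [1, 1, 0, 0]  # 1 = fake/spam, 0 = genuine

X = CountVectorizer().fit_transform(texts)

# Bagging: each tree sees a bootstrap sample of the rows, and the
# forest averages the trees' votes for a more accurate prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, labels)
print(forest.predict(X[:1]))
```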
a) Accuracy = (TP + TN) / (TP + FP + FN + TN)

b) Precision = TP / (TP + FP)

c) Recall (sensitivity) = TP / (TP + FN)

d) F1_score = 2 * (Recall * Precision) / (Recall + Precision)
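A quick sketch computing the four measures from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, FP, FN, TN = 90, 15, 5, 110

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)  # sensitivity
f1_score = 2 * (recall * precision) / (recall + precision)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1_score:.3f}")
```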
All the above parameters determine the performance of the model; the results of the models are shown along with the confusion matrix.

Figure 1: Model diagram for fake review detection

The flowchart in Fig. 1 is explained as follows. Problem solving starts with dataset collection, which requires attention in selecting the correct dataset and checking whether it is binary or categorical. To get the data in the required format for building the model, the reviews are loaded from the Yelp academic dataset review JSON file. Later, for brevity, only those attributes that are useful in later steps are shortlisted. Feature extraction is done and used to train the Random Forests and Naïve Bayes models, which record the relationships between different attributes and then use them for classification. After training, the model is fed new data or test data, and its classification accuracy and the other measures illustrated in Table 1 are computed and tuned accordingly for better results. The measurement parameters of this model are the confusion matrix, accuracy, precision, sensitivity, and F1_score.
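A minimal sketch of the loading-and-shortlisting step the flowchart begins with, assuming the standard file name yelp_academic_dataset_review.json; the attribute shortlist is a hypothetical example:

```python
import pandas as pd

# The Yelp academic dump is newline-delimited JSON, one review per line.
reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)

# Hypothetical shortlist: keep only the attributes useful downstream.
reviews = reviews[["user_id", "stars", "text", "useful"]]
print(reviews.head())
```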

V. RESULTS AND DISCUSSION


Table 1: Compiled results of both models

S. No  Parameter                    Naïve Bayes (%)  Random Forests (%)
1      Accuracy Score               79.007           89.487
2      Precision Score              70.224           85.577
3      Recall Score (Sensitivity)   99.099           94.389
4      F1 Score                     82.169           89.768

From Table 1, it can be inferred that both models performed fairly well, but the random forests classifier is better in comparison: random forests achieved better accuracy, precision, and F1 scores. It is concluded that a random forest classifier can be used for the fake product review monitoring and removal approach.


When compared across diverse applications, the models perform well in certain fields and are incompatible in some areas; hence applying them needs some experience.

VI. CONCLUSION AND FUTURE SCOPE

The results discussed in this article are a comparison of two models developed to justify model performance on this "Amazon Yelp" dataset and their relevance for deploying these models in real-time software. The Random forests model performed well compared to the Naïve Bayes algorithm by a large margin. The fake review detection problem is addressed fairly, and a fair insight is given into its legality and need; the purpose is to select an algorithm that fulfills the task of fake review detection and its elimination. In future work, hybrid and new models can be tried for the fake review detection task. By using Google Colab and an NVIDIA GPU, the research can speed up the process of execution.
References

[1] L. Barbosa and J. Feng, "Robust Sentiment Detection on Twitter from Biased and Noisy Data," Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference, vol. 2, pp. 36-44, 2010.
[2] D. Davidov and O. Tsur, "Enhanced Sentiment Learning Using Twitter Hashtags and Smileys," Institute of Computer Science, The Hebrew University, 2010.
[3] A. Go, R. Bhayani, and L. Huang, "Twitter sentiment classification using distant supervision," Processing, vol. 150, 2009.
[4] D. Patel, A. Kapoor, and S. Sonawane, "Fake review detection using opinion mining," International Research Journal of Engineering and Technology (IRJET), vol. 5, issue 12, Dec. 2018.
[5] K. Ravi and V. Ravi, "A survey on opinion mining and sentiment analysis: Tasks, approaches and applications," Knowledge-Based Systems, vol. 89, pp. 14-46, 2015.
[6] K. Khan et al., "Mining opinion components from unstructured reviews: A review," Journal of King Saud University – Computer and Information Sciences, 2014, http://dx.doi.org/10.1016/j.jksuci.2014.03.009.
[7] E. D. Wahyuni and A. Djunaidy, "Fake review detection from product review using modified method of iterative computation framework," MATEC Web of Conferences 58, 03003 (2016), BISSTECH 2015.
[8] S. Saumya and J. P. Singh, "Detection of spam reviews: a sentiment analysis approach," CSIT 6, pp. 137-148, 2018, https://doi.org/10.1007/s40012-018-0193-0.
[9] S. Xie, G. Wang, S. Lin, and P. S. Yu, "Review spam detection via temporal pattern discovery," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2012, pp. 823-831.
[10] A. Mukherjee, V. Venkataraman, B. Liu, and N. Glance, "Fake review detection: classification and analysis of real and pseudo reviews," Technical Report UIC-CS-2013-03, University of Illinois at Chicago, 2013.
[11] A. Heydari, M. Tavakoli, and N. Salim, "Detection of fake opinions using time series," Expert Systems with Applications, vol. 58, pp. 83-92, 2016.
[12] Y. Ren and D. Ji, "Neural networks for deceptive opinion spam detection: an empirical study," Information Sciences, vol. 385, pp. 213-224, 2017.
[13] A. McCallum, "Graphical Models, Lecture 2: Bayesian Network Representation" (PDF), retrieved 22 October 2019.
[14] S. I. T. Joseph, "Survey of data mining algorithms for intelligent computing system," Journal of Trends in Computer Science and Smart Technology (TCSST), vol. 1, no. 01, pp. 14-24, 2019.
