Fake product review detection CPP 22508 sem v
1.0 Abstract
Online spam reviews are deceptive evaluations of products and services. They are
often written as a deliberate manipulation strategy to deceive readers.
Recognizing such reviews is an important but challenging problem. In this work, I try
to solve this problem by using different data mining techniques, and I explore the
strengths and weaknesses of those techniques in detecting fake reviews. I start with
different supervised techniques such as Support Vector Machine (SVM),
Multinomial Naive Bayes (MNB), and Multilayer Perceptron.
The results show that all of the above-mentioned supervised techniques
can detect fake reviews with more than 86% accuracy. Then, I work on a
semi-supervised technique which reduces the dimensionality of the input feature
vector but offers similar performance to existing approaches. I use a combination of
topic modeling and SVM to implement the semi-supervised technique. I
also compare the results with other approaches that consider all the words of a dataset
as input features. I found that topic words are enough as input features to reach
accuracy similar to that of approaches that use all the words as
input features. Finally, I propose an unsupervised learning approach named
Words Basket Analysis for fake review detection. I use five Amazon product
review datasets for the experiments and report the performance of the proposed
approach on these datasets.
2.0 Object and Scope of The Project
OBJECTIVE:-
Our study employs statistical methods to evaluate the performance of the detection
mechanism for fake reviews and to evaluate the accuracy of this detection. Here, we
present a literature review of studies that applied statistical methods.
Sentiment analysis issues
There are several issues to consider when conducting sentiment analysis (SA). In this
section, some issues are addressed. First, a viewpoint (or opinion) observed as
negative in one situation might be considered positive in another situation. Second,
people do not always express opinions in the same way. Most common text processing
techniques rely on the fact that minor changes between two text fragments are
unlikely to change the actual meaning.
Textual reviews
Most of the available reputation models depend on numeric data available in different
fields; an example is ratings in e-commerce. Also, most of the reputation models focus
only on the overall ratings of products without considering the reviews which are
provided by customers [15]. On the other hand, most websites allow consumers to add
textual reviews to provide a detailed opinion about the product [14] [17]. The reviews
are available for users to read. Also, customers increasingly depend on reviews
rather than on ratings. Reputation models can use SA methods to extract users'
opinions and use this data in the reputation system. This information may include
consumers' opinions about different product features.
Detecting Fake Reviews Using Machine Learning
Filtering and identifying fake reviews has substantial significance [20]. Moraes et
al. [21] proposed a technique for categorizing a single-topic textual review. A
document-level sentiment classification is applied to state a negative or positive
sentiment. Supervised learning methods are composed of two phases, namely selection
and extraction of reviews, utilizing learning models such as SVM.
Extracting the best and most accurate approach while simultaneously categorizing
customers' written reviews into negative or positive opinions has attracted attention
as a major research field. Although it is still in an introductory phase, there has
been a lot of work related to several languages [22]-[24]. Our work used several
supervised learning algorithms such as SVM, NB, KNN-IBK, K*, and DT-J48 for
sentiment classification of text to detect fake reviews.
A Comparative Study of different Classification algorithms

Table 1 shows comparative studies on classification algorithms to verify the best method
for detecting fake reviews using different datasets such as the News Group dataset, text
documents, and movie reviews datasets. It also shows that NB and distributed keyword
vectors (DKV) are accurate at detecting fake reviews [12], [13], while [11] finds that
NB is accurate and a better choice, but that work is not oriented toward detecting fake
reviews. Using the same datasets, another study finds that SVM is accurate with the
stopwords method, but it does not focus on detecting fake reviews, while [10] finds that
SVM is only accurate without the stopwords method, again without detecting fake reviews.
Sentiment analysis is very significant for detecting fake reviews [1]. However, these
studies used only supervised learning techniques evaluated on accuracy and precision.
Fundamentally, classification accuracy and precision alone are typically not enough
information to judge a result.
In our empirical study, results in three cases, with movie reviews dataset V1.0,
movie reviews dataset V2.0, and movie reviews dataset V3.0, show that SVM is robust
and accurate for detecting fake reviews when performance is measured with accuracy,
precision, F-measure, and recall.
3.0 Methodology
To accomplish our goal, we analyze a dataset of Product reviews using the
Weka tool for text classification. In the proposed methodology, as shown in Figure 1,
we follow the steps involved in SA using the approaches described below.
Step 1: Product reviews collection
To provide an exhaustive study of machine learning algorithms, the experiment is
based on analyzing the sentiment value of the standard dataset.
We have used the original dataset of the Product reviews to test our methods of
review classification. The dataset is available and has been used in [13], and it is
frequently considered the gold standard dataset by researchers working in the
field of Sentiment Analysis. The first dataset is known as Product reviews dataset
V1.0, which consists of 1400 Product reviews, of which 700 are positive
and 700 are negative. The second dataset is known as Product reviews
dataset V2.0, which consists of 2000 Product reviews, 1000 of which are
positive and 1000 of which are negative. The third dataset is known as Product
reviews dataset V3.0, which consists of 10662 Product reviews, 5331 of which
are positive and 5331 of which are negative. A summary of the three datasets
collected is described in Table II.
Step 2: Data preprocessing
The preprocessing phase includes two preliminary operations, shown in Figure
1, which help in transforming the data before the actual SA task. Data preprocessing
plays a significant role in many supervised learning algorithms.
We divided data preprocessing as follows:
StringToWordVector: Preparing the dataset for learning involves transforming
the data with the StringToWordVector filter, which is the main tool for text analysis
in Weka. The StringToWordVector filter sets the attribute value for every single word
in the transformed dataset to Positive or Negative, depending on whether the word
appears in the document or not. This filtration process is used for configuring the
different steps of the term extraction. The filtration process comprises the following
two sub-processes:
Tokenization
This sub-process makes the provided document classifiable by converting the
content into a set of features using machine learning.
Stopwords Removal
Stopwords are the words we want to filter out before training the
classifier. Some of those words are commonly used (e.g., "a," "the," "of," "I," "you,"
"it," "and") but do not give any substantial information to our labeling scheme;
instead, they introduce confusion to our classifier. In this study, we used a 630-word
English stopwords list with the Product reviews datasets.
Stopwords removal helps to reduce the memory requirements while classifying the
reviews.
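The tokenization and stopwords removal described above can be sketched in a few lines of Python. This is a minimal illustration, not the Weka StringToWordVector filter itself: the function names, the regular expression, the seven-word stopword set (standing in for the 630-word list), and the two sample reviews are all assumptions made for the example.

```python
import re

# Tiny illustrative stopword set (assumption); the study used a 630-word English list.
STOPWORDS = {"a", "the", "of", "i", "you", "it", "and"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

def build_vocabulary(documents):
    """Collect the sorted set of non-stopword tokens over all documents."""
    return sorted({t for doc in documents for t in remove_stopwords(tokenize(doc))})

def to_binary_vector(text, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0) in the text."""
    present = set(remove_stopwords(tokenize(text)))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical sample reviews, for illustration only.
reviews = ["The battery of the phone is great", "I think the battery is bad"]
vocab = build_vocabulary(reviews)
vectors = [to_binary_vector(r, vocab) for r in reviews]
print(vocab)
print(vectors)
```

Each review becomes one fixed-length vector with one attribute per remaining word, which is the presence/absence representation the classifiers are trained on.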
Attribute Selection
Removing poorly describing attributes can significantly increase
classification accuracy. Not all attributes are relevant to the classification task,
and irrelevant attributes can decrease the performance of the analysis algorithm, so
an attribute selection scheme was used for training the classifier.
Step 3: Feature Selection
Feature selection is an approach used to identify a subset of features
which are mostly related to the target model, and the goal of feature selection is to
increase the level of accuracy. In this study, we implemented one feature selection
method (BestFirst + CfsSubsetEval, GeneticSearch) widely used for the classification
task of SA with the stopwords method. The results differ from one method to the other.
For example, in our analysis of the Product review datasets, we found that the
SVM algorithm proved to be more accurate in the classification.
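To make the idea of scoring features against the class label concrete, here is a small sketch that ranks words by a chi-square statistic. Note that this is a deliberately simpler, swapped-in technique: it scores each word on its own and is not the correlation-based subset evaluation (CfsSubsetEval with BestFirst or GeneticSearch) used in the study. The function names, labels, and toy data are assumptions.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 word/class contingency table.

    n11: positive documents containing the word
    n10: negative documents containing the word
    n01: positive documents without the word
    n00: negative documents without the word
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0

def rank_features(vectors, labels, vocabulary, positive="pos"):
    """Score every word and return (scores, words sorted by descending score)."""
    scores = {}
    for j, word in enumerate(vocabulary):
        n11 = sum(1 for v, y in zip(vectors, labels) if v[j] == 1 and y == positive)
        n10 = sum(1 for v, y in zip(vectors, labels) if v[j] == 1 and y != positive)
        n01 = sum(1 for v, y in zip(vectors, labels) if v[j] == 0 and y == positive)
        n00 = sum(1 for v, y in zip(vectors, labels) if v[j] == 0 and y != positive)
        scores[word] = chi_square(n11, n10, n01, n00)
    ranked = sorted(vocabulary, key=lambda w: scores[w], reverse=True)
    return scores, ranked

# Hypothetical binary word vectors and sentiment labels, for illustration only.
vocabulary = ["bad", "battery", "great"]
vectors = [[0, 1, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]
labels = ["pos", "pos", "neg", "neg"]
scores, ranked = rank_features(vectors, labels, vocabulary)
print(ranked)
```

In this toy data "bad" and "great" separate the two classes perfectly and score highest, while "battery" appears in both classes and ranks last; keeping only the top-ranked words is one simple way to shrink the attribute set.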
Step 4: Sentiment Classification algorithms
In this step, we use sentiment classification algorithms, which have been
applied in many domains such as commerce, medicine, media, and biology. There are
many different classification techniques, such as NB, DT-J48, SVM, K-NN,
Neural Networks, and Genetic Algorithms. In this study, we use five popular
supervised classifiers: the NB, DT-J48, SVM, K-NN, and KStar algorithms.
• Naive Bayes (NB): The NB classifier is a basic probabilistic classifier based
on applying Bayes' theorem. NB calculates a set of probabilities from combinations
of values in a given dataset. The NB classifier also has a fast decision-making process.
• Support Vector Machine (SVM): SVM in machine learning is a supervised learning
model with an associated learning algorithm, which examines data and identifies
patterns and is used for regression and classification analysis [25].
Many classification algorithms have been proposed recently, but SVM is still one of
the most widely used and most popular classifiers.
• K-Nearest Neighbor (K-NN): K-NN is a type of lazy learning algorithm and is a
non-parametric approach for categorizing objects based on the closest training
examples. The K-NN algorithm is among the simplest of all machine learning
algorithms. The performance of the K-NN algorithm depends on several key factors,
such as a suitable distance measure, a similarity measure for voting, and the k
parameter. Each training example consists of a vector and the class label related to
that vector. In the simplest case, the class is either positive or negative. In this
study, we use a single number "k" with a value of k=3. This number decides how many
neighbors influence the classification.
• KStar (K*): K-star (K*) is an instance-based classifier. The class of a test instance is
based on the class of those training instances similar to it, as determined by some
similarity function. The K* algorithm is usually slower to evaluate the result.
• Decision Tree: The DT-J48 approach is useful for classification problems. In
the testing option, we use percentage split as the preferred method.
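As an illustration of one of the classifiers listed above, the following sketch implements K-NN with k=3 over binary word vectors. It is a toy version under stated assumptions, not Weka's IBk implementation: the Hamming distance, the simple majority vote, the function names, and the sample vectors are all choices made for this example.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors closest to the query."""
    # Sort training indices by distance; ties keep their original order.
    order = sorted(range(len(train_vectors)),
                   key=lambda i: hamming(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: binary word vectors with sentiment labels.
train_vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
train_labels = ["neg", "neg", "pos", "pos"]

prediction_a = knn_predict(train_vectors, train_labels, [0, 0, 1, 0], k=3)
prediction_b = knn_predict(train_vectors, train_labels, [1, 1, 0, 1], k=3)
print(prediction_a, prediction_b)
```

On 0/1 vectors the Hamming distance equals the squared Euclidean distance, so both give the same neighbor ordering. An odd k such as 3 also avoids tied votes when there are only two classes.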
Step 5: Detection Processes

After training, the next step is to predict the output of the model on the testing
dataset, and then a confusion matrix is generated, which classifies the reviews as
positive or negative.
True Positive: Real Positive Reviews in the testing data, which are correctly classified
by the model as Positive (P).
False Positive: Fake Positive Reviews in the testing data, which are incorrectly
classified by the model as Positive (P).
True Negative: Real Negative Reviews in the testing data, which are correctly
classified by the model as Negative (N).
False Negative: Fake Negative Reviews in the testing data, which are incorrectly
classified by the model as Negative (N).
In terms of real and fake events, True Negatives (TN) are events which are real and
are correctly labeled as real, and True Positives (TP) are events which are fake and
are correctly labeled as fake. Respectively, False Positives (FP) refer to real events
being classified as fake; False Negatives (FN) are fake events incorrectly classified
as real events. The confusion matrix shows numerical parameters that can be used to
compute measures for evaluating the Detection Process (DP) performance. In Table III,
the confusion matrix shows the counts of real and fake predictions obtained with known
data, and for each algorithm used in this study there is a different performance
evaluation and confusion matrix.
The confusion matrix is a very important part of our study because it lets us classify
the reviews from the datasets as fake or real.
The confusion matrix is applied to each of the five algorithms discussed in Step 4.
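The evaluation measures named in this report (accuracy, precision, recall, and F-measure) follow directly from the four confusion matrix counts. The sketch below assumes the convention in which "fake" is the positive class; the function names and the eight sample labels are hypothetical and are not results from the study.

```python
def confusion_counts(actual, predicted, positive="fake"):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # fake review correctly labeled as fake
        elif p == positive:
            fp += 1          # real review wrongly labeled as fake
        elif a == positive:
            fn += 1          # fake review wrongly labeled as real
        else:
            tn += 1          # real review correctly labeled as real
    return tp, fp, tn, fn

def evaluate(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F-measure from the counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical labels, for illustration only.
actual = ["fake", "fake", "fake", "real", "real", "real", "real", "real"]
predicted = ["fake", "fake", "real", "fake", "real", "real", "real", "real"]
counts = confusion_counts(actual, predicted)
results = evaluate(*counts)
print(counts, results)
```

Reporting precision, recall, and F-measure alongside accuracy matters because accuracy alone can look high when one class dominates the test data.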
Step 6: Comparison of results
In this step, we compared the accuracy obtained on the Product reviews datasets
with various classification algorithms and identified the most
significant classification algorithm for detecting fake positive and negative reviews.
4.0 Requirements for Proposed Project.
Software requirements:
1.
5.0 Process Description
Flow Chart:-
6.0 References and Bibliography

