
Batch 5 1

This document presents a project report on detecting fake online reviews using supervised learning methods. The project uses the Yelp dataset and applies the Gaussian Naive Bayes algorithm to predict fake reviews. The dataset is preprocessed by removing stop words and stemming words before features are extracted from the reviews. The dataset is then split into training and testing sets, and a model is fitted on the training set. Model performance is evaluated on the testing set using metrics such as accuracy, precision, recall and F1-score. The results can help businesses identify fake reviews and protect their reputation.

Uploaded by

Anaya Preesha
Copyright
© © All Rights Reserved

DETECTION OF FAKE ONLINE REVIEWS USING

SUPERVISED LEARNING METHODS


PROJECT REPORT
Submitted by

AJAY E (962219104007)
BHUVANESH AV (962219104040)
DERPIN LIJO DJ (962219104047)

In partial fulfillment for the award of the degree


Of

BACHELOR OF ENGINEERING
IN

COMPUTER SCIENCE AND ENGINEERING


St. XAVIER’S CATHOLIC COLLEGE OF ENGINEERING
(An Autonomous Institution)
Chunkankadai, Nagercoil – 629 003

JUNE 2023

St. XAVIER’S CATHOLIC COLLEGE OF
ENGINEERING

(An Autonomous Institution)


Chunkankadai, Nagercoil – 629 003.

BONAFIDE CERTIFICATE
Certified that this project report “DETECTION OF FAKE ONLINE
REVIEWS USING SUPERVISED LEARNING METHODS” is the bonafide
work of BHUVANESH AV (962219104040), AJAY E (962219104007),
DERPIN LIJO DJ (962219104047) who carried out the project work under my
supervision.

SIGNATURE
Mrs. P. R. Sheebha Rani, M.E., M.B.A.,
HEAD OF THE DEPARTMENT
Computer Science and Engineering
St. Xavier’s Catholic College of Engineering.
Chunkankadai-629003

SIGNATURE
Mr. J. Bright Jose, M.E.,
PROJECT SUPERVISOR
Computer Science and Engineering
St. Xavier’s Catholic College of Engineering.
Chunkankadai-629003

Submitted for the viva-voce held at St. Xavier’s Catholic College of
Engineering on ……………

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT
We express our prime gratitude to the Almighty God for his presence and
abundant grace in giving knowledge, wisdom and strength to take up this project
and complete it on time.
We would like to deliver our heartiest gratitude to our correspondent,
Rev. Fr. Dr. M. Maria William for making facilities for the successful
completion of our work. We express our gratitude and sincere thanks to our
principal Dr. J. Maheswaran M.E., Ph.D., for having given us spontaneous and
wholehearted encouragement for completing our project successfully.
We are very indebted to Mrs. P.R. Sheebha Rani, M.E., M.B.A., the
Head of Computer Science and Engineering Department, for the deluge of ideas,
assistance and valuable support that she has provided to us all throughout the
project.
We express our gratitude to our supervisor Mr. J. Bright Jose, M.E.,
Assistant Professor in the Department of Computer Science and Engineering, for
his constant guidance, innovative ideas and technical support for the successful
completion of the project. We sincerely thank our project coordinator Mr. J.
Bright Jose, M.E., Assistant Professor in the Department of Computer Science
and Engineering, for his valuable suggestions and constant support in completing
the project on time.
Last but not least, we would like to thank our parents and friends for
their valuable contributions towards this project work. Finally, we believe that
the road to improvement is never ending. We shall acknowledge all suggestions
received for further improvements in the project.

ABSTRACT
This project aims to predict online fake reviews using the Yelp dataset and
the Gaussian Naive Bayes algorithm. Online reviews are increasingly important
for businesses, but fake reviews can negatively impact their reputation. This
project will provide a solution to this problem by accurately detecting fake
reviews. The Yelp dataset contains user reviews, ratings, and metadata for
businesses in various cities. The dataset will be preprocessed by removing stop
words, stemming the words, and converting them to lowercase. Features will be
extracted from the reviews, such as the frequency of each word or the presence
of specific phrases. The dataset will be split into training and testing sets, and the
Gaussian Naive Bayes algorithm will be used to fit the model on the training set.
The performance of the model will be evaluated on the testing set using metrics
such as accuracy, precision, recall, and F1-score. The results of this project will
help businesses identify fake reviews and take appropriate actions to prevent them
from negatively impacting their reputation. The project can be extended by using
other algorithms or datasets to improve the accuracy of fake review detection.

TABLE OF CONTENTS

CHAPTER NO.    TITLE    PAGE NO.

ABSTRACT iv
LIST OF FIGURES ix
LIST OF ABBREVIATIONS x

1 INTRODUCTION 1
1.1 General 1
1.2 Problem Statement 3
1.3 Objective 3
1.4 Motivation 4
1.5 Scope 5

2 LITERATURE SURVEY 7

3 SYSTEM ARCHITECTURE 16
3.1 Existing System 17
3.2 Proposed System 17
3.2.1 Pre-Processing 17
3.2.2 Extracting Features 17
3.2.3 Splitting 17
3.2.4 Fitting 18
3.2.5 Evaluating 18
3.3 Advantages 19

4 SYSTEM REQUIREMENTS 20
4.1 Hardware Requirements 20
4.2 Software Requirements 21

5 SYSTEM DESIGN 21
5.1 System Architecture 23
5.2 Architecture Description 23
5.3 Use-case Diagram 25
5.4 Sequence Diagram 25
5.5 Data Pre-processing-Flow Diagram 26
5.6 Comparison Result 27

6 RESULT AND DISCUSSION 28
6.1 Data Loading 29
6.2 Data Pre-processing 30
6.3 Training 30
6.4 Performance Evaluation 31
6.4.1 Home Page 31
6.4.2 Admin Login 31
6.4.3 Admin Dashboard 32
6.4.4 Add Hotel 33
6.4.5 List of Hotels 33
6.4.6 Customer Signup 34
6.4.7 Customer Login 34
6.4.8 Customer Dashboard 35
6.4.9 Review Page 35
6.4.10 Customer Review 36
6.4.11 Result Page 37

7 CONCLUSION AND FUTURE WORK 38
7.1 Conclusion 38
7.2 Future Work 38

REFERENCES 40
LIST OF FIGURES
Figure No.    Figure Name    Page No.

4.1 System Architecture 23
4.2 Use Case Diagram 25
4.3 Sequence Diagram 26
4.4 Pre-Processing Module 27
4.5 Result Module 27
5.1 Data Loading 28
5.2 Data Pre-processing 28
5.3 Training 29
5.4 Performance Evaluation 29
5.5 Home Page 30
5.6 Admin Login 30
5.7 Admin Dashboard 31
5.8 Add Hotel 31
5.9 List of Hotels 32
5.10 Customer Signup 32
5.11 Customer Login 33
5.12 Customer Dashboard 33
5.13 Review Page 34
5.14 Customer Review 34
5.15 Result Page 36

LIST OF ABBREVIATIONS

SL.NO    ABBREVIATION    EXPANSION

1 NLP Natural Language Processing
2 POS Part-of-Speech
3 PU Positive Unlabelled
4 ASM Author Spamicity Model
5 NB Naïve Bayes
6 SVM Support Vector Machine
7 DT Decision Tree
8 RF Random Forest
9 GBTs Gradient-Boosted Trees
10 KNN K-Nearest Neighbours
11 TP True Positive
12 TN True Negative
13 FP False Positive
14 FN False Negative

CHAPTER 1

INTRODUCTION

1.1 General
This project aims to predict online fake reviews using the Yelp dataset and the
Gaussian Naive Bayes algorithm. Online reviews are increasingly important for
businesses, but fake reviews can negatively impact their reputation. This project will
provide a solution to this problem by accurately detecting fake reviews. The Yelp
dataset contains user reviews, ratings, and metadata for businesses in various cities.
The dataset will be preprocessed by removing stop words, stemming the words, and
converting them to lowercase. Features will be extracted from the reviews, such as
the frequency of each word or the presence of specific phrases. The dataset will be
split into training and testing sets, and the Gaussian Naive Bayes algorithm will be
used to fit the model on the training set. The performance of the model will be
evaluated on the testing set using metrics such as accuracy, precision, recall, and F1-
score. The results of this project will help businesses identify fake reviews and take
appropriate actions to prevent them from negatively impacting their reputation. The
project can be extended by using other algorithms or datasets to improve the
accuracy of fake review detection.

Everyone can freely express their views and opinions anonymously and without
fear of consequences. Social media and online posting have made it even easier
to post confidently and openly. These opinions have both pros and cons: they can
deliver the right feedback to the right person, which helps fix an issue, but they
can also be manipulated. Because these opinions are regarded as valuable, people
with malicious intentions can game the system to give an impression of
genuineness and post opinions that promote their own products or discredit
competitors' products and services, without revealing their own identity or that
of the organization they work for. Such people are called opinion spammers, and
these activities can be termed opinion spamming.

One of the biggest applications of opinion mining is in online and e-commerce
reviews of consumer products and services. Because these opinions are helpful to
both users and sellers, e-commerce websites encourage their customers to leave
feedback and reviews about the products or services they purchased. These reviews
provide valuable information that potential customers use to learn the opinions
of previous or current users before deciding to purchase a product from a seller.
Similarly, sellers and service providers use this information to identify defects
or problems users face with their products and to understand how their products
compare with similar products from competitors.

1.2 Problem Statement:
The problem this project aims to solve is the detection of fake online reviews. Online
reviews are an essential source of information for businesses and customers.
However, the increasing number of fake reviews posted online poses a significant
challenge for businesses to maintain their reputation and for customers to make
informed decisions. Fake reviews are often misleading and can manipulate
customers' perceptions of businesses. This project seeks to develop a solution to this
problem by accurately detecting fake reviews using the Yelp dataset and the
Gaussian Naive Bayes algorithm.

1.3 Objective:
The objective of this project is to develop a solution for the detection of fake online
reviews using the Yelp dataset and the Gaussian Naive Bayes algorithm. The specific
objectives include:

1.3.1 Preprocessing the Yelp dataset by removing stop words, stemming the words,
and converting them to lowercase.

1.3.2 Extracting features from the reviews, such as the frequency of each word or
the presence of specific phrases.

1.3.3 Splitting the dataset into training and testing sets.

1.3.4 Fitting the Gaussian Naive Bayes algorithm on the training set.

1.3.5 Evaluating the performance of the model on the testing set using metrics such
as accuracy, precision, recall, and F1-score.

1.3.6 Developing a reliable and accurate fake review detection model to assist
businesses in maintaining their reputation and enabling customers to make
better-informed decisions.

1.3.7 Providing insights and recommendations for businesses to take appropriate
actions to prevent fake reviews from negatively impacting their reputation.

1.3.8 Extending the project by using other algorithms or datasets to improve the
accuracy of fake review detection.
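
The preprocessing objective (1.3.1) can be illustrated with a minimal sketch. The report does not state which library, stop-word list, or stemmer is used, so everything here is an assumption: `STOP_WORDS` is a tiny hypothetical list, and `simple_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the report).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "were", "it", "to", "of", "in", "this"}

# Suffixes removed by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def simple_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(review):
    """Lowercase the text, tokenize it, drop stop words, and stem the rest."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The rooms were cleaned and the staff is amazing"))
# ['room', 'clean', 'staff', 'amaz']
```

A production pipeline would likely swap in a full stop-word list and an established stemmer; the order of operations (lowercase, tokenize, filter, stem) is the part this sketch is meant to show.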

1.4 Motivation
The motivation behind this project is to address the increasing problem of fake
reviews posted online, which can deceive customers and damage a business's
reputation. With the proliferation of e-commerce and online platforms, online
reviews have become an essential source of information for customers, influencing
their purchase decisions. However, fake reviews posted by businesses or individuals
can mislead customers and manipulate their perceptions of a business's products or
services. Detecting fake reviews is challenging as they are often written to mimic
genuine reviews and can be difficult to differentiate from authentic reviews.
Therefore, developing a reliable and accurate solution for fake review detection is
crucial for maintaining the integrity of online reviews.

The Yelp dataset provides a rich source of information for this project, containing
user reviews, ratings, and metadata for businesses in various cities. The Gaussian
Naive Bayes algorithm is an effective algorithm for text classification tasks, making
it an ideal choice for detecting fake reviews. The results of this project will benefit
both businesses and customers. Businesses can identify fake reviews and take
appropriate actions to prevent them from negatively impacting their reputation.

Customers can have more trustworthy and reliable reviews, enabling them to make
better informed decisions. Overall, the motivation for this project is to provide a
solution to the problem of fake online reviews and maintain the integrity of online
reviews for the benefit of businesses and customers alike.

1.5 Scope
The scope of this project is to develop a fake review detection model using the Yelp
dataset and the Gaussian Naive Bayes algorithm. The scope includes:

1.5.1 Preprocessing the Yelp dataset by removing stop words, stemming the words,
and converting them to lowercase.

1.5.2 Extracting features from the reviews, such as the frequency of each word or the
presence of specific phrases.

1.5.3 Splitting the dataset into training and testing sets.

1.5.4 Fitting the Gaussian Naive Bayes algorithm on the training set.

1.5.5 Evaluating the performance of the model on the testing set using metrics such
as accuracy, precision, recall, and F1-score.

1.5.6 Providing insights and recommendations for businesses to take appropriate
actions to prevent fake reviews from negatively impacting their reputation.

1.5.7 Extending the project by using other algorithms or datasets to improve the
accuracy of fake review detection.
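
As an illustration of the feature extraction item (1.5.2), the sketch below builds word-frequency vectors and a phrase-presence flag from already tokenized reviews. The function names, the sample tokens, and the phrase are hypothetical; the report does not specify its exact feature set.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def word_frequencies(tokens, vocabulary):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for tok, idx in vocabulary.items():
        vector[idx] = counts[tok]
    return vector


def has_phrase(tokens, phrase):
    """Return 1 if the token sequence `phrase` occurs contiguously, else 0."""
    n = len(phrase)
    return int(any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)))


# Two toy reviews, already preprocessed into tokens (illustrative data only).
docs = [["great", "food", "great", "service"], ["bad", "food"]]
vocab = build_vocabulary(docs)
features = [word_frequencies(d, vocab) + [has_phrase(d, ["great", "food"])] for d in docs]

print(vocab)     # {'bad': 0, 'food': 1, 'great': 2, 'service': 3}
print(features)  # [[0, 1, 2, 1, 1], [1, 1, 0, 0, 0]]
```

Each row is one review: the first four columns are word counts and the last column flags the phrase "great food".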

The project will focus on developing a fake review detection model using the Yelp
dataset and the Gaussian Naive Bayes algorithm. The project will not address other
types of online fraud, such as identity theft or phishing. Additionally, the project will
not investigate the legal implications of posting fake reviews or provide legal advice
to businesses or individuals. The project aims to provide a general solution for fake
review detection, and the performance of the model may vary depending on the
dataset and the algorithm used. Therefore, the project's scope includes exploring the
possibility of extending the project by using other algorithms or datasets to improve
the accuracy of fake review detection. Overall, the scope of this project is to develop
a reliable and accurate fake review detection model and provide insights and
recommendations for businesses to prevent fake reviews from negatively impacting
their reputation.

CHAPTER 2

LITERATURE REVIEW

2.1 Implementation Of Fake Review Detection Using Passive Aggressive Classifier

Authors: Amresh Kumar; Manish Kumar; Anandhan. K; Ajay Shanker Singh

Year: 2022

Abstract: In recent years internet traffic has surged because many more people
have started using the internet. Nowadays many paid and fake reviews flood
e-commerce websites such as Amazon and Flipkart, and many customers make
decisions based on these fake reviews, or on comments provided by others who had
similar experiences. In today's competitive environment anyone may write
anything, which has resulted in an increase in the number of bogus reviews. To
overcome this problem of fake reviews, we propose a machine learning model. The
model will be trained on 40,000 data samples to predict fake reviews. Machine
learning is one of the trending techniques for predicting an output based on
experience. Furthermore, we can test the model using testing data, which is also
part of the data set. We use the Passive Aggressive Classifier and have achieved
an accuracy of 83.53%.

2.2 Fake Review Detection Of E-Commerce Electronic
Products Using Machine Learning Techniques

Authors: V P Sumathi; S.M. Pudhiyavan; M. Saran; V. Nandha Kumar

Year: 2021

Abstract: The rapid growth of internet access has given rise to a digital era. The
availability of internet access has pushed almost 70% of the population to switch to
the internet for their daily needs and accessories. E-commerce platforms in
particular are being used at a much higher rate than ever before. People who buy
from these e-commerce platforms decide whether to buy a product solely on the
basis of the ratings and reviews that the platforms provide. Due to the simple
nature of this review system, sellers and even individuals tend to exploit it by
writing dishonest reviews with the intention of either boosting a product's
ratings or simply sabotaging it. These fake reviews are aimed at deceiving
customers and convincing them to buy, or deterring them from buying, a certain
product. Due to the lack of a robust system to distinguish real and fake reviews,
this spam manages to show up on top. To avoid this problem and provide a more
efficient way to filter reviews, this work focuses on designing a machine learning
model for fake review detection and compares the performance of three different
algorithms. In this research work the random forest algorithm outperforms the
other two algorithms. A web-based user interface (UI) is designed to remove fake
reviews and display trusted reviews based on ranking.

2.3 Boosting Accuracy of Fake Review Prediction Using Synthetic Minority
Oversampling Technique

Bhawna Saxena, et al. (2022) In recent times, prior to making a purchase, the vast
majority of buyers read reviews about the product, and their decision is largely
driven by the reviews. Deceitful online sellers often gather fake or spam reviews
for their products or services, thereby reducing the effectiveness of online
reviews. The review data is often imbalanced such that the fake reviews greatly
outnumber the genuine reviews. An imbalance leads to a bias, as the model tends to
mostly predict the majority class. To attain a high-quality classification
outcome, the issue of imbalanced data should be resolved before applying the
classification algorithms. This paper studies the performance of supervised
machine learning classifiers for fake review detection. The approach put forward
in this paper aims to improve the prediction accuracy of popular supervised
learning classifiers (Random Forest, LightGBM, XGBoost, Naive Bayes, and Decision
Tree) on an imbalanced review dataset. To boost the accuracy of these classifiers,
the Synthetic Minority Oversampling Technique (SMOTE) is used to address the class
imbalance problem. The performance of the classifiers has been studied by changing
the oversampling parameters. The application of SMOTE showed a significant
improvement in the classifiers' prediction accuracy.
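
For readers unfamiliar with SMOTE, its core idea is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a minimal sketch of that idea only; it is not the cited paper's implementation, and the function name, parameters, and sample points are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


# Toy minority-class feature vectors (illustrative values only).
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.8], [1.2, 2.0]]
new_points = smote(minority, n_new=4)
print(new_points)
```

Because every synthetic point lies on a line segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.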

2.4 A Novel Semi-supervised Algorithm to Find Accuracy in Fake Review Detection
using Comparing with K-neighbours Algorithm

SrujanaSree, et al. (2022) This study is centred on semi-supervised algorithms
and K-neighbors algorithms for optimising false review identification in an effort to
find the accuracy of real-time fake review detection. The NSS parameter and the
random forest parameter are both adjusted in order to simulate the N-neighbors
algorithm (N=32) and the K-neighbors algorithm (N=32). This is done in order to
optimise the pH. In this work, a total of 20 samples were utilised, and the sample
size was determined by applying the Gpower 80% formula to each of the two groups.
According to the findings, the accuracy achieved by the NSS algorithm is much
higher (75.60%) than that achieved by the Random forest method (74.50%). It was
determined that the difference in statistical significance between the
semi-supervised algorithm and K-neighbors was 0.916 (p>0.05). When it comes to
detecting fraudulent reviews and improving accuracy, the results produced by the
semi-supervised algorithm are superior to those produced by the K-neighbors
approach.

2.5 Detection of fake online reviews using semi-supervised
and supervised learning
Rakibul Hassan, et al. (2019) Online reviews have a great impact on today's business
and commerce. Decision making for the purchase of online products mostly depends on
reviews given by users. Hence, opportunistic individuals or groups try to
manipulate product reviews for their own interests. This paper introduces some
semi-supervised and supervised text mining models to detect fake online reviews
and compares the efficiency of both techniques on a dataset containing hotel reviews.

2.6 Fake Online Reviews: A Unified Detection Model Using Deception Theories

Mujahed Abdulqader, et al. (2022) Online reviews influence consumers’ purchasing
decisions. However, identifying fake online reviews automatically remains a
complex problem, and current detection approaches are inefficient in preventing the
spread of fake reviews. The literature on fake reviews detection lacks a
comprehensive and interpretable theory-based model with high performance, which
enables us to understand the phenomenon from a psychological perspective and
analyze reviews based on user-generated content as well as consumer behavior. In
this research, we synthesized ten well-founded deception theories from psychology,
namely leakage theory, four-factor theory, interpersonal deception theory, self-
presentational theory, reality monitoring theory, criteria-based content analysis,
scientific content analysis, verifiability approach, truth-default theory, and
information manipulation theory, and selected nine relevant constructs to develop a
unified model for detecting fake online reviews. These constructs include specificity,
quantity, non-immediacy, affect, uncertainty, informality, consistency, source
credibility, and deviation in behavior. We characterized the selected constructs using
verbal and non-verbal features to validate the proposed model empirically.
Subsequently, we extracted features from the Yelp datasets and used them to train
four machine learning algorithms, specifically Logistic Regression, Naïve Bayes,
Decision Tree, and Random Forest. We demonstrated that quantity, non-immediacy,
affect, informality, consistency, source credibility, and deviation in behavior are
essential constructs for detecting fake reviews. To our surprise, we discovered that
non-verbal features are more important than verbal features and that combining
features from both types improves the prediction performance. Our theory-based
model outperformed most of the state-of-the-art fake review detection models and
yielded high interpretability and low complexity.

2.7 Parametric Analysis for Fake Reviews Identification

Vikas Attri, et al. (2021) Online reviews are one of the most important aspects in a
buyer's choice to buy a new product or use a service. As a result, they serve as a
helpful source of data for determining public opinion regarding these products and
services. They also give companies an indication of what kind of changes they need
to make in their products to improve further. Thus, reviews also give competitors
and product-based organizations an opportunity to create fake reviews in order to
advertise or degrade a product based on their interest. Hence, it is vital that the
correct reviews reach the customers, and for this, fake reviews must be detected
effectively. In order to reduce the time needed for fake review detection,
automated techniques are currently being used. Another concern is how
to differentiate between original and fake reviews. This paper discusses the
various factors that can help identify them. They are broadly
classified into two types: behavioral and feature-based. The challenges that
remain in fake review identification methods are also described, and the open
research areas where further work can be carried out are highlighted. The
factors mentioned in the paper can prove useful for improving the performance of
any fake review detection system once applied to a real data set.

2.8 Fake Reviews Detection using Support Vector Machine

R. Poonguzhali, et al. (2022) One of the fastest expanding business categories in the
world today is internet shopping. People nowadays buy a lot of things from internet
shopping sites. Customers can buy better-quality products based on the reviews
given by previous buyers of the products. Reviews include text reviews, ratings and
smileys. A product may have hundreds of reviews, some of which are fake. Opinion
mining from natural languages is a difficult
method for evaluating customers' sentiments, but sentiment analysis provides the best
answer. It provides crucial data for decision-making in a variety of fields. So, we
propose a fake review detection system using a support vector machine, which detects
the fake reviews of the products. The primary goal is to suggest higher-quality
products to the user. We use the support vector machine algorithm to classify the
reviews into positive and negative groups. Finally, fake reviews posted by the users
are predicted. The reviews are grouped as negative, positive and neutral. In
this system, only users who have made a purchase can post reviews, and duplicates are
verified based on user id and booking id. Genuine reviews are considered for product
recommendation.

2.9 Fake Review Detection on Yelp Dataset Using Classification Techniques in
Machine Learning

Andre Sihombing, et al. (2019) This paper provides a summary of our research, which
aims to build a machine learning model that can detect whether the reviews in Yelp's
dataset are true or fake. In particular, we applied and compared different
classification techniques in machine learning to find out which one would give the
best result. Brief descriptions of each of the classification techniques are provided
to aid understanding of why some methods are better than others in some cases. The
best result was achieved by using the XGBoost classification technique, with the
F1-score reaching 0.99 in prediction.

2.10 Opinion Spam Detection in Product Reviews Using Self-Training
Semi-Supervised Learning Approach

Dini Adni Navastara, et al. (2019) The review of a product can influence a buyer's
decision to buy the product. In addition to influencing buyer decisions, fake reviews
can also confuse buyers who are looking for product information from honest and
genuine reviews. We need a system that can filter spam to reduce the negative
influence on product selling and product review writing. The spam to be detected
is of the brand-only and non-review types. Those types get their initial label
through manual labeling. Manual labeling requires a lot of time and effort.
Therefore, in this paper, we proposed a self-training semi-supervised learning
approach. This method labels spam from the prediction of the labeled training data.
The best results were obtained with a scenario without stemming, merging of
review-centric features and bigrams, SMOTE borderline1 oversampling and a Polynomial
SVM kernel, which reached an accuracy of 86.33%.

CHAPTER 3

SYSTEM ARCHITECTURE

3.1 Existing System


Currently, there are several existing systems for detecting fake reviews online. These
systems use various methods and algorithms to analyze the reviews and identify
potential fake reviews. Some of the existing systems are:

1. ReviewMeta: ReviewMeta is a tool that uses a machine learning algorithm to
analyze reviews on Amazon and identify potential fake reviews. The tool evaluates
factors such as the reviewer's history, the review's language, and the timing of the
review to identify potential fake reviews.

2. Fakespot: Fakespot is a web tool that uses a machine learning algorithm to analyze
reviews on Amazon, Yelp, and other platforms. The tool analyzes the language,
sentiment, and reviewer history to identify potential fake reviews.

3. Yelp's Automated Review Filter: Yelp uses an automated review filter to
identify and remove potentially fake reviews. The filter uses an algorithm that
evaluates factors such as the reviewer's activity, the review's content, and the timing
of the review to identify potential fake reviews.

4. Stanford's Deception Detection Project: Stanford University's Deception
Detection Project is a research project that aims to develop a system for detecting
deception in online reviews. The project uses machine learning algorithms to analyze
reviews and identify deceptive language and other cues of deception.

5. Cornell's Review Skeptic: Review Skeptic is a web tool developed by Cornell
University that uses a machine learning algorithm to analyze reviews and identify
potential fake reviews. The tool evaluates factors such as the review's language, the
reviewer's history, and the product's characteristics to identify potential fake
reviews.

While these existing systems provide valuable solutions for detecting fake reviews,
they have limitations. Some of the limitations include:

1. The systems may not be effective in detecting all types of fake reviews.

2. The systems may generate false positives or false negatives.

3. The systems may not be able to keep up with the evolving tactics of fake
reviewers.

4. The systems may require a significant amount of data and resources to be
effective.

Therefore, there is still a need for further research and development in the field of
fake review detection to improve the accuracy and reliability of the systems.

3.2 Proposed System


The proposed system for detecting fake reviews is based on the Gaussian Naive
Bayes algorithm and uses the Yelp dataset as a source of reviews. The system
consists of the following steps:

3.2.1 Preprocessing : Preprocessing the Yelp dataset by removing stop words,
stemming the words, and converting them to lowercase.

3.2.2 Extracting features : Extracting features from the reviews, such as
the frequency of each word or the presence of specific phrases.

3.2.3 Splitting : Splitting the dataset into training and testing sets.

3.2.4 Fitting : Fitting the Gaussian Naive Bayes algorithm on the training set.

3.2.5 Evaluating : Evaluating the performance of the model on the testing set
using metrics such as accuracy, precision, recall, and F1-score.

Using the model to predict whether a review is fake or genuine.
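
The splitting, fitting, and prediction steps above can be sketched in a few lines. The report does not name the library it uses, so this is a from-scratch stand-in written with the standard library only: `train_test_split`, `TinyGaussianNB`, the 75/25 split ratio, and the toy feature values are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def train_test_split(X, y, test_ratio=0.25, seed=42):
    """Shuffle indices and split features/labels into train and test parts."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes: one mean and variance per class and feature."""

    def fit(self, X, y, eps=1e-9):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            n = len(rows)
            means = [sum(col) / n for col in zip(*rows)]
            # eps keeps the variance positive for constant features.
            variances = [sum((v - m) ** 2 for v in col) / n + eps
                         for col, m in zip(zip(*rows), means)]
            self.stats[label] = (math.log(n / len(X)), means, variances)
        return self

    def predict_one(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            log_lik = sum(-0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                          for x, m, var in zip(row, means, variances))
            return log_prior + log_lik
        # Pick the class with the highest (unnormalized) log posterior.
        return max(self.stats, key=log_posterior)

    def predict(self, X):
        return [self.predict_one(row) for row in X]


# Toy feature vectors (e.g. counts of two suspicious words); illustrative only.
X = [[3, 2], [4, 3], [3, 3], [5, 2], [0, 1], [1, 0], [0, 0], [1, 1]]
y = ["fake"] * 4 + ["genuine"] * 4

X_train, y_train, X_test, y_test = train_test_split(X, y)
model = TinyGaussianNB().fit(X_train, y_train)
print(model.predict(X_test), y_test)
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied, which matters once the feature vectors grow to vocabulary size.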

The proposed system is designed to provide a reliable and accurate solution for
detecting fake reviews in the Yelp dataset. The system uses the Gaussian Naive
Bayes algorithm, which is a simple yet effective algorithm for text classification
tasks. The algorithm is based on Bayes' theorem and assumes that the features are
conditionally independent given the class and that each feature follows a normal
distribution within a class. These assumptions keep training and prediction fast
even when a review is represented by many word-level features. The system uses the
Yelp dataset as a source of reviews, which provides a rich source of data for fake
review detection. The dataset contains user reviews, ratings, and metadata for
businesses in various cities.
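For reference, the decision rule of a Gaussian Naive Bayes classifier can be written as follows. The symbols are notation introduced here for illustration: x_1, ..., x_n are the features of a review, c is a class (fake or genuine), and \mu_{c,i} and \sigma_{c,i}^2 are the mean and variance of feature i estimated from the training reviews of class c.

```latex
\hat{y} = \arg\max_{c \in \{\text{fake},\ \text{genuine}\}} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left( -\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right)
```

The product over features is where the independence assumption enters, and the Gaussian density is what distinguishes this variant from other Naive Bayes models.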

The system's performance will be evaluated using metrics such as accuracy,
precision, recall, and F1-score. The evaluation will provide insights into the
effectiveness of the system in detecting fake reviews and its potential for further
development. Overall, the proposed system aims to provide a reliable and accurate
solution for detecting fake reviews in the Yelp dataset, which can benefit both
businesses and customers. The system can help businesses identify fake reviews and
take appropriate actions to prevent them from negatively impacting their reputation.
Customers can have more trustworthy and reliable reviews, enabling them to make
better-informed decisions.
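As an illustration of how these four metrics relate to each other, the sketch below computes them from the counts of true and false positives and negatives. The function name and the example labels are hypothetical; "fake" is treated as the positive class (1).

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels: 1 = fake, 0 = genuine.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here three fake reviews are caught, one is missed, and one genuine review is wrongly flagged, so all four metrics come out at 0.75.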

3.3 Advantages
The proposed system for detecting fake reviews using the Yelp dataset and Gaussian
Naive Bayes algorithm offers several advantages, including:

1. High accuracy: Naive Bayes classifiers are known to perform well on many text
classification tasks despite their simplicity, making Gaussian Naive Bayes a
suitable choice for detecting fake reviews.

2. Fast processing: The Naive Bayes algorithm is computationally efficient and
can process large amounts of data quickly, making it a good choice for processing
the Yelp dataset.

3. Reliability: The proposed system uses a rich dataset from Yelp, which
provides a reliable source of data for detecting fake reviews. The Yelp dataset
contains user reviews, ratings, and metadata for businesses in various cities, which
can help in identifying patterns and trends in the reviews.

4. Scalability: The system can be scaled up to handle larger datasets and more
complex models, making it suitable for businesses and researchers who want to
analyze large amounts of review data.

5. Easy implementation: The Gaussian Naive Bayes algorithm is easy to
implement, making it accessible to businesses and researchers who want to develop
their own systems for detecting fake reviews.

Overall, the proposed system provides a reliable, accurate, and scalable solution for
detecting fake reviews in the Yelp dataset, which can benefit both businesses and
customers. The system can help businesses improve their reputation by identifying
and removing fake reviews, while customers can have more trustworthy and reliable
reviews to inform their purchasing decisions.

CHAPTER 4

SYSTEM REQUIREMENTS

4.1 SOFTWARE REQUIREMENTS:

Operating System       Windows 10 (32 or 64 bit)
Application Server     Flask Framework
Front End              HTML, CSS
Packages               NumPy, scikit-learn, pandas, Matplotlib
Back End               Python

4.2 HARDWARE REQUIREMENTS:

Processor              Core 2 Duo
Speed                  2 GHz
Random Access Memory   4 GB
HDD                    500 GB
Keyboard               Windows Keyboard
Mouse                  Three Button Mouse
Monitor                SVGA

CHAPTER 5
SYSTEM DESIGN
The system design for detecting fake reviews using the Yelp dataset and Gaussian
Naive Bayes algorithm involves the following steps:

1. Data preprocessing: The Yelp dataset is preprocessed by removing stop
words, stemming the words, and converting them to lowercase. This step helps to
reduce the dimensionality of the data and improve the efficiency of the algorithm.

2. Feature extraction: Features are extracted from the preprocessed reviews. The
features can include the frequency of each word or the presence of specific phrases.
The features are used as inputs to the Gaussian Naive Bayes algorithm.

3. Model training: The Yelp dataset is split into training and testing sets. The
training set is used to train the Gaussian Naive Bayes algorithm. The algorithm
calculates the probability of a review being fake or genuine based on the input
features.

4. Model testing: The testing set is used to evaluate the performance of the
trained model. The model's performance is evaluated using metrics such as accuracy,
precision, recall, and F1-score.

5. Model tuning: The model is tuned by adjusting the hyperparameters of the
Gaussian Naive Bayes algorithm. The hyperparameters can include the variance
smoothing factor and the class prior probabilities. The per-class mean and variance
of each feature are not hyperparameters; they are estimated from the training set.

6. Prediction: The trained Gaussian Naive Bayes algorithm is used to predict
whether a review is fake or genuine. The system outputs the prediction result for
each review in the dataset.

7. Visualization: The results of the system can be visualized using charts and
graphs to provide insights into the performance of the system. The visualization can
include confusion matrices, precision-recall curves, and ROC curves.
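The model tuning step in the list above can be illustrated with scikit-learn, whose GaussianNB class exposes the parameters var_smoothing and priors. The sketch below is a hedged example, not the project's code: the feature matrix is randomly generated stand-in data, and the candidate values and the fixed priors are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Invented numeric features (for example, word counts) for 40 reviews in 2 classes.
rng = np.random.default_rng(0)
X_genuine = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
X_fake = rng.normal(loc=1.5, scale=1.0, size=(20, 5))
X = np.vstack([X_genuine, X_fake])
y = np.array([0] * 20 + [1] * 20)  # 0 = genuine, 1 = fake

# Try several smoothing values with 5-fold cross-validation, scored by F1.
candidates = [1e-9, 1e-6, 1e-3, 1e-1]
search = GridSearchCV(GaussianNB(), {"var_smoothing": candidates}, scoring="f1", cv=5)
search.fit(X, y)
best_smoothing = search.best_params_["var_smoothing"]
print("best var_smoothing:", best_smoothing, "mean F1:", round(search.best_score_, 3))

# Class priors can be fixed by hand instead of being estimated from the labels,
# for example when fake reviews are believed to be rarer than in the training set.
model = GaussianNB(priors=[0.8, 0.2], var_smoothing=best_smoothing).fit(X, y)
print("class priors used:", model.class_prior_)
```

Cross-validation is used here so that the held-out testing set is not consumed by the tuning step.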

The system design for detecting fake reviews using the Yelp dataset and Gaussian
Naive Bayes algorithm is modular and can be easily modified to incorporate
additional features or algorithms for improved performance. The design is scalable
and can handle large amounts of review data efficiently. Overall, the system design
provides a reliable and accurate solution for detecting fake reviews in the Yelp
dataset.

5.1 System Architecture:

Fig 5.1 System Architecture

5.2 Architecture Description


The system architecture for detecting fake reviews using the Yelp dataset and
Gaussian Naive Bayes algorithm can be broken down into the following
components:

1. Yelp dataset: The Yelp dataset is a rich source of review data that includes
user reviews, ratings, and metadata for businesses in various cities. The dataset is
preprocessed by removing stop words, stemming the words, and converting them to
lowercase.

2. Feature extraction: Features are extracted from the preprocessed reviews, such
as the frequency of each word or the presence of specific phrases. These features are
used as inputs to the Gaussian Naive Bayes algorithm.

3. Training and testing sets: The Yelp dataset is split into training and testing
sets. The training set is used to train the Gaussian Naive Bayes algorithm, while the
testing set is used to evaluate the model's performance.

4. Gaussian Naive Bayes algorithm: The Gaussian Naive Bayes algorithm is a
simple yet effective algorithm for text classification tasks. The algorithm is based on
Bayes' theorem and assumes that the features are conditionally independent given the
class and normally distributed within each class, which keeps the model simple and
fast to train on text-derived features.

5. Performance evaluation: The performance of the model is evaluated using
metrics such as accuracy, precision, recall, and F1-score. The evaluation provides
insights into the effectiveness of the system in detecting fake reviews.

6. Prediction: The trained Gaussian Naive Bayes algorithm is used to predict
whether a review is fake or genuine. The system outputs the prediction result for
each review in the dataset.
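To make the Gaussian Naive Bayes and prediction components above more concrete, the following is a minimal from-scratch sketch in plain Python. It is not the project's implementation: the function names, the two example features, and the additive smoothing constant are assumptions made for illustration (scikit-learn, for instance, scales its smoothing term by the largest feature variance instead).

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, smoothing=1e-9):
    """Estimate the prior and the per-feature mean and variance of each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + smoothing
            for col, m in zip(columns, means)
        ]
        model[label] = {"prior": n / len(X), "means": means, "variances": variances}
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one feature vector."""
    best_label, best_score = None, -math.inf
    for label, params in model.items():
        score = math.log(params["prior"])
        for x, mean, var in zip(row, params["means"], params["variances"]):
            # Log of the Gaussian density of feature value x under this class.
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented feature vectors: [number of exclamation marks, number of superlatives].
X = [[5, 4], [6, 5], [4, 6], [0, 1], [1, 0], [1, 1]]
y = ["fake", "fake", "fake", "genuine", "genuine", "genuine"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [5, 5]))  # close to the "fake" examples
print(predict_gaussian_nb(model, [0, 0]))  # close to the "genuine" examples
```

Working with log-probabilities instead of multiplying raw probabilities avoids numerical underflow when there are many features.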

The system architecture for detecting fake reviews using the Yelp dataset and
Gaussian Naive Bayes algorithm is designed to be scalable and easy to implement.
The system can handle large amounts of review data and can be modified to
incorporate additional features or algorithms for improved performance. Overall, the
system architecture provides a reliable and accurate solution for detecting fake
reviews in the Yelp dataset.

5.3 Use-case Diagram


A use case diagram at its simplest is a representation of a user's interaction with the
system that shows the relationship between the user and the different use cases in
which the user is involved. In our system, the user interacts with use cases such as
Send Query and Show Result.

Fig 5.3 Use Case Diagram

5.4 Sequence Diagram:


A sequence diagram shows object interactions arranged in time sequence. It depicts
the objects and classes involved in the scenario and the sequence of messages
exchanged between the objects needed to carry out the functionality of the scenario.

Fig 5.4 Sequence Diagram

5.5 Data Preprocessing-Flow Diagram


The data preprocessing module takes the information about the hotel the user
searched for from the data scraping module and cleans it using methods such as
word tokenization, stop-word removal, and feature selection.
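A minimal sketch of such a cleaning step, using only the Python standard library, is given below. The stop-word list is a short illustrative sample, the function names are hypothetical, and keeping the most frequent terms is only one simple way to do feature selection; the report does not specify which method the module uses.

```python
import re
from collections import Counter

# Short illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "for", "in", "is", "it",
    "of", "on", "the", "to", "was", "were", "with",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text):
    return remove_stop_words(tokenize(text))


def select_top_terms(token_lists, k):
    """Simple feature selection: keep the k most frequent terms overall."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return [term for term, _ in counts.most_common(k)]


cleaned = preprocess("The room was CLEAN, and the staff were helpful!")
print(cleaned)

corpus = [preprocess("The room was clean"), preprocess("Clean room, great staff")]
vocabulary = select_top_terms(corpus, 2)
print(vocabulary)
```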

Fig 5.5 Preprocessing Module

5.6 Comparison result


After preprocessing the data, we apply our algorithm to estimate the percentage of
fake reviews or ratings and to calculate the actual rating of the hotel the user
searched for.
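The calculation can be illustrated with a small worked example. It assumes that the "actual rating" is the average rating of the reviews classified as genuine; the report does not give the exact formula, so this definition, the function name, and the sample data are assumptions.

```python
def summarise_hotel(reviews):
    """reviews: list of (rating, is_fake) pairs for one hotel.

    Returns the percentage of reviews flagged as fake and the average
    rating of the remaining (genuine) reviews, or None if none remain.
    """
    total = len(reviews)
    genuine_ratings = [rating for rating, is_fake in reviews if not is_fake]
    fake_count = total - len(genuine_ratings)
    fake_percentage = 100.0 * fake_count / total if total else 0.0
    actual_rating = (
        sum(genuine_ratings) / len(genuine_ratings) if genuine_ratings else None
    )
    return fake_percentage, actual_rating


# Invented example: two five-star reviews were flagged as fake by the classifier.
reviews = [(5, True), (5, True), (4, False), (3, False), (2, False)]
fake_percentage, actual_rating = summarise_hotel(reviews)
print(f"fake reviews: {fake_percentage:.0f}%  actual rating: {actual_rating:.1f}")
```

In this example the displayed average over all five reviews would be 3.8, while the rating based on the three genuine reviews is 3.0, with 40% of the reviews flagged as fake.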

Fig 5.6 Result Module

CHAPTER 6
RESULTS AND DISCUSSION

6.1 Data Loading

Fig 6.1 Data Loading

6.2 Data Preprocessing

Fig 6.2 Data Preprocessing

6.3 Training

Fig 6.3 Training

6.4 Performance Evaluation

Web Application

6.4.1 Home Page

Fig 6.4.1 Home Page

6.4.2 Admin Login

Fig 6.4.2 Admin Login

6.4.3 Admin dashboard

Fig 6.4.3 Admin dashboard

6.4.4 Add Hotel

Fig 6.4.4 Add Hotel


6.4.5 List of Hotels

Fig 6.4.5 List of hotels

6.4.6 Customer Signup

Fig 6.4.6 Customer signup

6.4.7 Customer Login

Fig 6.4.7 Customer Login

6.4.8 Customer Dashboard

Fig 6.4.8 Customer dashboard

6.4.9 Review Page

Fig 6.4.9 Review page

6.4.10 Customer review

Fig 6.4.10 Customer review

6.4.11 Result page

Fig 6.4.11 Result page

CHAPTER 7
CONCLUSION AND FUTURE WORK
CONCLUSION
In conclusion, the online fake review detection system using the Yelp dataset and
Gaussian Naive Bayes algorithm is an effective solution for detecting fake reviews
on online platforms. The system can help online businesses to maintain their
reputation and protect their customers from fraudulent activities. The system is
designed to preprocess the Yelp dataset, extract features, train and test the Gaussian
Naive Bayes algorithm, and predict whether a review is fake or genuine. The system
can be easily scaled and modified to incorporate additional features or algorithms
for improved performance. A Flask web application can also be developed to provide
a user-friendly interface for the system. The application can accept user inputs in the
form of reviews and provide real-time predictions on their authenticity. Overall, the
online fake review detection system is a valuable tool for online businesses to
maintain their integrity and reputation and ensure that customers can make informed
decisions based on genuine reviews.

FUTURE WORK
There are several areas of future work for the online fake review detection system
using the Yelp dataset and Gaussian Naive Bayes algorithm. Some of the possible
future work areas are:
1. Incorporating more advanced machine learning algorithms: Although the
Gaussian Naive Bayes algorithm is effective for detecting fake reviews, there are
more advanced machine learning algorithms that can improve the system's accuracy.
For example, ensemble methods such as Random Forest and Gradient Boosting can
be used to combine multiple models and improve the prediction accuracy.
2. Handling more complex review data: The system can be further improved to
handle more complex review data, such as reviews containing images and videos.
Natural Language Processing techniques can be used to extract features from the
text, and Computer Vision techniques can be used to analyze images and videos.
3. Developing a mobile application: A mobile application can be developed to
provide a more convenient way for users to access the system. The application can
allow users to scan QR codes or take photos of reviews and receive real-time
predictions on their authenticity.
4. Expanding the dataset: The system's performance can be further improved by
expanding the Yelp dataset or using other datasets from different online platforms.
The larger and more diverse the dataset, the more accurate the system's predictions
are likely to be.
5. Improving the user interface: The Flask web application can be further
improved by adding more user-friendly features, such as a search bar, filtering
options, and visualizations of the system's performance.

REFERENCES
1. Jindal, N., & Liu, B. (2008). Opinion spam and analysis. In Proceedings of
the international conference on web search and data mining (pp. 219-230).

2. Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. (2011). Finding deceptive
opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language
Technologies (pp. 309-319).
3. Mukherjee, A., Liu, B., & Glance, N. (2012). Spotting fake reviewer
groups in consumer reviews. In Proceedings of the 21st international conference on
World Wide Web (pp. 191-200).
4. Zhang, Y., Sun, A., & Zhang, J. (2016). Detecting fake online reviews using
generative model. In Proceedings of the 2016 IEEE/WIC/ACM international
conference on web intelligence (pp. 703-706).
5. Chen, Y., Huang, S., & Xu, W. (2019). Detecting fake reviews using a
convolutional neural network. Information Processing & Management, 56(6), 1526-
1538.
6. Yang, J., Li, W., Yu, C., & Luo, X. (2020). A multi-feature fusion approach
for fake review detection. Information Sciences, 507, 1-17.

7. Luo, W., & Li, Y. (2019). A hybrid method for fake review detection. Journal
of Computational Science, 30, 8-16.
8. Wang, X., Li, J., Li, J., & Liu, Y. (2017). A feature-based method for detecting
fake reviews. Journal of Computer Science and Technology, 32(5), 943-955.
9. Fornaciari, R., Guidotti, R., & Zappella, G. (2019). Detecting fake reviews: A
deep learning approach. Information Sciences, 481, 422-441.
10. Lu, C., Hu, Y., Wang, X., & Zhang, H. (2020). An ensemble model for fake
review detection in online social media. Information Sciences, 507, 171-184.
11. Li, Y., Li, Y., Li, Z., Li, Z., & Zhou, X. (2020). A survey of fake review
detection methods. Information Processing & Management, 57(1), 102117.

12. Zhang, Y., Jin, C., Shi, H., & Liu, Y. (2020). Detection of fake reviews via a
novel feature set and hybrid feature selection. Information Processing &
Management, 57(1), 102116.
13. Yin, X., Feng, F., Hao, Y., & Li, Q. (2019). A comparative study of supervised
learning algorithms for fake review detection. Journal of Ambient Intelligence and
Humanized Computing, 10(4), 1657-1669.
14. Rayana, S., & Akoglu, L. (2015). Collective opinion spam detection: Bridging
review networks and metadata. In Proceedings of the 21st ACM SIGKDD international
conference on knowledge discovery and data mining (pp. 985-994).
15. Zhang, Y., & Zhang, J. (2016). Detecting fake reviews using discriminant
analysis. Information Processing & Management, 52(2), 291-303.
