
FRAUD APP DETECTION

A MAJOR PROJECT REPORT


Submitted in partial fulfillment of the requirement for the award of the degree of

B.TECH

IN

COMPUTER SCIENCE & ENGINEERING

Submitted By:
Ms. Navodita Singh - 0101CS191069
Ms. Shalni Shau - 0101CS203D08
Ms. Ritu Chhedam - 0101CS191094
Ms. Yachna Sahu - 0101CS203D12

Guided By:
Dr. Anjana Deen
Dr. Raju Baraskar

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


UNIVERSITY INSTITUTE OF TECHNOLOGY
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA, BHOPAL - 462033
SESSION 2022-2023

UNIVERSITY INSTITUTE OF TECHNOLOGY
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA,

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE
This is to certify that Navodita Singh, Shalni Shau, Ritu Chhedam, and Yachna Sahu of B.TECH. Final Year, Computer Science & Engineering, have completed their Major Project entitled “Fraud App Detection” during the year 2022-2023 under our guidance and supervision. We approve the project for submission in fulfillment of the requirement for the award of the degree of B.TECH. in Computer Science & Engineering.

Dr. Anjana Deen Dr. Raju Baraskar


Project Guide Project Guide

Prof. Uday Chourasia


HOD, CSE
UIT RGPV Bhopal

DECLARATION BY STUDENTS

We hereby declare that the work presented in the Major Project entitled “Fraud App Detection”, submitted in partial fulfillment of the requirement for the award of the Bachelor's degree in Computer Science and Engineering, has been carried out at University Institute of Technology RGPV, Bhopal, and is an authentic record of our work carried out under the guidance of Dr. Anjana Deen and Dr. Raju Baraskar, Department of Computer Science and Engineering, UIT RGPV, Bhopal. The matter in this project has not been submitted by us for the award of any other degree.

Signatures
Ms. Navodita Singh –0101CS191069
Ms. Shalni Shau-0101CS203D08
Ms. Ritu Chhedam –0101CS191094
Ms.Yachna Sahu-0101CS203D12

ACKNOWLEDGEMENT

After the completion of the major project work, words are not enough to express our feelings about all those who helped us reach our goal; above all, we are indebted to the Almighty for providing us this moment in life. First and foremost, we take this opportunity to express our deep regards and heartfelt gratitude to our project guides, Dr. Anjana Deen and Dr. Raju Baraskar, for their inspiring guidance and timely suggestions in carrying out our project successfully. They have also been a constant source of inspiration for us.

We are extremely thankful to Prof. Uday Chourasia, Head, Department of Computer Science and Engineering, UIT RGPV, Bhopal, for his cooperation and motivation during the project. We would also like to thank all the teachers of our department for providing invaluable support and motivation. We are also grateful to our friends and colleagues for their help and cooperation throughout this work.

Ms. Navodita Singh -0101CS191069


Ms. Shalni Shau-0101CS203D08
Ms. Ritu Chhedam –0101CS191094
Ms. Yachna Sahu - 0101CS203D12

TABLE OF CONTENTS

TITLE PAGE NO.

CHAPTER- 1

1. Introduction .................................................................................... 1-5


1.1 Introduction.................................................................. 1-5
1.2 Motivation………………………………………………….6
1.3 Objective……………................................................... 6

CHAPTER-2

2. Literature Survey ......................................................................... 7


2.1 Literature Survey ................................................. 8
2.2 Problem Statement ................................................ 9
2.3 Proposed Work .................................................... 10

CHAPTER -3

3. Methodology Used……………………………….11-22
3.1 Long Short-Term Memory ..................................... 11-12
3.2 Module Description .................................................. 12-15.
3.3 Algorithm Used ...................................................... 16-18
3.4 Flow Diagram of Proposed Fraud App Detection ........... 19
3.5 Data Processing ....................................................... 20-22

CHAPTER- 4

4.Technology Used............................................................................. 23-38


4.1 Python Language ....................................................... 23-24
4.2 Advantages of Python ............................................... 24-26
4.3 Disadvantages of Python ........................................... 26-27
4.4 Applications of Machine Learning .............................. 27-31
4.5 Installation of Python on Windows ............................ 32-38

CHAPTER -5

5. Results and Discussions ................................................................. 39-50


5.1 Screen Shots ................................................... 39-50

CHAPTER -6

6. Conclusions and Future Work .......................................................... 51

CHAPTER -7

7. Bibliography .................................................................................. 52-54

ABSTRACT

In today's software world there are thousands of fake apps on the Play Store and the Apple App Store. It is therefore very difficult to identify the genuineness of an application, and users sometimes unknowingly install fake apps. These fake apps can steal important information from the device, so it is very important to identify genuine applications. The aim of this project is to develop software that identifies genuine applications on the Play Store and Apple App Store and thereby helps users. The objective is to develop a system that detects fraudulent apps before the user downloads them, by using sentiment analysis. Sentiment analysis helps determine the emotional tone behind words expressed online. The method is useful in monitoring social media and gives a brief idea of public opinion on certain issues. Users cannot always find correct or true reviews about a product on the internet. This project checks users' sentiment-bearing comments on multiple applications. The reviews may be fake or genuine. By analyzing the ratings and reviews together, involving both user and admin comments, we can determine whether an app is genuine or not. Using sentiment analysis, the machine is able to learn and analyze the sentiments and emotions in reviews and other texts. The manipulation of reviews is one of the key aspects of app ranking fraud. We have used an LSTM model to predict the results.

CHAPTER- 1
INTRODUCTION
1.1 Introduction
Fraudulent activities have become a major concern for businesses of all sizes, causing billions
of dollars in losses every year. Fraud detection is the process of identifying and preventing
fraudulent activities by analyzing patterns in data. With the increasing use of technology in
business operations, fraudsters are finding new ways to conduct fraudulent activities, making
it necessary to employ sophisticated fraud detection methods.
Sentiment is an emotion or attitude brought on by the customer's feelings. Sentiment analysis is also called opinion mining because it uses user reviews to determine how well-liked an app is. Sentiment analysis is a step in the machine learning process. [1] Knowledge is acquired, processed, and then classified as either positive or negative depending on how it is felt. People frequently inquire about other users' reviews of an app before making a purchase. [2] The process of sentiment analysis uses natural language processing to collect and examine the opinion or sentiment of a sentence. It is popular because many people prefer to take advice from other users. As the number of opinions in the form of reviews, blogs, etc. increases continuously, it is beyond the reach of manual techniques to analyze such a huge amount of reviews and to aggregate them into an efficient decision. Sentiment analysis turns these tasks into automated processes with little user support. [3] It is not always possible to have one technique that fits every case, because different types of sentences express sentiments or opinions in different ways. Sentiment words (also called opinion words, e.g., great, beautiful, bad) cannot by themselves distinguish an opinion sentence from a non-opinion one. A conditional sentence may contain many sentiment words or clauses but express no opinion. Conditional sentences have unique characteristics that make it hard to determine the orientation of sentiments on topics or features in such sentences. By sentiment orientation, we mean positive, negative, or neutral opinions. Conditional sentences describe implications or hypothetical situations and their consequences. In English, a variety of conditional connectives can be used to form these sentences. A conditional sentence contains two clauses, the condition clause and the consequent clause, that depend on each other. Their relationship has significant implications for whether the sentence describes an opinion. [4]
As there are more than a million apps on the App Store, there is intense competition between apps to reach the top of the leader board on the basis of popularity, since the leader board is the most important way of promoting an app. A higher rank on the leader board leads to a huge number of downloads and millions of dollars of profit. Apps are advertised to promote them on the leader board, and many apps use fraudulent means to boost their ranking on the leader board of the App Store. There are various ways of inflating the downloads and ranking of an app, often carried out by "bot farms" or "human water armies"; human water armies are groups of internet ghostwriters who are paid to post fake reviews. An app is judged to be fraudulent on the basis of three parameters: the ranking, rating, and reviews of the app. In ranking-based evidence we check the historical ranking of the app, which goes through three phases: the rising phase, the maintaining phase, and the recession phase. The app's ranking rises to a peak position on the leader board (the rising phase), stays at the peak position (the maintaining phase), and finally decreases until the end of the event (the recession phase). The reviews are taken from the dataset and converted into tokens on which sentiment analysis is performed.
The quality, ratings, and reviews left by customers play the most important role in whether a particular app gets downloaded. Sometimes, however, developers mislead users about their applications or maliciously use them as a malware distribution platform. Occasionally developers simply hire teams of workers who commit fraud by posting false opinions and ratings for an application; this is known as crowdturfing. It is therefore important to make sure that, before installing an app, users are provided with correct and truthful comments so that nothing goes wrong. An automated solution is needed to systematically analyze the various opinions and measurements provided for each application, because it is difficult for a user scrolling through comments to decide whether the ratings they see are fraudulent or genuine. Thus, we propose a system that detects malicious applications on Google Play or the App Store by giving a complete overview of fraud detection based on a rating system. By combining data mining and sentiment analysis, we have a higher probability of obtaining real reviews. We suggest a program that takes reviews from registered users for one or more products and classifies them as positive or negative. This can help determine whether an application is fraudulent and thereby ensure mobile security. We check three forms of evidence: ranking-based, rating-based, and review-based evidence, combined with statistical hypotheses. Nevertheless, this evidence may be affected by the reputation of the application developer and by genuine marketing efforts.

In this project we mainly focus on whether the reviews given by users are genuine. Users can use the application by simply signing up, writing their review in the review section, and entering the name of the app in the app-name section. The admin can then check the reviews in the review section, which shows all the reviews written by the different reviewers, and the chart section shows the rating scale of each app based on those reviews. Reviewers can also check, against the dataset, what type of review they have written.
Researchers have developed ranking fraud detection systems for mobile apps. Specifically, they showed that ranking fraud happens in the leading sessions and provided a method for mining the leading sessions of each app from its historical ranking records. They then identified ranking-based and rating-based evidence for detecting ranking fraud. Moreover, they proposed an optimization-based aggregation method to integrate all the evidence for evaluating the credibility of the leading sessions of mobile apps. A unique perspective of this approach is that all the evidence can be modeled by statistical hypothesis tests, so it can easily be extended with other evidence from domain knowledge to detect ranking fraud. Finally, the proposed system was validated with extensive experiments on real-world app data collected from the App Store, and the experimental results showed the effectiveness of the approach. [15] Another objective is fraud application detection using fuzzy logic to differentiate the actual fraudulent apps. That proposed system classifies apps and detects whether they belong to the good, bad, neutral, very good, or very bad group; different class values and threshold values give different accuracies and execution times. [16] Sentiment analysis is a major task of natural language processing. The data used as input are online app reviews. The objective content of the sentences is removed and the subjective content, which consists of sentiment sentences, is extracted. In NLP, part-of-speech (POS) taggers are developed to classify words based on POS. Adjectives and verbs can convey the opposite sentiment with the help of negative prefixes. A sentiment score is computed for all sentiment tokens.
Information is gathered and analyzed to determine whether the sentiment it expresses is positive or negative.
Fraud detection software is designed to identify suspicious patterns and activities in financial
transactions, customer behavior, and other data sources. The software uses advanced
algorithms and machine learning techniques to analyze large volumes of data and detect
anomalies that may indicate fraudulent activities.
The importance of fraud detection cannot be overstated, as it helps organizations to prevent
financial losses, protect their reputation, and maintain customer trust. This paper aims to review the current state-of-the-art techniques in fraud detection, including machine learning,
data mining, and statistical analysis, and how they can be applied to various industries and
business functions. Additionally, we will discuss the challenges involved in fraud detection
and the best practices for implementing a successful fraud detection system.
Fraudulent activities have always been a concern for businesses of all sizes and industries.
The proliferation of technology has made it easier for fraudsters to carry out their malicious
activities, which is why detecting fraud has become more important than ever. Fraudulent
activities can cause significant financial loss, damage to reputation, and even legal
consequences.

One of the most effective ways to prevent fraud is to use fraud detection software. Fraud
detection software uses advanced algorithms and machine learning techniques to analyze large
amounts of data and identify suspicious patterns or activities. By identifying potential fraud
in real-time, businesses can take immediate action to prevent further damage.

In recent years, there has been a significant increase in the number of fraud detection solutions
available on the market. These solutions vary in their approach, complexity, and effectiveness.
Choosing the right solution for a particular business can be a challenging task, but it is crucial
to ensure that the solution can effectively detect and prevent fraudulent activities.

The purpose of this paper is to provide an overview of the current state of fraud detection software.
Specifically, we will explore the different types of fraud detection software available, their
strengths and weaknesses, and the key factors to consider when selecting a solution for a
particular business. By the end of this paper, readers should have a clear understanding of the
different types of fraud detection software available and be able to make informed decisions
when selecting a solution for their business.

In recent years, mobile devices have become an integral part of our lives, providing us with a wide range of functionalities and services. As the usage of mobile devices increases, so does the risk of cyber threats, including fraudulent apps that pose a serious threat to users' privacy and security. Fraudulent apps are designed to trick users into thinking that they are legitimate, but in reality they are malicious and can harm users' devices and steal their personal data.

To counteract this threat, researchers and developers have devised various approaches and techniques to detect fraudulent apps. These approaches include static and dynamic analysis, machine learning, and behavioral analysis, among others. However, detecting fraudulent apps is a challenging task, as attackers continuously develop new methods to evade detection and stay ahead of security measures.

1.2 Motivation:
In the modern computing world, the use of the internet is increasing day by day. New types of fraud occur every day, and it is not easy to detect and prevent fraudulent apps effectively. One common method of app fraud involves submitting a large number of reviews on apps and websites to create willingness among users. Our major task is to distinguish fraudulent and genuine applications by applying sentiment analysis to review data using machine learning techniques.

1.3 Objective:

The objectives of the development of the Fraud App Detection system are to:


1. Analyze the features of fraudulent apps and identify key indicators of fraud.
2. Select and implement appropriate machine learning algorithms for fraud detection.
3. Evaluate the performance of the developed fraud detection system using a dataset of
known fraudulent and legitimate apps.

CHAPTER- 2
LITERATURE SURVEY
2.1 Literature Survey:
Fraudulent mobile applications have become a growing concern due to the increase in mobile
usage and the rise of mobile commerce. Several techniques have been proposed in the literature
to detect fraudulent mobile applications. In this literature survey, we discuss some of the recent
techniques proposed for detecting fraud in mobile applications.

One of the most widely used techniques for detecting fraudulent mobile applications is static
analysis. Static analysis involves analyzing the code of an application without actually executing
it. This technique can be used to detect malicious code, such as code that accesses sensitive
information or performs unauthorized actions. Researchers have proposed various static analysis
techniques for detecting fraudulent mobile applications. For example, Wang et al. proposed a
technique that uses machine learning to analyze the code of mobile applications and detect
malicious behavior [1].

Dynamic analysis is another technique that is commonly used for detecting fraudulent mobile
applications. Dynamic analysis involves executing an application in a controlled environment
and monitoring its behavior. This technique can be used to detect malicious behavior that is not
evident in the code. For example, Kao et al. proposed a technique that uses dynamic analysis to
detect mobile applications that steal user information [2].

In addition to static and dynamic analysis, researchers have also proposed other techniques for
detecting fraudulent mobile applications. For example, Li et al. proposed a technique that uses
user behavior to detect fraudulent mobile applications [3]. The authors analyzed the behavior of
users who had installed a fraudulent application and identified patterns that could be used to
detect similar applications.
Another approach for fraud app detection is based on user behavior analysis. This technique
focuses on monitoring and analyzing the behavior of users when they interact with an app. This involves collecting data such as the user's location, the time spent on the app, the frequency of
app usage, and the types of actions performed in the app. By analyzing this data, it is possible to
identify suspicious behavior patterns, such as abnormal usage patterns or sudden changes in
behavior.

One example of this approach is the work by Xu et al. [4] which proposes a framework for
detecting mobile app fraud based on user behavior analysis. The framework collects various data
points, such as app usage frequency, user location, and device information, and applies machine
learning algorithms to identify patterns that are indicative of fraudulent behavior. The authors
report promising results, with the framework achieving a fraud detection accuracy of 92%. In
addition to the above approaches, researchers have also proposed various other techniques for
fraud app detection, such as anomaly detection [5], network-based analysis [6], and reputation-
based analysis [7].

Additionally, machine learning techniques such as neural networks, decision trees, and support vector machines have been employed in various fraud app detection systems ([8]; Lee and Kim [9]; Alharbi et al. [10]). These approaches are based on the analysis of various features of the apps, such as permissions requested, resource usage, and user reviews. By learning from the patterns in these features, the machine learning models can accurately classify fraudulent apps. Recent research has also explored the use of blockchain technology in fraud app detection (Huang et al. [11]; Wijaya et al. [12]). Blockchain provides a decentralized and secure system for app transactions, which can improve the transparency and accountability of app developers. By incorporating blockchain into the app verification process, fraudulent apps can be detected and prevented more effectively.
Another study conducted by Bhattacharya and colleagues [13] proposed a machine learning-
based approach for detecting mobile app fraud. They developed a hybrid model that combined
two machine learning algorithms, namely, Decision Tree (DT) and Artificial Neural Network
(ANN), for fraud detection. The DT algorithm was used to identify the initial set of rules to
differentiate between legitimate and fraudulent apps, while the ANN was used to make final
predictions. The proposed model was evaluated on a dataset of 4000 apps, and it achieved an
accuracy of 92.75% and an F1-score of 0.92.

2.2 Problem Statement:
There are many challenges in fraud app detection, as follows:
1. Changing fraud patterns over time - This is very difficult to deal with, as fraudsters are always looking for new and innovative ways to get around existing safeguards and commit fraud. It is therefore very important that the learning models are updated with newly recognized patterns; otherwise the efficiency and effectiveness of the model decreases. Machine learning models therefore need to be updated constantly, or they fail to achieve their goals.
2. Class imbalance - Only a small percentage of customers have fraudulent intentions. As a result, the classes in fraud detection models (which typically classify activity as fraudulent or non-fraudulent) are imbalanced, which makes the models difficult to train and apply. A further cost of this challenge falls on genuine customers, as catching scammers often involves declining some legitimate activity.

3. Model interpretability - Models often give only a score that indicates whether an activity may be fraudulent or not, without explaining why.

4. Feature construction may be time-consuming - Data scientists may need a lot of time to create a comprehensive feature set, which delays the process of detecting fraud.

5. The APK file of a mobile application is uploaded to the web application. An APK parser is used to extract information about the application such as reviews, ratings, and historical records. Natural language processing is used to perform sentiment analysis on the reviews. By applying rules for the detection of fraudulent applications, the system generates graph results: if the rating count is greater than 3 it is considered a positive result, and if the rating count is less than 3 it is considered a negative result.
2.3 Proposed Work

The aim of this project is a system that identifies such fake applications on the Play Store or App Store. The project estimates the probability of an app being fake; we therefore present a system that uses four features - in-app purchases, whether the app contains ads, ratings, and reviews - to determine the probability that an app is scamming its consumers. The sole purpose of the proposed system is to review fraud detection for Google Play Store applications and to use the three-parameter method to differentiate fraudulent applications, commonly referred to as spam applications. Experimental analysis is performed on different types of methodology in the proposed manner for the detection of fraudulent or fake applications. The system detects fraud using three types of evidence: ad-based ratings, in-app purchases, and evidence-based reviews. In addition, an aggregation approach incorporates all three aspects to detect fraud. Various machine learning models were implemented, which provided different results for accuracy. Through this analysis we found that our proposed method provides 85% accuracy compared to other algorithms. The decision tree component performs better than other models such as plain natural language processing and sentiment analysis algorithms; it is an intuitive algorithm for classification problems and gives reliable real-time predictions, even for regression problems.

CHAPTER-3
METHODOLOGY USED

3.1 Long Short-Term Memory (LSTM):


LSTM stands for Long Short-Term Memory, and it is a type of recurrent neural network (RNN)
architecture used in deep learning.
LSTM networks are designed to address the vanishing gradient problem that can occur in
traditional RNNs. This problem arises when the gradients in backpropagation get smaller and
smaller as they propagate through the network, making it difficult for the model to learn long-
term dependencies.
LSTM networks are able to retain information over longer periods of time by using a system
of gates to selectively allow information to flow into and out of each cell. These gates include
an input gate, an output gate, and a forget gate, which are controlled by sigmoid activation
functions and can be trained to selectively remember or forget information.
By selectively retaining information over time, LSTM networks can be used in applications
such as natural language processing, speech recognition, and image captioning, among others.
The basic idea is to train an LSTM network to analyze the sentiment of a review and then use
the sentiment analysis output as one of the features for fraud detection.
Here are the high-level steps for this approach:

Data preprocessing: Collect a dataset of reviews, and preprocess the text to remove any
unnecessary information such as stop words, special characters, and punctuation. Then, label
each review as either positive or negative sentiment.
Train the LSTM model: Train an LSTM model on the labeled dataset to predict the sentiment
of a review. The input to the model will be the preprocessed text of the review, and the output
will be a binary sentiment label (positive or negative).

Feature engineering: Once the LSTM model is trained, use the output of the sentiment analysis
as one of the features for fraud detection. Combine it with other relevant features, such as the
user's rating history, purchase history, and other metadata.

Fraud detection: Finally, use the combined features to build a fraud detection model. This
model should be trained to identify fraudulent reviews by detecting patterns and anomalies in
the data. The sentiment analysis output can help to identify reviews that may be artificially
positive or negative.

The basic idea is to use the sequence of words in the text as input to an LSTM network, which
learns to capture the contextual relationships between words to make accurate sentiment
predictions. The network is trained on a large dataset of text data with labeled sentiment scores,
and then used to predict the sentiment of new, unseen text data.
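A minimal Python sketch of such a sentiment model is shown below. It assumes TensorFlow/Keras and two made-up example reviews; it is only an illustration of the idea, not the exact code used in this project.

# Minimal sentiment-model sketch (assumes TensorFlow/Keras; the reviews are hypothetical)
import numpy as np
from tensorflow.keras import layers, models

reviews = np.array(["great app, works perfectly", "total scam, stole my data"])
labels = np.array([1, 0])  # 1 = positive, 0 = negative

# Tokenize the reviews and pad them to a fixed length
vectorizer = layers.TextVectorization(max_tokens=10000, output_sequence_length=100)
vectorizer.adapt(reviews)
x = vectorizer(reviews)

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),  # word embeddings
    layers.LSTM(64),                                   # LSTM layer over the word sequence
    layers.Dense(1, activation="sigmoid"),             # sentiment probability between 0 and 1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=2, verbose=0)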

Input Text → Tokenization → Word Embedding → LSTM Layer → Output

Figure 4.1 Proposed diagram of Fraud App Detection

3.2 Module description


Input Text: The raw text data to be analyzed for sentiment.
The input text refers to the raw text data that is being analyzed for sentiment. This text data can
come from a variety of sources, including customer reviews, social media posts, news articles,
and product descriptions, among others.
Before the text can be used as input to an LSTM network, it must be preprocessed to remove
any unnecessary information such as stop words, special characters, and punctuation. The text
is then typically tokenized, which involves breaking down the text into individual words or
tokens.

The input text is then fed into the LSTM network one word at a time, with each word
representing a single time step in the sequence. The LSTM network learns to capture the
relationships between words in the text over time, and uses this information to make a
prediction about the sentiment of the text.
Here we take data from the Google Play Store API to analyze the emotions of users.

Tokenization: The process of breaking down the text data into individual words, or tokens.
Tokenization is the process of breaking down a text into individual words or tokens. The basic
idea is to split the text on whitespace characters such as spaces, tabs, and line breaks, and then
extract each sequence of non-whitespace characters as a token.
For example, suppose we have the following sentence:
The quick brown fox jumped over the lazy dog.
To tokenize this sentence, we can split it on whitespace characters, which gives us the following
tokens:
[The, quick, brown, fox, jumped, over, the, lazy, dog.]
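As a quick illustration, the same whitespace tokenization can be done in Python with a single split call (the actual pipeline may instead rely on a library tokenizer, for example from Keras or NLTK):

sentence = "The quick brown fox jumped over the lazy dog."
tokens = sentence.split()   # split on whitespace characters
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']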

Word Embedding: A technique used to represent each word as a high-dimensional vector. This
allows the LSTM network to capture the meaning and context of each word more accurately.
Word embedding is a technique used in natural language processing to represent words as
numerical vectors in a high-dimensional space. The idea behind word embedding is to capture
the meaning and context of each word in a way that is more easily interpretable by machine
learning algorithms.

There are several methods for creating word embeddings, but one of the most popular is the
Word2Vec algorithm. Word2Vec uses a neural network to learn the co-occurrence patterns of
words in a large corpus of text. The network is trained to predict the likelihood of a word
appearing in the context of other words, and the resulting weights of the hidden layer are used
as the word embeddings.
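The short sketch below illustrates this idea with gensim; gensim and the tiny corpus are assumptions for demonstration only, not the specific embedding setup used in this project.

from gensim.models import Word2Vec

corpus = [
    ["the", "app", "is", "great"],
    ["the", "app", "is", "a", "scam"],
]  # hypothetical tokenized reviews
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)
vector = w2v.wv["app"]   # 50-dimensional embedding vector for the word "app"
print(vector.shape)      # (50,)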

LSTM Layer: A sequence modeling layer that can capture the relationships between words in
the text data over time. This layer uses a set of gates to selectively retain or discard information
from previous time steps.

LSTM (Long Short-Term Memory) is a type of recurrent neural network that is commonly
used for sequence modeling and natural language processing tasks. An LSTM layer consists of
a series of LSTM cells, each of which has a set of learnable parameters that allow it to
selectively store and forget information over time.
To illustrate how an LSTM layer works, let's consider a simple example of sentiment analysis

on a sequence of words. Suppose we have the following input sequence:


I loved the movie, but the ending was disappointing.
We can tokenize this sequence into individual words, and then represent each word using a word
embedding vector. The resulting sequence of embedded vectors is then fed into an LSTM layer,
which processes each word in turn and outputs a final prediction about the sentiment of the
text.
At each time step, the LSTM layer receives an input vector representing the current word, as
well as a hidden state and cell state vector representing the accumulated information from the
previous words in the sequence. The LSTM cell then performs several operations:
Forget gate: The LSTM cell first decides which information to forget from the cell state vector,
based on the current input and the previous hidden state. This is done using a sigmoid activation
function, which outputs a value between 0 and 1 for each element of the cell state vector. Values
close to 0 indicate that the corresponding information should be forgotten, while values close
to 1 indicate that it should be retained.
Input gate: Next, the LSTM cell decides which new information to store in the cell state vector,
based on the current input and the previous hidden state. This is done using another sigmoid
activation function, which outputs a value between 0 and 1 for each element of the input vector.
Values close to 0 indicate that the corresponding information should be ignored, while values
close to 1 indicate that it should be added to the cell state.
Candidate value: The LSTM cell then creates a candidate value that could be added to the cell
state vector, based on the current input and the previous hidden state. This is done using a
hyperbolic tangent (tanh) activation function, which squashes the values to be between -1 and 1.
Update cell state: The LSTM cell updates the cell state vector by combining the forget gate, input gate, and candidate value, as follows:
cell_state = forget_gate * previous_cell_state + input_gate * candidate_value

Output gate: Finally, the LSTM cell decides which information to output as the current hidden state, based on the updated cell state and the current input. This is done using another sigmoid activation function, which outputs a value between 0 and 1 for each element of the cell state vector. Values close to 0 indicate that the corresponding information should be ignored, while values close to 1 indicate that it should be included in the output.
The resulting hidden state is then fed into the next LSTM cell in the sequence, along with the
next input vector. This process continues until all of the input vectors have been processed.

The output of the final LSTM cell can then be used to make a prediction about the sentiment
of the text. For example, in a binary sentiment analysis task, the output could be a single value
indicating the probability that the text is positive or negative.
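The gate equations described above can be written as a small NumPy sketch of a single LSTM cell step; the random weights are purely illustrative and are not a trained model.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(n_hid, n_in + n_hid)) for g in ("f", "i", "c", "o")}
b = {g: np.zeros(n_hid) for g in ("f", "i", "c", "o")}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])       # current input joined with previous hidden state
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i = sigmoid(W["i"] @ z + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate value
    c = f * c_prev + i * c_tilde            # updated cell state
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    h = o * np.tanh(c)                      # new hidden state
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))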

Fully Connected Layer: A layer of neurons that performs a weighted sum of the outputs from
the LSTM layer, and applies an activation function to generate a sentiment prediction.

Output: The final sentiment prediction, which is a score between 0 and 1 indicating the
probability that the text has a positive sentiment

3.3 Algorithms used in this project:
LSTM (Long Short-Term Memory) networks use a combination of several methods to process
sequential data and model complex temporal dependencies. Some of the key methods used in
LSTM include:

Recurrent Neural Networks (RNN): LSTMs are a type of RNN, which means that they are designed to handle sequential data by maintaining an internal state or "memory" that is updated at each time step. RNNs are a class of neural networks designed to handle sequential data. They are able to maintain an internal state, or "memory", that allows them to process sequences of variable length. This makes them well-suited to a wide range of tasks, including language modeling, machine translation, and speech recognition.

Decision tree classifier: Decision trees are a simple and interpretable classifier used for classification tasks. In the case of sentiment analysis, a decision tree can be used to classify the sentiment of a text based on features extracted from the output of the LSTM network.

A decision tree classifier is a type of supervised learning algorithm that uses a tree-like structure
to model decisions and their possible consequences. The algorithm works by recursively
splitting the data into subsets based on the values of one or more input features, with each split
resulting in a binary decision node in the tree. At the leaf nodes of the tree, a prediction is made
based on the majority class of the training examples that reach that node.

The decision tree algorithm can be described by the following steps:


Initialization: Set the root node to contain all training examples, and set the depth of the tree
to 0.

Termination condition: If all the examples at a given node belong to the same class, or if the depth of the tree exceeds a pre-defined maximum depth, then mark the node as a leaf and return the majority class label of the examples at that node.

Splitting criterion: Choose the feature and threshold that maximize some splitting criterion, such as information gain or Gini impurity.

Split the data: Partition the examples at the current node into two subsets based on the chosen
feature and threshold, and create two child nodes for the tree.

Recursion: Apply the decision tree algorithm recursively to each child node, using the remaining
features as input.

Stopping criterion: If the stopping criterion is met (e.g., all nodes are pure or the tree depth
exceeds the maximum allowed), then terminate the algorithm and return the resulting tree.

As an example, consider a dataset of emails, each labeled as either spam or non-spam. The
decision tree algorithm might start by splitting the data based on the presence or absence of
certain keywords in the email subject or body. For example, if the keyword "Fraud" is present,
the email is more likely to be spam. The algorithm would then recursively apply this splitting
process to each subset of the data, creating a tree that represents the decision-making process for
classifying emails as spam or non-spam.

Once the decision tree is constructed, it can be used to make predictions on new data by
traversing the tree from the root node to a leaf node, based on the values of the input features.
The majority class label of the training examples at the leaf node is then returned as the predicted
class label for the new example. Decision trees can be very interpretable, as the resulting tree
can be visualized and analyzed to gain insights into the decision-making process. However, they
can also be prone to overfitting and may not perform as well as other classifiers on complex
datasets.
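A brief scikit-learn sketch of such a decision tree classifier is given below; the two numeric features (average rating and review count) and the tiny dataset are hypothetical and only meant to show the API.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X = [[4.5, 120], [1.2, 5], [4.8, 300], [1.0, 2]]   # hypothetical app features
y = [0, 1, 0, 1]                                   # 0 = legitimate, 1 = fraud

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_test))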

Sequence Modeling: LSTMs are specifically designed to model sequential data, such as natural language text or time-series data. They can handle variable-length input sequences and output sequences of varying lengths, making them suitable for a wide range of tasks.

Backpropagation Through Time (BPTT): BPTT is the algorithm used to train LSTM
networks. It is a variant of backpropagation that is designed to handle the fact that the network's
internal state changes over time.

Gradient Clipping: Gradient clipping is a technique used to prevent exploding gradients during
training. In LSTM networks, this is particularly important because the gradients can accumulate
over time due to the recurrent connections.
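In Keras, for example, gradient clipping can be enabled directly through the optimizer (a usage sketch, not taken from this project's code):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)  # clip the gradient norm to 1.0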

Word Embedding: LSTMs often use word embedding to convert text data into numerical
vectors that can be processed by the network. Word embedding allows the network to capture
the semantic relationships between words in a more meaningful way than other encoding
methods, such as one-hot encoding.

Dropout: Dropout is a regularization technique used to prevent overfitting by randomly


dropping out some of the neurons in the network during training. This forces the network to
learn more robust representations of the input data.

The following machine learning procedures are used for developing the Fraud App Detection system.

Algorithm-1

# Data Preprocessing
Load and preprocess the dataset
Split the dataset into training and testing sets
# Decision Tree Training
Initialize the Decision Tree classifier
Train the Decision Tree classifier on the training dataset
# LSTM Training
Initialize the LSTM model and define the LSTM architecture
Compile the model with an appropriate loss function and optimizer
Train the LSTM model on the training dataset
Algorithm-2
# Fraud App Detection
For each app in the testing dataset:
    Extract relevant features from the app
    Pass the features to the LSTM model to obtain the LSTM output
    Pass the features to the Decision Tree classifier to obtain the Decision Tree output
    Combine the outputs from both models (e.g., weighted average)
    If the combined output exceeds a predefined threshold:
        Classify the app as a fraud app
    Else:
        Classify the app as a legitimate app
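A hedged Python sketch of Algorithm-2 is given below. The trained models and feature arrays are assumed to come from Algorithm-1, and the weights and threshold are illustrative choices, not values prescribed by this report.

def classify_app(text_features, tree_features, lstm_model, tree_model,
                 w_lstm=0.6, w_tree=0.4, threshold=0.5):
    p_lstm = float(lstm_model.predict(text_features, verbose=0)[0][0])  # fraud probability from the LSTM
    p_tree = float(tree_model.predict_proba(tree_features)[0][1])       # fraud probability from the tree
    score = w_lstm * p_lstm + w_tree * p_tree                           # weighted average of both outputs
    return "fraud app" if score > threshold else "legitimate app"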
Figure 4.2 Flow Diagram of Proposed Fraud App Detection

The workflow involving LSTM and RNN models, along with feature extraction using a decision tree, is as follows:

Data Preparation:
Collect a dataset of reviews, where each review is associated with a sentiment label (positive
or negative).
Perform any necessary data cleaning, such as removing duplicates or handling missing values.
Feature Extraction using Decision Tree:
Use a decision tree model to extract important features from the textual content of the reviews.
The decision tree can rank the features based on their importance scores, allowing you to select
the most relevant features for sentiment analysis.

LSTM-RNN Model:

Prepare the preprocessed reviews and selected features as input to the LSTM or RNN model. Convert the text data into numerical representations, such as word embeddings or TF-IDF vectors. If needed, you can also include the extracted features from the decision tree as additional input features to the LSTM-RNN model.

Model Training and Validation:


Split the dataset into training and validation/test sets.
Train the LSTM - RNN model using the training set, where the objective is to minimize the
loss (e.g., cross-entropy) between predicted and actual sentiment labels.
Validate the trained model using the validation/test set and evaluate its performance using
appropriate metrics like accuracy, precision, recall, or F1-score.
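The validation metrics mentioned above can be computed with scikit-learn, as in the short sketch below; the label arrays are placeholders.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0]   # placeholder ground-truth sentiment labels
y_pred = [1, 0, 0, 1, 0]   # placeholder model predictions
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))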

Prediction and Sentiment Analysis:


Apply the trained LSTM or RNN model to unseen reviews or a separate test dataset.
Obtain predictions for the sentiment labels of the reviews.
Analyze the sentiment predictions to gain insights into the sentiment distribution, identify
positive or negative reviews, or perform further analysis based on the specific goals of the
review analysis.
By combining feature extraction using a decision tree with LSTM or RNN models, you can
benefit from both the interpretability of the decision tree for feature selection and the ability of
LSTM or RNN models to capture sequential patterns and dependencies in the reviews. This
workflow can help in extracting relevant features and building a powerful sentiment analysis
system for reviews.

Data preprocessing
Data preprocessing is a crucial step in machine learning and data analysis tasks. It involves
transforming raw data into a format that is suitable for analysis and model training. Here are
some common preprocessing techniques:
Data Cleaning:
Handling missing values: Identify and handle missing data by either removing instances with
missing values, imputing missing values using statistical methods, or using advanced
imputation techniques.
Removing duplicates: Check for and remove duplicate records in the dataset.
Handling outliers: Detect and handle outliers, which are extreme values that may affect the
analysis or model performance. This can involve removing outliers or transforming them to
mitigate their impact.
Data Transformation:
Feature scaling: Scale numerical features to a similar range to avoid certain features dominating
others during model training. Common scaling techniques include min-max scaling
(normalization) and standardization.
Encoding categorical variables:
Convert categorical variables into numerical representations. This can involve one-hot
encoding, label encoding, or ordinal encoding, depending on the nature of the data and the
requirements of the model.
Text preprocessing:
For text data, perform techniques like tokenization, removing stopwords (commonly used
words with little semantic value), stemming or lemmatization (reducing words to their base
form), and handling special characters or punctuation.
Feature Engineering:
Creating new features: Derive additional features from existing ones that may capture relevant
information. For example, extracting date or time-related features from timestamps or
calculating ratios or percentages from numerical features.
Dimensionality reduction:
Reduce the number of features while preserving important information. Techniques such as
principal component analysis (PCA) or feature selection algorithms can be used for this
purpose

Data Splitting:
Splitting into training and test sets: Divide the dataset into training and evaluation/test sets. The
training set is used to train the model, while the test set is used to evaluate its performance.
A common split is 70-30.
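For illustration, a 70-30 split can be obtained with scikit-learn as follows (placeholder data):

from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]   # placeholder feature rows
y = [0, 1] * 5                 # placeholder labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)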

Feature Selection By Applying Decision Tree-


Train a Decision Tree Model:
Build a decision tree model using the entire dataset, including all features and the corresponding target variable.

Evaluate Feature Importance:
Decision trees provide a measure of feature importance based on how much each feature contributes to reducing impurity or splitting the data. This measure is typically computed using metrics such as Gini importance or information gain.

Rank Features:
Rank the features based on their importance scores, with higher scores indicating greater importance. This ranking can help identify the most influential features.
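The ranking step can be sketched with scikit-learn as follows; the feature names and values here are assumptions for illustration, not the project's actual feature set.

from sklearn.tree import DecisionTreeClassifier

feature_names = ["rating", "review_length", "install_count"]
X = [[4.5, 120, 10000], [1.2, 5, 50], [4.8, 300, 200000], [1.0, 2, 30]]
y = [0, 1, 0, 1]   # 0 = legitimate, 1 = fraud

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, tree.feature_importances_), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")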

CHAPTER-4
TECHNOLOGY USED

4.1 Python Language:


Python is an interpreted high-level programming language for general-purpose programming.
Created by Guido van Rossum and first released in 1991, Python has a design philosophy that
emphasizes code readability, notably using significant whitespace.
Python is interpreted − Python is processed at runtime by the interpreter. You do not need to compile your program before executing it. This is similar to Perl and PHP.
Python is interactive − You can actually sit at a Python prompt and interact with the interpreter directly to write your programs.
Python also acknowledges that speed of development is important. Readable and terse code is part of this, and so is access to powerful constructs that avoid tedious repetition of code. Maintainability also ties into this; code volume may be an all but useless metric, but it does say something about how much code you have to scan, read, and understand to troubleshoot problems or tweak behaviors. This speed of development, the ease with which a programmer of other languages can pick up basic Python skills, and the huge standard library are key to another area where Python excels: all its tools have been quick to implement, have saved a lot of time, and several of them have later been patched and updated by people with no Python background, without breaking.
Python is currently the most widely used multi-purpose, high-level programming language.

Python allows programming in object-oriented and procedural paradigms. Python programs are generally smaller than those written in other programming languages like Java. Programmers have to type relatively less, and the indentation requirement of the language keeps the code readable at all times.
Python language is being used by almost all tech-giant companies like – Google, Amazon,
Facebook, Instagram, Dropbox, Uber… etc.

The biggest strength of Python is its huge collection of libraries, which can be used for the following:
 Machine Learning

 GUI Applications (like Kivy, Tkinter, PyQt etc.)

 Web frameworks like Django (used by YouTube, Instagram, Dropbox)

 Image processing (like OpenCV, Pillow)

 Web scraping (like Scrapy, Beautiful Soup, Selenium)

 Test frameworks

 Multimedia

4.2 Advantages of Python:


Let’s see how Python dominates over other languages.
1. Extensive Libraries
Python ships with an extensive library that contains code for various purposes like regular expressions, documentation generation, unit testing, web browsers, threading, databases, CGI, email, image manipulation, and more. So, we don't have to write the complete code for that manually.

2. Extensible
As we have seen earlier, Python can be extended to other languages. You can write some of your code in languages like C++ or C. This comes in handy, especially in projects.

3. Embeddable
Complimentary to extensibility, Python is embeddable as well. You can put your Python code
in your source code of a different language, like C++. This lets us add scripting capabilities
to our code in the other language.

4. Improved Productivity
The language's simplicity and extensive libraries render programmers more productive than languages like Java and C++ do. It also helps that you need to write less to get more things done.

5. IOT Opportunities
Since Python forms the basis of new platforms like the Raspberry Pi, its future in the Internet of Things looks bright. This is a way to connect the language with the real world.

6. Simple and Easy


When working with Java, you may have to create a class to print ‘Hello World’. But in
Python, just a print statement will do. It is also quite easy to learn, understand, and code.
This is why when people pick up Python, they have a hard time adjusting to other more
verbose languages like Java.

7. Readable
Because it is not such a verbose language, reading Python is much like reading English.
This is the reason why it is so easy to learn, understand, and code. It also does not need
curly braces to define blocks, and indentation is mandatory. This further aids the readability of the code.

8. Object-Oriented
This language supports both the procedural and object-oriented programming
paradigms. While functions help us with code reusability, classes and objects let us
model the real world. A class allows the encapsulation of data and functions into one.

9. Free and Open-Source


Like we said earlier, Python is freely available. But not only can you download Python
for free, but you can also download its source code, make changes to it, and even
distribute it. It downloads with an extensive collection of libraries to help you with your
tasks.

Advantages of Python Over Other Languages

1. Less Coding
Almost all of the tasks done in Python require less coding than when the same task is done in other languages. Python also has awesome standard library support, so you don't have to search for third-party libraries to get your job done. This is the reason that many people suggest Python to beginners.

2. Affordable
Python is free, therefore individuals, small companies, or big organizations can leverage the freely available resources to build applications. Python is popular and widely used, so it gives you better community support.

3. Python is for Everyone


Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.
4.3 Disadvantages of Python

So far, we have seen why Python is a great choice for your project. But if you choose it, you should be aware of its drawbacks as well. Let us now see the downsides of choosing Python over another language.

1. Speed Limitations
We have seen that Python code is executed line by line. Because Python is interpreted, this often results in slow execution. This, however, is not a problem unless speed is a focal point of the project. In other words, unless high speed is a requirement, the benefits offered by Python are enough to outweigh its speed limitations.
2. Weak in Mobile Computing and Browsers
While it serves as an excellent server-side language, Python is much more rarely seen on the client side, and it is rarely used to implement smartphone-based applications. One such application is called Carbonnelle.

Need for Machine Learning: -

Machine learning can be described as data-driven decisions taken by machines, particularly to automate a process. These data-driven decisions can be used, instead of explicit programming logic, in problems that cannot be programmed inherently. The fact is that we cannot do without human intelligence, but we also need to solve real-world problems with efficiency at a huge scale. That is why the need for machine learning arises.
Challenges in Machine Learning: -

While machine learning is rapidly evolving and making significant strides in cybersecurity and autonomous cars, this segment of AI as a whole still has a long way to go. The reason is that ML has not yet been able to overcome a number of challenges. The challenges that ML currently faces are −

Quality of data − Having good-quality data for ML algorithms is one of the biggest challenges. Use of low-quality data leads to problems related to data preprocessing and feature extraction.

Time-consuming task − Another challenge faced by ML models is the consumption of time, especially for data acquisition, feature extraction, and retrieval.

Lack of specialist persons − As ML technology is still in its infancy, the availability of expert resources is limited.

No clear objective for formulating business problems − Having no clear objective and well-defined goal for business problems is another key challenge for ML, because this technology is not that mature yet.

Issue of overfitting and underfitting − If the model is overfitting or underfitting, it cannot represent the problem well.

Curse of dimensionality − Another challenge ML models face is too many features in the data points. This can be a real hindrance.

Difficulty in deployment − The complexity of an ML model makes it quite difficult to deploy in real life.
4.4 Applications of Machine Learning:
Machine learning is the most rapidly growing technology, and according to researchers we are in the golden years of AI and ML. It is used to solve many real-world complex problems which cannot be solved with the traditional approach. Following are the prerequisites and steps for getting started with building such ML applications −
(a) Learn Linear Algebra and Multivariate Calculus-

Both linear algebra and multivariate calculus are important in machine learning. However, the extent to which you need them depends on your role as a data scientist. If you are more focused on application-heavy machine learning, you will not be that heavily focused on mathematics, as there are many common libraries available. But if you want to focus on R&D in machine learning, then mastery of linear algebra and multivariate calculus is very important, as you will have to implement many ML algorithms from scratch.
(b) Learn Statistics-

Data plays a huge role in machine learning. In fact, around 80% of your time as an ML expert will be spent collecting and cleaning data, and statistics is the field that handles the collection, analysis, and presentation of data, so it is no surprise that you need to learn it. Some of the key concepts in statistics that are important are statistical significance, probability distributions, hypothesis testing, and regression. Bayesian thinking is also a very important part of ML, dealing with concepts like conditional probability, priors and posteriors, and maximum likelihood.
(c) Learn Python-

Some people prefer to skip linear algebra, multivariate calculus, and statistics and learn them as they go along with trial and error. But the one thing that you absolutely cannot skip is Python! While there are other languages you can use for machine learning, like R, Scala, etc., Python is currently the most popular language for ML. In fact, there are many Python libraries that are specifically useful for artificial intelligence and machine learning, such as Keras, TensorFlow, Scikit-learn, etc.
So, if you want to learn ML, it is best to learn Python first. You can do that using various online resources and courses, such as the Fork Python course available for free on GeeksforGeeks.
Step 2 – Learn Various ML Concepts

Now that you are done with the prerequisites, you can move on to actually learning ML (which is the fun part!). It is best to start with the basics and then move on to the more complicated topics. Take care at this stage, because small blunders can set off a chain of errors that go undetected for long periods of time; when they do get noticed, it takes quite some time to recognize the source of the issue, and even longer to correct it.
Python Development Steps: -

Guido van Rossum published the first version of Python code (version 0.9.0) at alt.sources in February 1991. This release already included exception handling, functions, and the core data types list, dict, str, and others. It was also object oriented and had a module system. Python version 1.0 was released in January 1994. The major new features included in this release were the functional programming tools lambda, map, filter, and reduce, which Guido van Rossum never liked. Six and a half years later, in October 2000, Python 2.0 was introduced. This release included list comprehensions, a full garbage collector, and Unicode support. Python flourished for another eight years in the 2.x versions before the next major release, Python 3.0 (also known as "Python 3000" and "Py3K"), appeared. Python 3 is not backwards compatible with Python 2.x. The emphasis in Python 3 was on the removal of duplicate programming constructs and modules, thus fulfilling or coming close to fulfilling the 13th law of the Zen of Python: "There should be one -- and preferably only one -- obvious way to do it." Some changes in Python 3.0:
• Print is now a function.
• Views and iterators instead of lists.
• The rules for ordering comparisons have been simplified; e.g. a heterogeneous list cannot be sorted, because all the elements of a list must be comparable to each other.
• There is only one integer type left, i.e. int; long behaves as int as well.
• The division of two integers returns a float instead of an integer; "//" can be used to get the old behaviour (see the snippet after this list).
• Text vs. data instead of Unicode vs. 8-bit.
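A short snippet illustrating a few of these changes:

# print is a function in Python 3
print("Hello")

# dividing two integers returns a float; // keeps the old floor behaviour
print(7 / 2)    # 3.5
print(7 // 2)   # 3

# range() returns a view-like iterable instead of a list
print(list(range(3)))   # [0, 1, 2]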
NumPy: -
NumPy provides useful linear algebra, Fourier transform, and random number capabilities. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined using NumPy, which allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
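A minimal sketch of NumPy as a multi-dimensional container; the array values are invented for illustration:

import numpy as np

# 2-D array holding example ratings for two apps over three weeks
ratings = np.array([[4.5, 3.9, 1.2],
                    [2.0, 4.8, 3.3]])

print(ratings.shape)         # (2, 3)
print(ratings.mean(axis=1))  # per-app average rating
print(np.random.rand(3))     # the random number capabilities mentioned above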
Pandas: -
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on powerful data structures. Before Pandas, Python was mostly used for data munging and preparation and contributed very little to data analysis; Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc.
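A hedged sketch of the load-prepare-analyze flow described above; the file name and column names are placeholders, not the project's actual dataset schema:

import pandas as pd

# load (the file and columns are assumptions for this sketch)
df = pd.read_csv("android_dataset.csv")

# prepare: drop rows whose rating is missing
df = df.dropna(subset=["rating"])

# manipulate / analyze: average rating per app category
print(df.groupby("category")["rating"].mean())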
Matplotlib: -
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application servers, and several graphical user interface toolkits. Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery in the Matplotlib documentation.

For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, there is full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
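For example, a few lines are enough to produce a simple accuracy plot; the values below are invented for illustration:

import matplotlib.pyplot as plt

epochs = [1, 2, 3, 4, 5]
accuracy = [0.72, 0.80, 0.85, 0.88, 0.90]  # made-up values

plt.plot(epochs, accuracy, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Example accuracy plot")
plt.show()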
Scikit-learn: -
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed through many Linux distributions, encouraging academic and commercial use.
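A minimal sketch of scikit-learn's consistent fit/predict interface, using a decision tree as in this project; the toy features (average rating, number of reviews) and labels are invented:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy features: [average rating, number of reviews]; labels: 1 = fraudulent, 0 = genuine
X = [[1.2, 30], [4.6, 5000], [1.8, 45], [4.4, 3200], [2.1, 60], [4.7, 8000]]
y = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))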
Install Python Step-by-Step in Windows:

Python, a versatile programming language, does not come pre-installed on your computer. Python was first released in 1991, and to this day it remains a very popular high-level programming language. Its design philosophy emphasizes code readability, with its notable use of significant whitespace. The object-oriented approach and language constructs provided by Python enable programmers to write both clear and logical code for projects. This software does not come pre-packaged with Windows.
4.5 Installation of Python on Windows:
There have been several updates to Python over the years. The question is how to install Python. It might be confusing for a beginner who is willing to start learning Python, but this section will solve that query. The version used here is Python 3.7.4, in other words Python 3.
Note: Python version 3.7.4 cannot be used on Windows XP or earlier devices.

Before you start with the installation process of Python, you first need to know your system requirements. Based on your system type, i.e. operating system and processor, you must download the appropriate installer.
4.6 System Specification

Software Requirements

• Operating System: Windows 10
• Coding Language: Python 3.7
• Library: Python libraries
• Navigator: Anaconda
• Database: CSV files

Hardware Requirements:

• Processor - Core i3
• Speed - 2.4 GHz
• RAM - 4 GB (minimum)
• Hard Disk - 120 GB
• Keyboard - Standard keyboard

A step-by-step approach to installing the Python software is given below.
Step 1: Go to the official Python website (python.org) and check for the latest and correct version for your operating system.

Step 2: Click on the Download tab.

Step 3: You can either select the "Download Python 3.7.4" button for Windows, or scroll further down and click on the download link for the required version. Here, we are downloading Python 3.7.4 for Windows.

Step 4: Scroll down the page until you find the Files option.

Step 5: Here you will see the different builds of Python listed along with the operating system they target.
Installation of Python

Step 1: Go to Downloads and open the downloaded Python installer to carry out the installation process.

Step 2: Before you click on Install Now, make sure to tick "Add Python 3.7 to PATH".

Step 3: Click on Install Now. After the installation is successful, click on Close.

With these three steps, you have successfully and correctly installed Python. Now it is time to verify the installation.
Note: The installation process might take a couple of minutes.
Verify the Python Installation

Step 1: Click on Start.
Step 2: In the Windows Run command, type "cmd".
Step 3: Open the Command Prompt.
Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.
Step 5: You will see the installed version printed, for example Python 3.7.4.

Note: If you have an earlier version of Python already installed, you may want to uninstall it first (or adjust your PATH) so that the new version is the one picked up on the command line.
Check how the Python IDLE works

Step 1: Click on Start.
Step 2: In the Windows Run command, type "python idle".
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program.
Step 4: To go ahead with working in IDLE, you must first save the file. Click on File > Save.
Step 5: Name the file and set "Save as type" to Python files. Click on Save. Here we have named the file Hey World.
Step 6: Now, for example, enter print("Hey World") and press Enter.

You will see that the given command is executed. With this, we end the walkthrough of how to install Python; you have learned how to download and set up Python for Windows on your operating system.
Note: Unlike Java, Python does not need semicolons at the end of its statements.
CHAPTER - 5
RESULTS AND DISCUSSIONS
In this chapter, we present a proposed method for the analysis of fraud in mobile applications
using user reviews. We outline the step-by-step working of our proposed model, including the
feature extraction process and the results obtained.
Figure 5.1 Start Anaconda Navigator.

Figure 5.2 Create a new environment with the project name and Python.

3) Choose the path of the project.

4) Copy the path and paste it in cmd, using the cd command to change the directory.

5) After changing the directory, give the command python app1.py and press Enter.

6) After pressing the Enter key, the URL http://127.0.0.1:5000 is shown (a minimal sketch of such an entry point is given after this walkthrough).

7) Paste the URL in the search bar of Chrome or any other browser.

8) Press the Enter key to open the interface of Detection of Fraud Apps, then click Login.

9) Enter the username (admin) and password (admin).

10) Login successful!

11) Press the OK button and click Choose File.

12) After clicking Choose File, choose the dataset.

13) Click the dataset folder and select android_dataset-v1.

14) Click Open to see this interface, then click on the Upload button.

15) See the preview of the uploaded dataset and scroll down.

16) Click on "Click to Train / Test".

17) After clicking "Click to Train", the box "Training finished!" is shown.

18) After clicking the OK button, this window is shown.

19) Click on Choose File, click app, then click static, then click upload, and open this window.

20) Select the APKPURE apk file and press the Open button. The interface appears.

21) Click Predict and wait some time; it predicts the model accuracy, predicted class, app name, targetSDK version, and file size.

22) Click on Analysis.

23) Scroll down to see the accuracy plot.
CHAPTER-6

CONCLUSION & FUTURE WORK
6.1 Conclusion:
LSTM networks are well-suited for analyzing sequential data, such as user reviews, and can
capture long-term dependencies between words in the text. By using an LSTM network for
sentiment analysis, it is possible to identify potentially fraudulent reviews based on the sentiment
expressed.

However, the output of an LSTM network can be complex and difficult to interpret. By
combining the LSTM network with a decision tree classifier, it is possible to create a simple and
interpretable model for fraud detection. The decision tree can use features extracted from the
output of the LSTM network, such as the sentiment of the review and the frequency of certain
words, to make a binary decision on whether the review is fraudulent or not.

The combination of LSTM and decision tree algorithms provides a powerful and flexible
approach to fraud app detection. The LSTM network can capture complex patterns in the data,
while the decision tree provides a clear and interpretable framework for making decisions. This
approach can be easily adapted to different types of fraud detection problems, by modifying the
input data and the features used by the decision tree classifier.
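A condensed, hedged sketch of this combination is given below, assuming Keras (TensorFlow) for the LSTM and scikit-learn for the decision tree; the vocabulary size, sequence length, layer sizes, and the randomly generated placeholder data are illustrative assumptions, not the project's exact configuration:

import numpy as np
from tensorflow.keras import Model, layers
from sklearn.tree import DecisionTreeClassifier

# placeholder data: 200 reviews, already tokenised and padded to 100 word IDs each
X_seq = np.random.randint(0, 5000, size=(200, 100))
y = np.random.randint(0, 2, size=200)          # 1 = fraudulent, 0 = genuine

# LSTM sentiment model over the review sequences
inp = layers.Input(shape=(100,))
emb = layers.Embedding(input_dim=5000, output_dim=32)(inp)
hidden = layers.LSTM(64)(emb)                  # captures long-term dependencies
out = layers.Dense(1, activation="sigmoid")(hidden)

sentiment_model = Model(inp, out)
sentiment_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
sentiment_model.fit(X_seq, y, epochs=2, batch_size=32, verbose=0)

# reuse the LSTM's learned representation as features for an interpretable decision tree
feature_extractor = Model(inp, hidden)
features = feature_extractor.predict(X_seq, verbose=0)

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(features, y)
print(tree.score(features, y))

In the full system, the tree's inputs could also include hand-crafted features such as the frequency of particular words, as described above.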
6.2 Future Work

It is important to note that the current study is limited by the use of only two to three datasets, and further research is needed to validate the effectiveness of these algorithms on other datasets. Additionally, it would be interesting to investigate the use of ensemble methods, which combine the strengths of multiple algorithms, for improved fraud app detection. Further research is also needed in the domain of building a detection system that can detect known attacks as well as novel attacks: the fraud application detection systems that exist today can only detect known attacks, and detecting new or zero-day attacks still remains a research topic due to the high false positive rate of the existing systems.
CHAPTER-7

BIBLIOGRAPHY

1. H. Song, M. J. Lynch, and J. K. Cochran, "A macro-social exploratory analysis of the rate of interstate cyber-victimization," American Journal of Criminal Justice, vol. 41, no. 3, pp. 583–601, 2016.

2. Wang, X., Yu, F., Jiang, X., & Kim, M. (2012). Learning-based mobile malware detection: Challenges and opportunities. IEEE Wireless Communications, 19(2), 47-52.

3. Kao, Y. C., Hsu, C. H., & Chen, K. Y. (2015). A novel approach for detecting mobile apps that steal users' information. Journal of Network and Computer Applications, 56, 96-103.

3. Li, Y., Li, Z., Li, Y., Xue, Y., & Zhu, S. (2015). Detecting fraud mobile applications via user behavior. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (pp. 1347-1358).

4. Alharbi, A., Al Zahrani, A., Alfarraj, O., & Almuairfi, A. (2021). Mobile App Fraud Detection System Based on Machine Learning and Deep Learning. In 2021 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT) (pp. 522-527). IEEE.

5. Huang, C., Li, L., Liu, J., Xu, Y., & Li, J. (2020). A Fraudulent Mobile App Detection Method Based on Blockchain and Machine Learning. IEEE Access, 8, 165563-165574.

6. Lee, J., & Kim, D. (2019). Fraudulent Mobile Application Detection using Machine Learning Techniques. In 2019 11th International Conference on Information Technology Convergence and Services (ITCS) (pp. 1-6). IEEE.

7. Wijaya, D. D. D., Rasyid, F. M., & Amalia, A. (2021). An Overview of Blockchain Technology for Mobile Application Security. Journal of Computer Science and Information Technology.

8. Zhou, Y., Zhang, J., Yang, X., & Xiang, T. (2017). Fraudulent Mobile App Detection using Network Analysis and Machine Learning Techniques. In 2017 13th International Conference on Computational Intelligence and Security (CIS) (pp. 70-74). IEEE.

9. A. Almogren, et al. "A Survey of Mobile Application Security Techniques." IEEE Communications Surveys & Tutorials, vol. 20, no. 1, 2017, pp. 115-139.

10. Zong, et al. "Detecting Repackaged Android Applications with Negative Selection Algorithm." Journal of Computer Science and Technology, vol. 28, no. 3, 2013, pp. 428-435.

11. D. De Freitas, et al. "A Survey of Techniques for Detecting Malicious Android Applications." Journal of Information and Data Management, vol. 5, no. 1, 2014, pp. 1-13.

14. M. Grace, et al. "Unsafe Exposure Analysis of Mobile In-App Advertisements." Proceedings of the 5th ACM Conference on Security and Privacy in Wireless and Mobile Networks, 2012, pp. 101-112.

15. S. Arzt, et al. "FlowDroid: Precise Context, Flow, Field, Object-sensitive and Lifecycle-aware Taint Analysis for Android Apps." ACM SIGPLAN Notices, vol. 49, no. 6, 2014, pp. 259-269.

16. P. Alaei and F. Noorbehbahani, "Incremental anomaly-based Fraud Application detection system using limited labeled data," in Web Research (ICWR), 2017 3rd International Conference on, 2017, pp. 178-184.

17. M. Saber, S. Chadli, M. Emharraf, and I. El Farissi, "Modeling and implementation approach to evaluate the Fraud Application detection system," in International Conference on Networked Systems, 2015, pp. 513-517.

18. M. Tavallaee, N. Stakhanova, and A. A. Ghorbani, "Toward credible evaluation of anomaly-based Fraud Application-detection methods," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 5, pp. 516-524, 2010.

19. A. S. Ashour and S. Gore, "Importance of Fraud Application detection system (IDS)," International Journal of Scientific and Engineering Research, vol. 2, no. 1, pp. 1-4, 2011.

20. M. Zamani and M. Movahedi, "Machine learning techniques for Fraud Application detection," arXiv preprint arXiv:1312.2177, 2013.

21. N. Chakraborty, "Fraud Application detection system and Fraud Application prevention system: A comparative study," International Journal of Computing and Business Research (IJCBR), ISSN (Online) 2229-6166, 2013.