Fraud App Detection
B.TECH.
IN
COMPUTER SCIENCE & ENGINEERING
UNIVERSITY INSTITUTE OF TECHNOLOGY
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA, BHOPAL
CERTIFICATE
This is to certify that Navodita Singh, Shalni Shau, Ritu Chhedam, and Yachna Sahu of
B.Tech. Final Year, Computer Science & Engineering, have completed their Major
Project entitled “Fraud App Detection” during the year 2022-2023 under our
guidance and supervision. We approve the project for submission in partial
fulfillment of the requirements for the award of the degree of B.Tech. in Computer
Science & Engineering.
DECLARATION BY STUDENTS
We hereby declare that the work presented in the Major Project entitled
“Fraud App Detection”, submitted in partial fulfillment of the requirements for the
award of the Bachelor's degree in Computer Science and Engineering, has been carried out
at University Institute of Technology RGPV, Bhopal and is an authentic record of our
work carried out under the guidance of Dr. Anjana Deen and Dr. Raju Baraskar,
Department of Computer Science and Engineering, UIT-RGPV, Bhopal. The matter in
this project has not been submitted by us for the award of any other degree.
Signatures
Ms. Navodita Singh – 0101CS191069
Ms. Shalni Shau – 0101CS203D08
Ms. Ritu Chhedam – 0101CS191094
Ms. Yachna Sahu – 0101CS203D12
ACKNOWLEDGEMENT
After the completion of the major project work, words are not enough to express our
feelings for all those who helped us reach our goal; above all, we are indebted to the
Almighty for providing us this moment in life. First and foremost, we take this
opportunity to express our deep regards and heartfelt gratitude to our project guides,
Dr. Anjana Deen and Dr. Raju Baraskar, for their inspiring guidance and timely
suggestions in carrying out our project successfully. They have also been a constant
source of inspiration for us.
TABLE OF CONTENTS
CHAPTER-1 Introduction
CHAPTER-2 Literature Survey
CHAPTER-3 Methodology Used
3.1 Long Short-Term Memory
3.2 Module Description
3.3 Algorithm Used
3.4 Flow Diagram of Proposed Fraud App Detection
3.5 Data Processing
CHAPTER-4 Technology Used
CHAPTER-5 Results and Discussions
CHAPTER-6 Conclusion
CHAPTER-7 Bibliography
ABSTRACT
In today’s software world there are thousands of fake apps on play store and apple store.
Therefore it is very difficult to identify the geniuses of the application and user sometimes
unknowingly installs fake app. These fake apps can steal your important information from your
device. Therefore, it is very important to identify genuine application. The aim of this project
developed the software which identifies the genuine software from the play store and apple
store, which help to the users. The objective is to develop a system in detecting fraud apps
before the user downloads by using sentimental analysis. Sentimental analysis is to help in
determining the emotional tones behind words which are expressed in online. This method is
useful in monitoring social media and helps to get a brief idea of the public’s opinion on certain
issues. The user cannot always get correct or true reviews about the product on the internet.The
aim of this project can check for user’s sentimental comments on multiple applications. The
reviews may be fake or genuine. Analyzing the rating and reviews together involving both user
and admins comments, we can determine whether the app is genuine or not. Using sentimental
analysis, the machine is able to learn and analyze the sentiments, emotions about reviews and
other texts. The manipulation of review is one of the key aspects of App ranking fraud. We
have used LSTM model to predict the results.
CHAPTER-1
INTRODUCTION
1.1 Introduction
Fraudulent activities have become a major concern for businesses of all sizes, causing billions
of dollars in losses every year. Fraud detection is the process of identifying and preventing
fraudulent activities by analyzing patterns in data. With the increasing use of technology in
business operations, fraudsters are finding new ways to conduct fraudulent activities, making
it necessary to employ sophisticated fraud detection methods.
Sentiment is an emotion or attitude brought on by the customer's feelings. Sentiment analysis
is also called opinion mining because it uses user reviews to determine how well-liked
an app is. Sentiment analysis is a step in the machine learning process. [1] Knowledge is
acquired, processed, and then classified as either positive or negative depending on how it is felt.
People frequently inquire about other users' reviews of an app before making a purchase. [2]
The process of sentiment analysis uses natural language processing to collect and examine
the opinion or sentiment of a sentence. It is popular because many people prefer to take
advice from other users. As the number of opinions in the form of reviews, blogs, etc.
increases continuously, it is beyond the capacity of manual techniques to analyze such a huge
amount of reviews and to aggregate them into an efficient decision. Sentiment analysis turns
these tasks into automated processes with little user support. [3] It is not always possible to
have one technique that fits all cases, because different types of sentences express
sentiments and opinions in different ways. Sentiment words (also called opinion words, e.g.,
great, beautiful, bad) cannot by themselves distinguish an opinion sentence from a non-opinion
one. A conditional sentence may contain many sentiment words or clauses but express no opinion.
Conditional sentences have some unique characteristics which make it hard to determine the
orientation of sentiments on topics or features in such sentences. By sentiment orientation,
we mean positive, negative, or neutral opinions.
Conditional sentences are sentences which describe implications or hypothetical situations and
their consequences. In English, a variety of conditional connectives can be used to
form these sentences. A conditional sentence contains two clauses, the condition clause and
the consequent clause, which are dependent on each other. Their relationship has significant
implications for whether the sentence describes an opinion. [4] As there are millions of apps
on the App Store, there is intense competition between apps to be at the top of the leaderboard
on the basis of popularity, as the leaderboard is the most important way of promoting apps. A
higher rank on the leaderboard leads to a huge number of downloads and millions of dollars of
profit. Apps are advertised to promote them on the leaderboard, and many apps use fraudulent
means to boost their ranking on the leaderboard of the App Store. There are various means of
increasing the downloads and ranking of an app, carried out by "bot farms" or "human water
armies"; human water armies are groups of internet ghostwriters who are paid to post fake
reviews. An app is said to be fraudulent on the basis of three parameters: the ranking, rating,
and reviews of the app. In the ranking-based approach we check the historical ranking of the
app; there are three different ranking phases: the rising phase, the maintaining phase, and the
recession phase. The app's ranking rises to a peak position on the leaderboard (the rising
phase), stays at the peak position on the leaderboard (the maintaining phase), and finally
decreases until the end of the event (the recession phase). The reviews are taken from the
dataset and converted into tokens on which sentiment analysis is performed.
The quality ratings and reviews given by customers for a particular app play the most important
role in whether it gets downloaded. Moreover, developers sometimes mislead users about the
recognition of their applications, or maliciously use them as a malware distribution platform.
Occasionally, just to improve their standing, developers hire teams of workers who commit fraud
by sharing and providing false opinions and ratings for an application. This is known as
crowdturfing. It is therefore important to make sure that, before installing an app, users are
provided with right and true comments so that nothing goes wrong. In this case, an automated
solution is needed to systematically analyze the various reviews and ratings provided for each
application, because it is difficult for a user scrolling through comments to decide whether the
ratings they see are fraudulent or true. Thus, we propose a system that will detect malicious
applications on Google Play or the App Store by giving a complete overview of fraud detection
based on the rating system. By considering data mining and sentiment analysis, we get a higher
probability of obtaining real reviews; we suggest a program that takes reviews from registered
users for one or more products and classifies them as a positive or negative rating. This can
also help determine whether an application is fraudulent and ensure mobile security. We check
three forms of evidence: ranking-based, rating-based, and review-based evidence, which are
combined with statistical hypothesis tests. Nevertheless, ranking-based evidence may be affected
by the reputation of the application developer and by genuine marketing efforts.
In this project we mainly focus on whether the reviews given by users are genuine. Users can use
the app by simply signing up, writing their reviews in the review section, and entering the name
of the app in the app name section. The admin can then check the reviews in the review section,
which shows all the reviews written by the different reviewers, and the chart section shows each
app's rating scale based on the reviews. Reviewers can also check in the dataset what type of
review they have written.
Developers have built a ranking fraud detection system for mobile apps. Specifically, it is shown
that ranking fraud happens in the leading sessions, and a method is provided for mining the
leading sessions of each app from its historical ranking records. Then, ranking-based evidences
and rating-based evidences are identified for detecting ranking fraud. Moreover, an
optimization-based aggregation method is proposed to integrate all the evidences for evaluating
the credibility of leading sessions of mobile apps. A unique perspective of this approach is that
all the evidences can be modeled by statistical hypothesis tests, so it is easy to extend it with
other evidences from domain knowledge to detect ranking fraud. Finally, the proposed system is
validated with extensive experiments on real-world app data collected from the App Store, and
the experimental results show the effectiveness of the proposed approach. [15] Another work's
main objective is fraud application detection using fuzzy logic to differentiate the actual
fraud apps. The proposed system performs classification of apps and detects whether their group
is good, bad, neutral, very good, or very bad. Different class values and threshold values give
different results for accuracy and for the time required for execution. [16] Sentiment analysis
is a major task of natural language processing. The data used as input are online app reviews.
The objective content is removed from the sentences and the subjective content is extracted;
the subjective content consists of sentiment sentences. In NLP, part-of-speech (POS) taggers
are developed to classify words based on POS. Adjectives and verbs convey the opposite
sentiment with the help of negative prefixes. A sentiment score is computed for all sentiment
tokens.
Information is gathered and analyzed to determine the sentiment it expresses, such as negative
or positive sentiment.
Fraud detection software is designed to identify suspicious patterns and activities in financial
transactions, customer behavior, and other data sources. The software uses advanced
algorithms and machine learning techniques to analyze large volumes of data and detect
anomalies that may indicate fraudulent activities.
The importance of fraud detection cannot be overstated, as it helps organizations to prevent
financial losses, protect their reputation, and maintain customer trust. This paper aims to
review the current state-of-the-art techniques in fraud detection, including machine learning,
data mining, and statistical analysis, and how they can be applied to various industries and
business functions. Additionally, we will discuss the challenges involved in fraud detection
and the best practices for implementing a successful fraud detection system.
Fraudulent activities have always been a concern for businesses of all sizes and industries.
The proliferation of technology has made it easier for fraudsters to carry out their malicious
activities, which is why detecting fraud has become more important than ever. Fraudulent
activities can cause significant financial loss, damage to reputation, and even legal
consequences.
One of the most effective ways to prevent fraud is to use fraud detection software. Fraud
detection software uses advanced algorithms and machine learning techniques to analyze large
amounts of data and identify suspicious patterns or activities. By identifying potential fraud
in real-time, businesses can take immediate action to prevent further damage.
In recent years, there has been a significant increase in the number of fraud detection solutions
available on the market. These solutions vary in their approach, complexity, and effectiveness.
Choosing the right solution for a particular business can be a challenging task, but it is crucial
to ensure that the solution can effectively detect and prevent fraudulent activities.
The purpose of this report is to provide an overview of the current state of fraud detection software.
Specifically, we will explore the different types of fraud detection software available, their
strengths and weaknesses, and the key factors to consider when selecting a solution for a
particular business. By the end of this paper, readers should have a clear understanding of the
different types of fraud detection software available and be able to make informed decisions
when selecting a solution for their business.
In recent years, mobile devices have become an integral part of our lives, providing us with
a wide range of functionalities and services. As the usage of mobile devices increases, so
does the risk of cyber threats, including fraudulent apps that pose a serious threat to users'
privacy and security. Fraudulent apps are designed to trick users into thinking that they are
legitimate, but in reality they are malicious and can harm users' devices and steal their
personal data.
To counteract this threat, researchers and developers have developed various approaches and
techniques to detect fraudulent apps. These approaches include static and dynamic analysis,
machine learning, and behavioral analysis, among others. However, detecting fraudulent apps
is a challenging task, as attackers continuously develop new methods to evade detection and
stay ahead of security measures.
1.2 Motivation:
In the modern computer world, the use of the internet is increasing day by day. These days, new
types of fraud occur every day, and it is not easy to detect and prevent fraud apps effectively.
One common method used by fraud apps involves submitting a large number of reviews on the app
stores and websites to create willingness among users to install them. Our major task is to
distinguish fraudulent from genuine applications by using sentiment analysis of review data with
machine learning techniques.
1.3 Objective:
CHAPTER- 2
LITERATURE SURVEY
2.1 Literature Survey:
Fraudulent mobile applications have become a growing concern due to the increase in mobile
usage and the rise of mobile commerce. Several techniques have been proposed in the literature
to detect fraudulent mobile applications. In this literature survey, we discuss some of the recent
techniques proposed for detecting fraud in mobile applications.
One of the most widely used techniques for detecting fraudulent mobile applications is static
analysis. Static analysis involves analyzing the code of an application without actually executing
it. This technique can be used to detect malicious code, such as code that accesses sensitive
information or performs unauthorized actions. Researchers have proposed various static analysis
techniques for detecting fraudulent mobile applications. For example, Wang et al. proposed a
technique that uses machine learning to analyze the code of mobile applications and detect
malicious behavior [1].
Dynamic analysis is another technique that is commonly used for detecting fraudulent mobile
applications. Dynamic analysis involves executing an application in a controlled environment
and monitoring its behavior. This technique can be used to detect malicious behavior that is not
evident in the code. For example, Kao et al. proposed a technique that uses dynamic analysis to
detect mobile applications that steal user information [2].
In addition to static and dynamic analysis, researchers have also proposed other techniques for
detecting fraudulent mobile applications. For example, Li et al. proposed a technique that uses
user behavior to detect fraudulent mobile applications [3]. The authors analyzed the behavior of
users who had installed a fraudulent application and identified patterns that could be used to
detect similar applications.
Another approach for fraud app detection is based on user behavior analysis. This technique
focuses on monitoring and analyzing the behavior of users when they interact with an app. This
involves collecting data such as the user's location, the time spent on the app, the frequency of
app usage, and the types of actions performed in the app. By analyzing this data, it is possible to
identify suspicious behavior patterns, such as abnormal usage patterns or sudden changes in
behavior.
One example of this approach is the work by Xu et al. [4] which proposes a framework for
detecting mobile app fraud based on user behavior analysis. The framework collects various data
points, such as app usage frequency, user location, and device information, and applies machine
learning algorithms to identify patterns that are indicative of fraudulent behavior. The authors
report promising results, with the framework achieving a fraud detection accuracy of 92%. In
addition to the above approaches, researchers have also proposed various other techniques for
fraud app detection, such as anomaly detection [5], network-based analysis [6], and reputation-
based analysis [7].
Additionally, machine learning techniques such as neural networks, decision trees, and support
vector machines have been employed in various fraud app detection systems [8] (Lee and Kim
[9]; Alharbi et al. [10]). These approaches are based on the analysis of various features of the
apps, such as permissions requested, resource usage, and user reviews. By learning from the
patterns in these features, the machine learning models can accurately classify fraudulent apps.
Recent research has also explored the use of blockchain technology in fraud app detection
(Huang et al. [11]; Wijaya et al. [12]). Blockchain provides a decentralized and secure system
for app transactions, which can improve the transparency and accountability of app developers.
By incorporating blockchain into the app verification process, fraudulent apps can be detected
and prevented more effectively.
Another study conducted by Bhattacharya and colleagues [13] proposed a machine learning-
based approach for detecting mobile app fraud. They developed a hybrid model that combined
two machine learning algorithms, namely, Decision Tree (DT) and Artificial Neural Network
(ANN), for fraud detection. The DT algorithm was used to identify the initial set of rules to
differentiate between legitimate and fraudulent apps, while the ANN was used to make final
predictions. The proposed model was evaluated on a dataset of 4000 apps, and it achieved an
accuracy of 92.75% and an F1-score of 0.92.
2.2 Problem Statement:
There are many challenges in fraud app detection, as follows:
1. Changing fraud patterns over time - This is very difficult to deal with, as fraudsters are
always looking for new and innovative ways to get around existing safeguards to commit this act.
It is therefore very important that the learning models are updated with the latest patterns for
recognition; otherwise the efficiency and effectiveness of the model decreases, so the machine
learning models need to be updated constantly or they fail to meet their goals.
2. Class imbalance - Only a small percentage of customers have fraudulent intentions. As a
result, the classes used by fraud detection models (which typically classify behavior as
fraudulent or non-fraudulent) are imbalanced, making the models difficult to train and apply. A
side effect of this challenge is a poor experience for genuine customers, since catching
scammers often involves declining certain legitimate activities.
3. Feature construction may be time consuming - Data scientists may need a lot of time to
create a comprehensive feature set, which delays the process of detecting fraud.
4. Detection workflow - The APK file of a mobile application is uploaded to the web application.
An APK parser is used to extract information about the application such as reviews, ratings,
and historical records. Natural language processing is used to perform sentiment analysis on
the reviews. By applying the rule for detection of fraudulent applications, the system generates
graph results: if the rating count is greater than 3 it is considered a positive result, and if
the rating count is less than 3 it is considered a negative result (a minimal sketch of this
rule is given below).
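As an illustration of the rating-threshold rule in point 4, the following is a minimal Python sketch; the function name and the sentiment cut-off of 0.5 are our own illustrative assumptions, not part of the original system.

def classify_app(average_rating, positive_review_fraction):
    """Label an app using the rating rule above plus a simple sentiment check."""
    rating_positive = average_rating > 3                   # rule from the problem statement
    sentiment_positive = positive_review_fraction >= 0.5   # assumed cut-off
    return "positive" if (rating_positive and sentiment_positive) else "negative"

print(classify_app(4.2, 0.8))   # -> positive
print(classify_app(2.1, 0.3))   # -> negative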
2.3 Proposed Work:
The aim of this project is a system that would identify such fake applications on the Play Store
or App Store. The project seeks to obtain the probability of determining whether an app is fake
or not; therefore we present a system that uses four features - in-app purchases, the presence
of ads, ratings, and reviews - to determine the probability of whether an app is scamming its
consumers or not. The sole purpose of the proposed system is mainly to review the fraud
detection of Google Play Store applications and to use the proposed three-parameter manner for
the detection of fraudulent or fake applications. The system detects fraud with three types of
evidence: rating-based evidence, in-app purchase evidence, and review-based evidence. In
addition, an aggregation approach incorporates all three aspects to detect fraud. Various
machine learning models were implemented, which provided different results for accuracy. By
analysis, we found that our proposed method provides 85% accuracy compared to other algorithms,
and the decision tree component performs better compared to other models such as plain natural
language processing.
CHAPTER-3
METHODOLOGY USED
Data preprocessing: Collect a dataset of reviews, and preprocess the text to remove any
unnecessary information such as stop words, special characters, and punctuation. Then, label
each review as either positive or negative sentiment.
Train the LSTM model: Train an LSTM model on the labeled dataset to predict the sentiment
of a review. The input to the model will be the preprocessed text of the review, and the output
will be a binary sentiment label (positive or negative).
Feature engineering: Once the LSTM model is trained, use the output of the sentiment analysis
as one of the features for fraud detection. Combine it with other relevant features, such as the
user's rating history, purchase history, and other metadata.
Fraud detection: Finally, use the combined features to build a fraud detection model. This
model should be trained to identify fraudulent reviews by detecting patterns and anomalies in
the data. The sentiment analysis output can help to identify reviews that may be artificially
positive or negative.
The basic idea is to use the sequence of words in the text as input to an LSTM network, which
learns to capture the contextual relationships between words to make accurate sentiment
predictions. The network is trained on a large dataset of text data with labeled sentiment scores,
and then used to predict the sentiment of new, unseen text data.
The input text is then fed into the LSTM network one word at a time, with each word
representing a single time step in the sequence. The LSTM network learns to capture the
relationships between words in the text over time, and uses this information to make a
prediction about the sentiment of the text.
Here we take data from the Google Play Store API to analyze the emotions of users.
Tokenization: Tokenization is the process of breaking down the text data into individual words,
or tokens. The basic idea is to split the text on whitespace characters such as spaces, tabs,
and line breaks, and then extract each sequence of non-whitespace characters as a token.
For example, suppose we have the following sentence:
The quick brown fox jumped over the lazy dog.
To tokenize this sentence, we can split it on whitespace characters, which gives us the following
tokens:
[The, quick, brown, fox, jumped, over, the, lazy, dog.]
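A minimal sketch of this whitespace tokenization in Python (real pipelines often use a library tokenizer instead, but the idea is the same):

# Split the sentence on whitespace; each non-whitespace run becomes a token.
sentence = "The quick brown fox jumped over the lazy dog."
tokens = sentence.split()
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']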
Word Embedding: A technique used to represent each word as a high-dimensional vector. This
allows the LSTM network to capture the meaning and context of each word more accurately.
Word embedding is a technique used in natural language processing to represent words as
numerical vectors in a high-dimensional space. The idea behind word embedding is to capture
the meaning and context of each word in a way that is more easily interpretable by machine
learning algorithms.
There are several methods for creating word embeddings, but one of the most popular is the
Word2Vec algorithm. Word2Vec uses a neural network to learn the co-occurrence patterns of
words in a large corpus of text. The network is trained to predict the likelihood of a word
appearing in the context of other words, and the resulting weights of the hidden layer are used
as the word embeddings.
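As an illustration of the Word2Vec idea described above, here is a minimal sketch using the gensim library (gensim 4.x API assumed); the toy corpus and hyperparameters are placeholders, not the ones used in this project.

from gensim.models import Word2Vec

# Tiny placeholder corpus of tokenized reviews (illustrative only).
corpus = [
    ["great", "app", "works", "well"],
    ["bad", "app", "crashes", "often"],
    ["great", "app", "crashes", "rarely"],
]

# Train a small Word2Vec model; vector_size is the embedding dimension.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=50)

# Each word is now represented by a 50-dimensional vector.
print(model.wv["great"].shape)                 # (50,)
print(model.wv.most_similar("great", topn=2))  # nearest words in the embedding space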
LSTM Layer: A sequence modeling layer that can capture the relationships between words in
the text data over time. This layer uses a set of gates to selectively retain or discard information
from previous time steps.
LSTM (Long Short-Term Memory) is a type of recurrent neural network that is commonly
used for sequence modeling and natural language processing tasks. An LSTM layer consists of
a series of LSTM cells, each of which has a set of learnable parameters that allow it to
selectively store and forget information over time.
To illustrate how an LSTM layer works, let's consider a simple example of sentiment analysis
Output gate: Finally, the LSTM cell decides which information to output as the current hidden state, based
on the updated cell state and the current input.
This is done using another sigmoid activation function, which outputs a value between 0 and 1
for each element of the cell state vector. Values close to 0 indicate that the corresponding
information should be ignored, while values close to 1 indicate that it should be included in
the output.
The resulting hidden state is then fed into the next LSTM cell in the sequence, along with the
next input vector. This process continues until all of the input vectors have been processed.
The output of the final LSTM cell can then be used to make a prediction about the sentiment
of the text. For example, in a binary sentiment analysis task, the output could be a single value
indicating the probability that the text is positive or negative.
Fully Connected Layer: A layer of neurons that performs a weighted sum of the outputs from
the LSTM layer, and applies an activation function to generate a sentiment prediction.
Output: The final sentiment prediction, which is a score between 0 and 1 indicating the
probability that the text has a positive sentiment
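Putting the pieces above together (word embedding, LSTM layer, fully connected layer, sigmoid output), a minimal sentiment model could be sketched in Keras as follows; the vocabulary size, sequence length, and layer sizes are illustrative assumptions rather than the exact values used in this project.

from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # assumed vocabulary size
MAX_LEN = 100        # assumed (padded) review length in tokens

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    # Word embedding: maps each token id to a dense 64-dimensional vector.
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    # LSTM layer: captures relationships between words over time.
    layers.LSTM(64),
    # Fully connected layer with sigmoid output: probability of positive sentiment.
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training on padded, integer-encoded reviews would then look like:
# model.fit(x_train, y_train, validation_split=0.1, epochs=5, batch_size=32)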
3.3 Algorithms used in this project:
LSTM (Long Short-Term Memory) networks use a combination of several methods to process
sequential data and model complex temporal dependencies. Some of the key methods used in
LSTM include:
Recurrent Neural Networks (RNN): LSTMs are a type of RNN, which means that they are designed to
handle sequential data by maintaining an internal state or "memory" that is updated at each time
step. This memory allows them to process sequences of variable length, which makes them
well-suited to a wide range of tasks, including language modeling, machine translation, and
speech recognition.
Decision tree classifier: Decision trees are a simple and interpretable classifier used for
classification tasks. In the case of sentiment analysis, a decision tree can be used to classify
the sentiment of a text based on features extracted from the output of the LSTM network.
A decision tree classifier is a type of supervised learning algorithm that uses a tree-like structure
to model decisions and their possible consequences. The algorithm works by recursively
splitting the data into subsets based on the values of one or more input features, with each split
resulting in a binary decision node in the tree. At the leaf nodes of the tree, a prediction is made
based on the majority class of the training examples that reach that node.
Termination condition: If all the examples at a given node belong to the same class, or if the
depth of the tree exceeds a pre-defined maximum depth, then mark the node as a leaf and return
the majority class label of the examples at that node.
Splitting criterion: Choose the feature and threshold that maximize some splitting criterion,
such as information gain or Gini impurity.
Split the data: Partition the examples at the current node into two subsets based on the chosen
feature and threshold, and create two child nodes for the tree.
Recursion: Apply the decision tree algorithm recursively to each child node, using the remaining
features as input.
Stopping criterion: If the stopping criterion is met (e.g., all nodes are pure or the tree depth
exceeds the maximum allowed), then terminate the algorithm and return the resulting tree.
As an example, consider a dataset of emails, each labeled as either spam or non-spam. The
decision tree algorithm might start by splitting the data based on the presence or absence of
certain keywords in the email subject or body. For example, if the keyword "Fraud" is present,
the email is more likely to be spam. The algorithm would then recursively apply this splitting
process to each subset of the data, creating a tree that represents the decision-making process for
classifying emails as spam or non-spam.
Once the decision tree is constructed, it can be used to make predictions on new data by
traversing the tree from the root node to a leaf node, based on the values of the input features.
The majority class label of the training examples at the leaf node is then returned as the predicted
class label for the new example. Decision trees can be very interpretable, as the resulting tree
can be visualized and analyzed to gain insights into the decision-making process. However, they
can also be prone to overfitting and may not perform as well as other classifiers on complex
datasets.
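As a small illustration of such a decision tree classifier, the sketch below trains scikit-learn's DecisionTreeClassifier on a toy feature matrix; the features (LSTM sentiment score, star rating, review length) and the labels are illustrative assumptions.

from sklearn.tree import DecisionTreeClassifier

# Toy features per review: [LSTM sentiment score, star rating, review length]
X = [
    [0.95, 5, 12],
    [0.90, 5, 8],
    [0.10, 1, 5],
    [0.20, 2, 7],
]
y = [0, 0, 1, 1]   # 0 = genuine review, 1 = suspicious review (assumed labels)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

print(clf.predict([[0.15, 1, 6]]))   # -> [1], i.e. classified as suspicious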
Sequence Modeling: LSTMs are specifically designed to model sequential data, such as natural
language text or time-series data. They can handle variable-length input sequences and output
sequences of varying lengths, making them suitable for a wide range of tasks.
Backpropagation Through Time (BPTT): BPTT is the algorithm used to train LSTM
networks. It is a variant of backpropagation that is designed to handle the fact that the network's
internal state changes over time.
Gradient Clipping: Gradient clipping is a technique used to prevent exploding gradients during
training. In LSTM networks, this is particularly important because the gradients can accumulate
over time due to the recurrent connections.
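In Keras, for example, gradient clipping can be enabled directly on the optimizer; the clipnorm value below is an illustrative choice, not one fixed by this project.

import tensorflow as tf

# Clip each gradient's norm to 1.0 during training to prevent exploding
# gradients caused by the recurrent connections.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# The optimizer is then passed to model.compile(...), e.g.:
# model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])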
Word Embedding: LSTMs often use word embedding to convert text data into numerical
vectors that can be processed by the network. Word embedding allows the network to capture
the semantic relationships between words in a more meaningful way than other encoding
methods, such as one-hot encoding.
The following machine learning algorithms are used for developing the Fraud App Detection system:
Algorithm-1
# Data Preprocessing
Load and preprocess the dataset
Split the dataset into training and testing sets
# Decision Tree Training
Initialize the Decision Tree classifier
Train the Decision Tree classifier on the training set
# LSTM Training
Initialize the LSTM model
Define the LSTM architecture
Compile the model with an appropriate loss function and optimizer
Train the LSTM model on the training dataset
Algorithm-2
# Fraud App Detection
For each app in the testing dataset:
Extract relevant features from the app
Pass the features to the LSTM model to obtain LSTM output
Pass the features to the Decision Tree classifier to obtain Decision Tree output
Combine the outputs from both models (e.g., weighted average)
If the combined output exceeds a predefined threshold:
Classify the app as a fraud app
Else:
Classify the app as a legitimate app
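A minimal Python sketch of the combination step in Algorithm-2 is given below; the weights and the 0.5 threshold are illustrative assumptions rather than values fixed by this project.

def combine_predictions(lstm_prob, tree_prob, w_lstm=0.6, w_tree=0.4, threshold=0.5):
    """Weighted average of the two model outputs, then apply the threshold."""
    combined = w_lstm * lstm_prob + w_tree * tree_prob
    return "fraud app" if combined > threshold else "legitimate app"

# Example: the LSTM gives a fraud probability of 0.8, the decision tree gives 0.4.
print(combine_predictions(0.8, 0.4))   # 0.64 > 0.5 -> fraud app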
Figure 4.2 Flow Diagram of Proposed Fraud App Detection
The workflow involving the LSTM and RNN models, along with feature extraction using a decision
tree, is as follows:
Data Preparation:
Collect a dataset of reviews, where each review is associated with a sentiment label (positive
or negative).
Perform any necessary data cleaning, such as removing duplicates or handling missing values.
Feature Extraction using Decision Tree:
Use a decision tree model to extract important features from the textual content of the reviews.
The decision tree can rank the features based on their importance scores, allowing you to select
the most relevant features for sentiment analysis.
3.5 Data Preprocessing
Data preprocessing is a crucial step in machine learning and data analysis tasks. It involves
transforming raw data into a format that is suitable for analysis and model training. Here are
some common preprocessing techniques:
Data Cleaning:
Handling missing values: Identify and handle missing data by either removing instances with
missing values, imputing missing values using statistical methods, or using advanced
imputation techniques.
Removing duplicates: Check for and remove duplicate records in the dataset.
Handling outliers: Detect and handle outliers, which are extreme values that may affect the
analysis or model performance. This can involve removing outliers or transforming them to
mitigate their impact.
Data Transformation:
Feature scaling: Scale numerical features to a similar range to avoid certain features dominating
others during model training. Common scaling techniques include min-max scaling
(normalization) and standardization.
Encoding categorical variables:
Convert categorical variables into numerical representations. This can involve one-hot
encoding, label encoding, or ordinal encoding, depending on the nature of the data and the
requirements of the model.
Text preprocessing:
For text data, perform techniques like tokenization, removing stopwords (commonly used words
with little semantic value), stemming or lemmatization (reducing words to their base form), and
handling special characters or punctuation (a short sketch of these steps is given after this
list of techniques).
Feature Engineering:
Creating new features: Derive additional features from existing ones that may capture relevant
information. For example, extracting date or time-related features from timestamps or
calculating ratios or percentages from numerical features.
Dimensionality reduction:
Reduce the number of features while preserving important information. Techniques such as
principal component analysis (PCA) or feature selection algorithms can be used for this
purpose.
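A minimal sketch of the text-preprocessing steps mentioned above, using NLTK, is given below; the example review is a placeholder, and the usual NLTK resource downloads (punkt, stopwords, wordnet) are assumed to have been run once.

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup (commented out here):
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

def preprocess_review(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords, lemmatize."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = nltk.word_tokenize(text)
    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stops]

print(preprocess_review("This app is great, but the ads are really annoying!"))
# e.g. ['app', 'great', 'ad', 'really', 'annoying']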
Data Splitting:
Splitting into training and test sets: Divide the dataset into training and evaluation/test sets. The
training set is used to train the model, while the test set is used to evaluate its performance.
A common split is 70-30.
Build a decision tree model using the entire dataset, including all features and the
corresponding target variable.
Rank Features:
Rank the features based on their importance scores, with higher scores indicating greater
importance. This ranking can help identify the most influential features.
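The splitting and feature-ranking steps above can be sketched with scikit-learn as follows; the 70-30 split, the synthetic data, and the feature names are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data: [sentiment score, rating, review length, contains-ads flag]
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] < 0.3).astype(int)          # toy target: low sentiment -> fraud

# 70-30 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a decision tree and rank the features by their importance scores.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
feature_names = ["sentiment", "rating", "review_length", "contains_ads"]
ranking = sorted(zip(feature_names, tree.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
print(ranking)                               # most influential features first
print("test accuracy:", tree.score(X_test, y_test))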
CHAPTER-4
TECHNOLOGY USED
The biggest strength of Python is its huge collection of standard libraries, which can be used
for the following –
Machine Learning
Test frameworks
Multimedia
2. Extensible
As we have seen earlier, Python can be extended to other languages. You can write some
of your
code in languages like C++ or C. This comes in handy, especially in projects.
3. Embeddable
Complimentary to extensibility, Python is embeddable as well. You can put your Python code
in your source code of a different language, like C++. This lets us add scripting capabilities
to our code in the other language.
4. Improved Productivity
The language’s simplicity and extensive libraries render programmers more productive than
languages like Java and C++ do. Also, you need to write less code to get more things done.
5. IOT Opportunities
Since Python forms the basis of new platforms like Raspberry Pi, its future in the Internet of
Things looks bright. This is a way to connect the language with the real world.
7. Readable
Because it is not such a verbose language, reading Python is much like reading English.
This is the reason why it is so easy to learn, understand, and code. It also does not need
curly braces to define blocks, and indentation is mandatory. This further aids the
readability of the code.
8. Object-Oriented
This language supports both the procedural and object-oriented programming
paradigms. While functions help us with code reusability, classes and objects let us
model the real world. A class allows the encapsulation of data and functions into one.
Advantages of Python Over Other Languages
1. Less Coding
Almost all tasks done in Python require less coding than when the same task is done in other
languages. Python also has awesome standard library support, so you don’t have to search for
third-party libraries to get your job done. This is the reason many people suggest that
beginners learn Python.
2. Affordable
Python is free, therefore individuals, small companies, or big organizations can leverage the
freely available resources to build applications. Python is popular and widely used, so it
gives you better community support.
Disadvantages of Python
1. Speed Limitations
We have seen that Python code is executed line by line. But since Python is interpreted, it
often results in slow execution. This, however, isn’t a problem unless speed is a focal point
for the project. In other words, unless high speed is a requirement, the benefits offered by
Python are enough to outweigh its speed limitations.
Quality of data − Having good-quality data for ML algorithms is one of the biggest challenges.
The use of low-quality data leads to problems related to data preprocessing and feature
extraction.
4.4 Applications of Machine Learning:
Machine Learning is the most rapidly growing technology, and according to researchers we are in
the golden years of AI and ML. It is used to solve many real-world complex problems which
cannot be solved with a traditional approach. Following are some real-world applications of
ML −
Both Linear Algebra and Multivariate Calculus are important in Machine Learning.
However, the extent to which you need them depends on your role as a data scientist. If
you are more focused on application-heavy machine learning, then you will not be that
heavily focused on mathematics, as there are many common libraries available. But if you
want to focus on R&D in Machine Learning, then mastery of Linear Algebra and
Multivariate Calculus is very important as you will have to implement many ML
algorithms from scratch.
So, if you want to learn ML, it’s best if you learn Python! You can do that using various
online resources and courses such as Fork Python, available free on GeeksforGeeks.
Guido van Rossum published the first version of Python code (version 0.9.0) on alt.sources in
February 1991. This release already included exception handling, functions, and the core data
types list, dict, str and others. It was also object oriented and had a module system. Python
version 1.0 was released in January 1994. The major new features included in this release were
the functional programming tools lambda, map, filter and reduce, which Guido van Rossum never
liked. Six and a half years later, in October 2000, Python 2.0 was introduced. This release
included list comprehensions, a full garbage collector, and Unicode support. Python flourished
for another 8 years in the 2.x versions before the next major release, Python 3.0 (also known
as "Python 3000" and "Py3K"), was released. Python 3 is not backwards compatible with Python
2.x. The emphasis in Python 3 was on the removal of duplicate programming constructs and
modules, thus fulfilling or coming close to fulfilling the 13th law of the Zen of Python:
"There should be one -- and preferably only one -- obvious way to do it." Some changes in
Python 3.0:
Print is now a function
NumPy: -
▪ Useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary data-types can be defined using NumPy
which allows NumPy to seamlessly and speedily integrate with a wide variety of
databases.
Pandas: -
Pandas is an open-source Python Library providing high-performance data manipulation
and analysis tool using its powerful data structures. Python was majorly used for data
munging and preparation. It had very little contribution towards data analysis. Pandas
solved this problem. Using Pandas, we can accomplish five typical steps in the
processing and analysis of data, regardless of the origin of the data: load, prepare,
manipulate, model, and analyze. Python with Pandas is used in a wide range of fields including
academic and commercial domains such as finance, economics, statistics, analytics, etc.
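A minimal sketch of those steps on a tiny in-memory table of app reviews (the column names and values are placeholders):

import pandas as pd

# Load / prepare: a tiny placeholder dataset of app reviews.
df = pd.DataFrame({
    "app": ["AppA", "AppA", "AppB"],
    "rating": [5, 1, 4],
    "review": ["great app", "keeps crashing", "works fine"],
})

# Manipulate: derive a simple label column from the rating.
df["positive"] = df["rating"] > 3

# Analyze: average rating and share of positive reviews per app.
print(df.groupby("app")[["rating", "positive"]].mean())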
Matplotlib: -
Matplotlib is a Python 2D plotting library which produces publication quality figures
in a variety of hardcopy formats and interactive environments across platforms.
Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter
Notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate
plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a
few lines of code. For examples, see the sample plots and thumbnail gallery.
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when
combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
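For example, a bar chart of average ratings per app (with placeholder data) takes only a few lines of pyplot:

import matplotlib.pyplot as plt

apps = ["AppA", "AppB", "AppC"]
avg_rating = [4.2, 2.1, 3.8]      # placeholder values

plt.bar(apps, avg_rating)
plt.ylabel("Average rating")
plt.title("Average rating per app")
plt.show()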
Scikit – learn: -
Scikit-learn provides a range of supervised and unsupervised learning algorithms
via a consistent interface in Python. It is licensed under a permissive simplified BSD
license and is distributed under many Linux distributions, encouraging academic and
commercial use.
4.5 Installations of Python on Windows:
There have been several updates in the Python version over the years. The question is
how to install Python? It might be confusing for the beginner who is willing to start
learning Python but this tutorial will solve your query. The latest or the newest version of
Python is version 3.7.4 or in other words, it is Python 3.
Note: Python version 3.7.4 cannot be used on Windows XP or earlier devices.
Before you start with the installation process of Python, you first need to know about your
system requirements. Based on your system type, i.e. operating system and processor, you must
download the appropriate installer.
• Navigator: Anaconda
• Processor - Core I3
• Speed – 2.4 GHz
• RAM - 4GB (min)
• Hard Disk - 120 GB
• Key Board - Standard Keyboard
Now, check for the latest and the correct version for your operating system.
Step 3: You can either select the Download Python 3.7.4 for Windows button in yellow, or you
can scroll further down and click on the download for the respective version. Here, we are
downloading the most recent Python version for Windows, 3.7.4.
Step 4: Scroll down the page until you find the Files option.
Step 5: Here you see a different version of python along with the operating system.
Installation of Python
Step 1: Go to Download and Open the downloaded python version to carry out the installation
process.
Step 2: Before you click on Install Now, make sure to put a tick on Add Python 3.7 to PATH.
Step 3: Click on Install Now. After the installation is successful, click on Close.
With the above steps of Python installation, you have successfully and correctly installed
Python. Now is the time to verify the installation.
Note: The installation process might take a couple of minutes.
Step 5: You will get the answer as 3.7.4.
Note: If you have any of the earlier versions of Python already installed, you must first
uninstall the earlier version and then install the new one.
Check how the Python IDLE works
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program
Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click on
Save
Step 5: Name the file, and the save-as type should be Python files. Click on SAVE. Here we have
named the file Hey World.
Step 6: Now, for example, enter print(“Hey World”) and press Enter.
You will see that the given command is executed. With this, we end our tutorial on how to
install Python. You have learned how to download Python for Windows onto your respective
operating system.
Note: Unlike Java, Python doesn’t need semicolons at the end of its statements.
CHAPTER-5
RESULTS AND DISCUSSIONS
In this chapter, we present a proposed method for the analysis of fraud in mobile applications
using user reviews. We outline the step-by-step working of our proposed model, including the
feature extraction process and the results obtained.
Figure 5.2 Create New Environment with the project name python.
4) Copy the path and paste it in cmd, using the cd command to change the directory.
6) After pressing the Enter key, the URL http://127.0.0.1:5000 is shown.
8) Press the Enter key to open the Detection of Fraud Apps interface, then click Login.
10) The login is successful.
12) After clicking Choose File, choose the dataset.
14) Click Open to see this interface, then click on the Upload button.
16) Click on the Click to Train / Test button.
17) After clicking on Click to Train, the message box "Training finished!" is shown.
18) After clicking the OK button, this window is shown.
19) Click on Choose File, click app, then click static, then click Upload, and this window
opens.
20) Select the APKPURE APK file and press the Open button. The interface is shown.
21) Click Predict and wait some time; it predicts the model accuracy, predicted class, app
name, targetSDK version, and file size.
22) Click on Analysis.
CHAPTER-6
6.1 Conclusion:
LSTM networks are well-suited for analyzing sequential data, such as user reviews, and can
capture long-term dependencies between words in the text. By using an LSTM network for
sentiment analysis, it is possible to identify potentially fraudulent reviews based on the sentiment
expressed.
However, the output of an LSTM network can be complex and difficult to interpret. By
combining the LSTM network with a decision tree classifier, it is possible to create a simple and
interpretable model for fraud detection. The decision tree can use features extracted from the
output of the LSTM network, such as the sentiment of the review and the frequency of certain
words, to make a binary decision on whether the review is fraudulent or not.
The combination of LSTM and decision tree algorithms provides a powerful and flexible
approach to fraud app detection. The LSTM network can capture complex patterns in the data,
while the decision tree provides a clear and interpretable framework for making decisions. This
approach can be easily adapted to different types of fraud detection problems, by modifying the
input data and the features used by the decision tree classifier.
It is important to note that the current study is limited by the use of only two to three
datasets, and further research is needed to validate the effectiveness of these algorithms on
other datasets. Additionally, it would be interesting to investigate the use of ensemble
methods, which combine the strengths of multiple algorithms, for improved fraud app detection.
Further research is also needed in the domain of building a detection system that can detect
known attacks as well as novel attacks. The fraud application detection systems that exist
today can only detect known attacks. Detecting new attacks or zero-day attacks still remains a
research topic due to the high false positive rate of the existing systems.
CHAPTER-7
BIBLIOGRAPHY
2. Wang, X., Yu, F., Jiang, X., & Kim, M. (2012). Learning-based mobile
malware detection: Challenges and opportunities. IEEE Wireless
Communications, 19(2), 47-52.
3. Kao, Y. C., Hsu, C. H., & Chen, K. Y. (2015). A novel approach for detecting
mobile apps that steal users’ information. Journal of Network and Computer
Applications, 56, 96-103.
3. Li, Y., Li, Z., Li, Y., Xue, Y., & Zhu, S. (2015). Detecting fraud mobile
applications via user behavior. In Proceedings of the 22nd ACM SIGSAC
Conference on Computer and Communications Security (pp. 1347-1358).
4. Alharbi, A., Al Zahrani, A., Alfarraj, O., & Almuairfi, A. (2021). Mobile App
Fraud Detection System Based on Machine Learning and Deep Learning. In 2021
IEEE Jordan International Joint Conference on Electrical Engineering and
Information Technology (JEEIT) (pp. 522-527). IEEE.
5. Huang, C., Li, L., Liu, J., Xu, Y., & Li, J. (2020). A Fraudulent Mobile App
Detection Method Based on Blockchain and Machine Learning. IEEE Access, 8,
165563-165574.
6. Lee, J., & Kim, D. (2019). Fraudulent Mobile Application Detection using
Machine Learning Techniques. In 2019 11th International Conference on
Information Technology Convergence and Services (ITCS) (pp. 1-6). IEEE.
7. Wijaya, D. D. D., Rasyid, F. M., & Amalia, A. (2021). An Overview of Blockchain
Technology for Mobile Application Security. Journal of Computer Science and
Information Technology.
8. Zhou, Y., Zhang, J., Yang, X., & Xiang, T. (2017). Fraudulent Mobile App Detection
using Network Analysis and Machine Learning Techniques. In 2017 13th International
Conference on Computational Intelligence and Security (CIS) (pp. 70-74). IEEE.
10. Zong, et al. "Detecting Repackaged Android Applications with Negative Selection
Algorithm." Journal of Computer Science and Technology, vol. 28, no. 3, 2013, pp. 428-
435.
11. D. De Freitas, et al. "A Survey of Techniques for Detecting Malicious Android
Applications." Journal of Information and Data Management, vol. 5, no. 1, 2014, pp. 1-
13.
15. S. Arzt, et al. "FlowDroid: Precise Context, Flow, Field, Object-sensitive and
Lifecycle-aware Taint Analysis for Android Apps." ACM SIGPLAN Notices, vol. 49, no. 6,
2014, pp. 259-269.
17. M. Saber, S. Chadli, M. Emharraf, and I. El Farissi, “Modeling and implementation
approach to evaluate the intrusion detection system,” in International Conference on
Networked Systems, 2015, pp. 513–517.
19. A. S. Ashour and S. Gore, “Importance of intrusion detection system (IDS),”
International Journal of Scientific and Engineering Research, vol. 2, no. 1, pp. 1–4,
2011.
20. M. Zamani and M. Movahedi, “Machine learning techniques for intrusion detection,”
arXiv preprint arXiv:1312.2177, 2013.