Final Report FYP 12 Aug
Spring 2022
Supervised By
Dr. Qamar Mehmood
PROJECT REPORT
Version V 3.0
NUMBER OF MEMBERS 3
MEMBERS’ SIGNATURES
Supervisor’s Signature
Capital University of Science and Technology, Islamabad Department of Computer Science
APPROVAL CERTIFICATE
Committee Signatures:
Supervisor:
Project Coordinator:
Head of Department:
(Dr. Abdul Basit)
DECLARATION
We hereby declare that no portion of the work in this final year project has been submitted in
support of an application for another degree or qualification of this or any other institute. We
further declare that this undergraduate project, neither as a whole nor in part, has been copied
from any source, except where references have been provided.
MEMBERS’ SIGNATURES
ACKNOWLEDGEMENT
We would like to thank Allah Almighty for enabling us to complete this project and its report.
We are highly obliged to Dr Qamar Mahmood and HoD Dr Abdul Basit for giving us the
opportunity to work on this project. We dedicate this acknowledgement to all our professors who
shared their ideas and knowledge, and guided us throughout the project. We would
also like to thank our seniors, who have been a source of encouragement and always extended their
help.
Executive Summary
Android, the mobile operating system developed by Google, has become the most popular mobile
operating system in the world. Like any OS it has vulnerabilities, and its popularity makes it
an attractive target for attackers, so detecting Android malware is important. There are various
techniques for detecting Android malware. In this project we worked mainly with static and
dynamic analysis and with machine learning (ML) algorithms. We performed static and dynamic
analysis on Android APK files provided in the CIC-MalDroid 2020 dataset, and separately applied
ML algorithms to the dataset. We then compared the accuracies of the two detection approaches,
i.e. static and dynamic analysis versus ML algorithms. All four ML algorithms achieved better
accuracy than static and dynamic analysis. We contributed to research by obtaining higher
accuracies for the ML algorithms than those reported in the CIC-MalDroid 2020 research paper.
We also built a desktop application with various options to assist malware analysts,
researchers and Android mobile users.
Table of Contents
DECLARATION
ACKNOWLEDGEMENT
Chapter 1
Introduction
Chapter 2
2.4.3. Malware Analyst Use-case Diagram
Chapter 3
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
References
List of Figures
Figure 1.1: Architecture of Android OS [5]
Figure 1.2: Trojan attack on Android
Figure 1.3: Gaining root access in Android
Figure 1.4: Session hijacking [26]
Figure 1.5: Man in the Middle SSL attack
Figure 1.6: Compromising authentication by user enumeration using a brute force attack
Figure 1.7: Project work breakdown
Figure 1.8: Project timeline
Figure 4.3: Authentication module
Figure 4.4: Dataset error
Figure 4.5: Login function
Figure 4.6: Prune dataset
Figure 4.7: View dataset
Figure 4.8: Individual report generation
Figure 4.9: Report of selected users
Figure 4.10: Report of all users
Figure 4.11: Get dataset module
Figure 4.12: Dataset
Figure 4.13: Apply ML algorithm module
Figure 4.14: SVM module
Figure 4.15: Random Forest module
Figure 4.16: Decision Trees module
Figure 4.17: Selection of algorithm in tuning
Figure 4.18: Editing hyperparameters in tuning
Figure 5.13: Test case group report generation 1
Figure 5.14: Test case group report generation 2
Figure 5.15: Test case group report generation 3
Figure 5.16: Test case save csv
Figure 5.17: Test case implement ML algorithms Random Forest
Figure 7.12: SVM Example 1
Figure 7.13: SVM Example 2
Figure 7.14: Grid search CV
Figure 7.15: Excel file of APK results
Figure 7.16: Code snippet
Figure 7.17: Library function
Figure 7.18: Use of compulsory parameters
Figure 7.19: Bar graph
List of Tables
Chapter 1
Introduction
The use of Android OS has increased rapidly. Laptops, mobiles, tablets, watches and other
devices run this OS. Being the most popular operating system, Android has become a prominent
target for attackers. Android malware is one of the most dangerous threats on the internet and
has been on the rise for several years. Malware can cause severe damage and compromise the
confidentiality and integrity of our data. Therefore, it is important to analyze and detect
malware and eliminate it from our systems.
Android, the mobile operating system developed by Google, has become the most popular mobile
operating system. The history of Android begins in October 2003. The system was created by a
California-based company named Android Inc. for mobiles and digital cameras. Android Inc. was
acquired by Google in 2005, and two years later Google released Android as a mobile operating
system [1].
On 23rd September 2008, Android 1.0, the first commercial version, was released. Android 1.0
and 1.1 did not have specific code names [2]. Android versions from 1.5 onwards are named
after desserts and sweets, such as Cupcake (1.5), Donut (1.6), Eclair (2.0), Froyo (2.2) and
Gingerbread (2.3) [3]. The current stable version of Android is Red Velvet Cake (11), released
on 8th September 2020 [4]. Google no longer supports Nougat (7.0) or earlier versions.
The architecture of the Android operating system comprises four layers. Android applications
sit at the top. The Application Framework (second) layer offers numerous higher-level services
to applications in the form of Java classes. The third layer holds a set of libraries,
including libc, the SQLite database and SSL libraries, together with the Android runtime. The
Android runtime consists of some core libraries and the Dalvik virtual machine, a virtual
machine specially designed and optimized for Android that plays the role of a Java virtual
machine. It uses core features of Linux such as memory management and multi-threading, which
are intrinsic to the Java language. The Linux kernel is the bottom layer. Linux enforces
privilege control, so each app that runs on Android is given its own process ID by Linux.
Android holds a 72.73% share of the mobile OS market worldwide [5] and runs on multiple types
of devices such as TVs, tablets and mobiles [6]. The Open Handset Alliance (OHA) is a
consortium whose goal is to develop open standards for mobile devices, promote innovation
in mobile phones and provide a better experience for consumers at a lower cost [7], so
Android was designed to run on devices from multiple manufacturers. Due to the popularity and
diversity of Android, attackers are targeting this operating system. According to [8], Android,
with a worldwide market share of 85%, is the mobile operating system most heavily targeted by
malware. Attackers target these devices to compromise the confidentiality and integrity of
users’ data.
Figure 1.2: Trojan attack on Android
Android malware detection is important because the number of Android users has grown to
1.6 billion [1]. This large user base includes many non-professionals. A layperson might be
aware of general threats and risks but most probably does not know about the vulnerabilities
of the OS or how those vulnerabilities can be exploited.
Different kinds of malware perform different malicious activities. Some popular types of
Android malware are described here. Trojans that run on the Android operating system are
usually either specially crafted programs designed to look like desirable software (e.g.,
games, system updates or utilities), or copies of legitimate programs that have been
repackaged to include harmful components [9]. Key loggers record the user's keystrokes.
Ransomware takes significant user data such as photos, documents and videos and encrypts it in
order to demand that a ransom be paid to the malware makers. Spyware enables attackers to
access all the information on a phone, including contacts, calls, texts and other sensitive
information, and can also hijack the microphone and camera [10]. It is therefore important to
perform a security check-up and detect whether the system is under attack from any malware.
As the number of Android users is increasing rapidly and cybercriminals are increasingly
interested in Android-based devices, detection of Android malware is important to make Android
devices more secure. In this project we will focus on the analysis and detection of Android
malware. The malware detection process has three steps: malware analysis, feature
extraction/selection and classification/detection. For malware analysis, the following four
methods are used [11]:
In static malware analysis, the analysis is done without actually running the executable
files. Static analysis is basically signature identification: querying cryptographic hash
codes and strings. It consumes fewer resources because the malware is not executed on a
machine during analysis. However, static analysis cannot detect malware that uses code
obfuscation, because no matching signature is found.
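As a minimal sketch of signature identification by cryptographic hash, the fragment below hashes a sample's bytes with SHA-256 and looks the digest up in a set of known-malware digests. The `KNOWN_MALWARE_HASHES` set, the sample bytes and the function names are hypothetical stand-ins for a real signature database; this is an illustration, not the tooling used in this project.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical signature database: digests of samples already known to be malicious.
KNOWN_MALWARE_HASHES = {sha256_hex(b"fake-malicious-apk-bytes")}


def matches_known_signature(apk_bytes: bytes) -> bool:
    """Static check: the sample is never executed, only hashed and looked up."""
    return sha256_hex(apk_bytes) in KNOWN_MALWARE_HASHES


print(matches_known_signature(b"fake-malicious-apk-bytes"))   # known sample
print(matches_known_signature(b"fake-malicious-apk-bytes!"))  # one byte appended
```

Changing a single byte changes the digest, which illustrates why repackaged or obfuscated malware can evade pure signature matching.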
The second analysis technique is dynamic malware analysis, also known as behavior analysis.
In this technique the malware is executed in a controlled environment such as a virtual
machine or emulator and its behavior is analyzed, for instance by inspecting API calls and
system calls. This technique can detect unknown and new malware as well as obfuscated code,
but it needs considerable resources to execute the malware and analyze its behavior, and it
cannot detect zero-day malware.
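One simple way to turn observed behavior into features is to count how often each system call occurs in a trace. The sketch below assumes a hypothetical strace-like log format (`name(args) = result`); the trace lines are invented for illustration and do not come from the project's experiments.

```python
import re
from collections import Counter

# Hypothetical trace lines in an strace-like "name(args) = result" format.
TRACE = """\
openat(AT_FDCWD, "/data/data/com.example/files/a.db", O_RDONLY) = 3
read(3, "<data>", 4096) = 4096
connect(4, {sa_family=AF_INET, sin_port=htons(443)}, 16) = 0
sendto(4, "<data>", 512, 0, NULL, 0) = 512
read(3, "<data>", 4096) = 1024
"""

# A system call name is the identifier directly before the first "(".
SYSCALL_NAME = re.compile(r"^(\w+)\(")


def syscall_counts(trace: str) -> Counter:
    """Count how often each system call name appears in the trace."""
    counts = Counter()
    for line in trace.splitlines():
        match = SYSCALL_NAME.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = syscall_counts(TRACE)
print(counts.most_common())
```

The resulting frequency table can serve as one row of a behavior-based feature vector.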
Hybrid malware analysis combines static and dynamic analysis. Each technique has its own
limitations: static analysis is cheap but cannot see through code obfuscation, while dynamic
analysis is resource-intensive but can detect new variants of malware. By combining the two, a
hybrid approach can overcome these limitations and detect malware with higher accuracy.
The fourth analysis technique is memory analysis, which is becoming popular for Android
malware analysis because it provides a more comprehensive view of malware by observing code
and memory images. This technique is based on memory forensics: the malware is executed, and
afterwards memory images are analyzed to obtain information about running programs. Features
extracted with this technique give more accurate results, and it can detect API hooking, DLL
injection and hidden processes.
After analysis and feature selection, the next step is detection of malware. Detection
techniques for Android malware [12] can be categorized in several ways; here we discuss three
main categories:
● Signature based
● Behavior based
● Heuristic based
Behavior-based detection techniques detect malware by observing the behavior of an executable
and analyzing its functionality. The behavior of the executable under inspection is compared
with the behavior of known malware executables. This method therefore detects new variants of
malware efficiently, but it cannot detect zero-day malware efficiently and it also requires
manual work.
Heuristic-based detection techniques use both behavior and signature features of an executable
to detect malware. Other hybrid features such as API calls and n-grams are also used. This
technique applies data mining and machine learning for detection and classification, and it is
helpful in detecting new variants of malware as well as zero-day malware.
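To make the n-gram feature concrete, the following sketch counts contiguous bigrams in a sequence of API calls. The API call names are hypothetical and chosen only for illustration; a real feature extractor would work on calls recovered from the analyzed APKs.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return all contiguous n-grams of the sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Hypothetical API call sequence observed for one app.
api_calls = ["getDeviceId", "openConnection", "sendTextMessage",
             "getDeviceId", "openConnection"]

bigram_counts = Counter(ngrams(api_calls, 2))
print(bigram_counts)
```

Each distinct n-gram becomes one dimension of the feature vector, with its count as the value.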
Several datasets are publicly available. The one we will use for training and testing machine
learning algorithms is the CIC-MalDroid-2020 dataset. This dataset is recent and large, which
makes it well suited to malware analysis. In this project we will implement different machine
learning algorithms on the selected feature vector, calculate their accuracies on the dataset
and compare the results to see which ML algorithm performs with better accuracy. We will
perform static and dynamic analysis of malware in a configured virtual environment. We will
also analyze which technique is better among heuristic-based, behavior-based and static
analysis. We will use VirusTotal (a well-known tool for malware detection) to cross-check
whether our detection is correct.
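The comparison of accuracies can be sketched with plain Python. The example below is an illustration under stated assumptions, not the project's models: it uses a tiny invented two-feature dataset (1 = malware, 0 = benign), a hand-written k-nearest-neighbors classifier and a majority-class baseline, and reports the fraction of correct predictions for each.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def majority_predict(train_y):
    """Baseline: always predict the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]


# Hypothetical feature vectors; 1 = malware, 0 = benign.
train_X = [(0, 1), (1, 0), (1, 1), (0, 0), (8, 9), (9, 8), (9, 9)]
train_y = [0, 0, 0, 0, 1, 1, 1]
test_X = [(0, 2), (8, 8), (1, 2), (9, 7)]
test_y = [0, 1, 0, 1]

results = {
    "3-NN": accuracy(test_y, [knn_predict(train_X, train_y, x) for x in test_X]),
    "majority baseline": accuracy(test_y, [majority_predict(train_y)] * len(test_X)),
}
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

Evaluating every model on the same held-out test set is what makes the accuracy figures comparable.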
Rooting is the process of removing the limitations or restrictions that Android imposes on a
tablet or mobile. Rooting or jailbreaking a device bypasses the data safety and encryption
schemes of the system. On a standard Android configuration, no app can access any other app’s
data, no matter how many permissions the app asks for.
Figure 1.3: Gaining root access in Android
This all changes when we run an application as root [23]. A person with access to the root
file system can disable critical system apps, delete critical system files, and thus prevent
the normal functioning of the device. A person who clearly understands how to use a rooted
device just needs to be more careful, but for a non-technical user the lack of root access in
Android is a safeguard.
Gaining root access also requires circumventing the security restrictions put in place by the
Android operating system [24]. For example, millions of gamers play PUBG Mobile. The number of
gamers using hacks to gain an advantage over other players is increasing rapidly. Hacks such
as sharpshooter work only on rooted devices, and the hack APKs are downloaded from untrusted
sources, so the gamer cannot know how malicious those APKs are.
Applications may use TLS/SSL during authentication but often fail to encrypt network traffic
elsewhere, even though it is vital to protect sensitive communications such as a plain-text
session ID [25]. Encryption should be used for all authenticated connections, especially on
Internet-accessible web pages.
Figure 1.4: Session hijacking [26]
The above figure shows cross-site scripting (XSS) session hijacking, where an attacker has
exploited server vulnerabilities and injected a malicious client-side script into the web
pages. When a user authenticates, the server returns page code containing the injected script.
When the compromised page is loaded, the malicious code executes on the user's side. If the
server has not set the HttpOnly attribute on the session cookie, the injected script can read
the session key and send the session cookie to the attacker.
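The HttpOnly mitigation can be sketched with Python's standard `http.cookies` module. The cookie name `session_id` and the session value are hypothetical; the point is only that the server marks the cookie so that client-side scripts cannot read it.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str) -> str:
    """Build a Set-Cookie header whose cookie is hidden from client-side scripts."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["httponly"] = True  # not readable by scripts in the page
    cookie["session_id"]["secure"] = True    # only sent over HTTPS
    return cookie.output()


header = session_cookie_header("hypothetical-session-value")
print(header)
```

With the HttpOnly flag set, an injected script can still run in the page, but it can no longer read the session cookie directly.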
connection wherein the certificate isn’t validated can be exposed to unauthorized access
or modification.
A user requests a website and the request is intercepted by the attacker. The attacker
establishes an SSL session with the legitimate site using his own private key. The server
treats the attacker as an ordinary client, so the site responds with its SSL certificate. The
attacker then creates a fake certificate and sends it to the user. If the user’s side accepts
this certificate, the user will keep communicating with the attacker, assuming it is the
server.
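On the client side, the defence is to validate the certificate chain and the host name. As a small illustration using Python's standard `ssl` module (not the project's code), the default client context enables both checks, while the second context shows the anti-pattern that makes the attack above possible.

```python
import ssl

# The default client context verifies the certificate chain and the host name.
secure_context = ssl.create_default_context()
print(secure_context.verify_mode == ssl.CERT_REQUIRED)
print(secure_context.check_hostname)

# Anti-pattern, shown only for contrast: with both checks disabled,
# a fake certificate presented by an attacker would be accepted.
insecure_context = ssl.create_default_context()
insecure_context.check_hostname = False  # must be disabled before verify_mode
insecure_context.verify_mode = ssl.CERT_NONE
```

A client that connects with `secure_context` rejects a certificate that does not chain to a trusted authority or does not match the requested host.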
accordance with a security policy. Authorization and authentication procedures must
determine what a user, service, or application is allowed to do.
For example, an attacker enters different usernames and passwords into a banking app. For
one username, the application reports that the password is incorrect. The attacker now
knows that the username is valid and that only the password remains to be guessed. User
enumeration has occurred and the search space has been reduced. The attacker then uses
malware that tries different password combinations. If this brute force attack succeeds,
authentication and authorization are compromised and the attacker gains unauthorized
access.
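A common countermeasure to user enumeration is to return the same message whether the username or the password is wrong. The sketch below assumes a hypothetical in-memory user store with a plain-text password purely for brevity; a real system would store salted password hashes and add rate limiting.

```python
import hmac

# Hypothetical in-memory user store; real systems store salted password hashes.
USERS = {"alice": "correct horse battery staple"}

GENERIC_ERROR = "Invalid username or password."


def login(username: str, password: str) -> str:
    """Return the same error for an unknown user and for a wrong password."""
    stored = USERS.get(username, "")
    # compare_digest takes time independent of where the first mismatch occurs.
    password_ok = hmac.compare_digest(stored.encode(), password.encode())
    if username in USERS and password_ok:
        return "Login successful."
    return GENERIC_ERROR


print(login("alice", "wrong"))    # valid user, wrong password
print(login("mallory", "wrong"))  # unknown user, same message
```

Because both failures produce an identical response, the attacker cannot tell which usernames exist.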
1.2.2 Vulnerable Android Libraries
Vulnerabilities in Android libraries can be used to carry out malware attacks on systems.
Such malware can cause severe damage once it enters a system, so its detection is important.
The two vulnerabilities in Android libraries described below were discovered in the past.
The Android package installer was unable to verify the validity of certificates because
certificate chain verification was not done properly. Before installation, the application
certificate is verified, but an identity could claim to be issued by another identity, so a
malicious certificate appeared to be a verified one.
All applications have a unique identity, but due to this improper certificate validation, a
malicious application was able to copy the identity of a legitimate application. This was
called the Fake ID vulnerability [25].
There was a vulnerability in the Android Open Source Project (AOSP) browser through which
attackers bypassed the Same Origin Policy (SOP). SOP is a security mechanism that allows a
script to access information from the site it originated from, but not information from pages
of another site. A web application is thus prevented from getting information from another
tab currently opened by the user. This vulnerability allowed attackers to obtain sensitive
user information present in other open tabs. It was exploited by sending a malformed
javascript: URL handler with a null byte, which led to the SOP not being enforced [26]. The
AOSP browser is no longer part of Android devices, so the problem has been resolved.
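The rule that the Same Origin Policy enforces can be sketched as a comparison of scheme, host and port. This is a simplified illustration using Python's standard `urllib.parse`; the URLs are hypothetical, and real browsers apply additional rules that are not modelled here.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(url_a: str, url_b: str) -> bool:
    """Two URLs share an origin only if scheme, host and port all match."""
    return origin(url_a) == origin(url_b)


print(same_origin("https://bank.example/account", "https://bank.example:443/transfer"))  # True
print(same_origin("https://bank.example/account", "https://evil.example/steal"))         # False
print(same_origin("https://bank.example/", "http://bank.example/"))                      # False
```

A script loaded from one origin is allowed to read data only from URLs for which this check succeeds.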
1.3. Existing Examples / Solutions
Machine learning algorithms have been widely used for malware detection because they detect
malware with high accuracy [11]. Various machine learning algorithms have been applied to
different malware datasets. Mostly ML classification algorithms are used, since samples must
be classified as malware or benign. Static and dynamic analysis have also been used for
malware detection. The most recent work on our selected dataset was done in 2020, in which
four machine learning algorithms (Random Forest, Decision Tree, Naïve Bayes and K-nearest
neighbors) were applied and their accuracies reported.
Our contribution is that we use a recent dataset, MalDroid-2020. Because attackers keep
introducing new malware, the advantage of this recent dataset is that it contains records of
the latest malware samples. The feature vector will consist of those features that are
strongly correlated with malware detection. We will apply different ML algorithms to check
and compare their accuracies. We will do signature-based and behavior-based analysis of
malware by configuring different malware detection tools. We will also analyze which
technique is better among machine learning, dynamic analysis and static analysis. We will
develop a desktop application that lets users prune the dataset, apply machine learning
algorithms and optimize or tune the hyperparameters. This application will be useful because
users can run their own experiments on our selected dataset and obtain accuracy results and
visual graphs.
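Hyperparameter tuning of the kind described above is often done with an exhaustive grid search. The sketch below is a generic illustration under stated assumptions: the parameter names `n_trees` and `max_depth` and the `toy_validation_accuracy` function are hypothetical stand-ins for training a real model and measuring its validation accuracy.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "train a model and return its validation accuracy".
def toy_validation_accuracy(n_trees, max_depth):
    return 1.0 - abs(n_trees - 100) / 1000 - abs(max_depth - 8) / 100


grid = {"n_trees": [10, 100, 500], "max_depth": [4, 8, 16]}
best_params, best_score = grid_search(grid, toy_validation_accuracy)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes, so the grid is usually kept small or replaced by a randomized search.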
1.4. Stakeholders
A malware analyst is a person whose job is to identify, examine and understand various
forms of malware and their delivery methods. This malware includes different types of
adware, bots, bugs, rootkits, spyware, ransomware, Trojan horses, viruses and
worms. Malware analysts disassemble and reverse engineer the malicious code after
the organization’s incident response team has identified an attack. This product will help
malware analysts in analyzing malware applications [13]. If the feature vector is the same
as the one on which we train and test the ML algorithms, malware analysts can detect malware
through the ML model.
Android mobile users can be categorized by age group or by whether they are professionals or
non-professionals. The category most vulnerable to malware apps is people who download APKs
from the mobile browser without any idea that these APKs might be malicious. In future, this
product can be converted into a website that takes an APK and detects whether it is malicious
or benign, so that mobile users can verify whether a particular app is malicious. Researchers
from the University of Cambridge [14] found that 87% of all Android smartphones are exposed
to at least one critical vulnerability, so almost all Android mobile users are stakeholders
[15].
1.4.3. Researchers
Researchers in the field of malware analysis and detection from all over the world can
benefit from this research, as we will apply different machine learning algorithms for
detecting malware and compare their accuracies. Moreover, we will perform static and dynamic
analysis on malware APKs and compare the results of this analysis with the machine learning
models, so researchers can build on these findings too.
This product will be useful to malware analysis labs in future, as it can be used for
detection of malware applications, and those labs can buy the licensed product once it is
available in the market. They can also detect malware apps through their own techniques
and cross-check them against ours to verify whether they have detected malware
correctly.
This product will be useful for researchers, who can do research on our selected dataset
and compare the results of our experiments with their own. Moreover, we will
perform static and dynamic analysis on malware APKs and compare the results of
this analysis with the machine learning models, so researchers can build on these
findings too.
This project can be converted into a product in future and offered to clients on a
monthly subscription basis. A client can be any Android mobile user who wants to check
whether an app is malicious.
● Python provides machine learning libraries, many frameworks and extensions, which make
implementing machine learning algorithms easy. Python is also highly valued by cyber
security experts, as it is used in penetration testing, malware detection, etc. [16]. So,
we will use Python for implementing machine learning algorithms for malware detection.
● Jupyter Notebook is easy to use, as it combines code, output and explanatory text in a
single document. We’ll use Jupyter Notebook for implementing the ML algorithms.
● Kali Linux is a safe environment for testing, so we’ll perform static and dynamic analysis
of APKs in Kali Linux using Apktool, dex2jar and JD-GUI.
● Spyder is an easy-to-use Python IDE, so we’ll use Spyder for the interface development.
● Google Colab notebooks use cloud resources, so we’ll use them for the implementation of
machine learning algorithms.
1.7. Project work breakdown
Figure 1.7: Project work breakdown
1.8. Timeline
Chapter 2
Requirement analysis is the process of determining user expectations for a new or modified
product. These characteristics, referred to as requirements, must be quantifiable, relevant
and specific. In software engineering, such requirements are called functional requirements.
Administrator
Malware Analyst
Researcher
Mobile User
regarding malware types.
3  Malware analyst can view the visual results of cross validation.  Core  Completed
4  Mobile user can view the visual results of cross validation.  Core  Completed
5  Researcher can view the visual results of cross validation.  Core  Completed
6  Malware analyst can take guidance from the help manual.  Intermediate  Completed
7  Researcher can take guidance from the help manual.  Intermediate  Completed
Figure 2.1: Admin use case diagram
Table 2.4 describes the ‘signup’ use case.
Description: The user can sign up the first time he/she uses the system by providing a
username, password and email address and selecting a role.
Trigger: Signup button
Preconditions: The user provides a username, password and email address, chooses a role and
clicks on the sign-up button.
Post conditions: The user will be signed up to the system and will now be able to use the
system.
Normal Flow:
1. User: clicks the sign-up button to request the sign-up form. System: provides a sign-up
form for the user.
2. User: fills in the form by providing a username, password and email address. System: signs
up the user.
Alternative Flows: User cancels the current form.
Exceptions:
1. The system is not responding.
2. The database is not responding.
3. User has not filled in the form correctly.
The table 2.5 is the ‘sign in’ description table in which the sign in details are provided.
Description: User will sign into the system by providing username and password.
Trigger: Sign-in button
Preconditions: User provides username, password and then clicks on the sign-in button.
Post conditions: The user will be signed in to the system.
Normal Flow:
1. User: The user clicks the sign-in button to request the sign-in form.
   System: The system provides the sign-in form.
2. User: The user fills out the form by providing a username and password.
   System: The system allows the user to log in.
Alternative Flows: The user cancels the current form.
Exceptions: 1. The database is not responding.
2. The user has not filled in the form correctly.
3. The system is not responding.
The table 2.6 is the ‘View Individual report’ description table in which the details of reports
of the users are provided. These details can be viewed by the system admin only.
Preconditions: Admin must be logged in to his/her account and select the view individual
report option.
Post conditions: The record of the user will be shown.
The table 2.7 is the ‘View general report’ description table in which the details of reports of
the users are provided. These details can be viewed by the system admin only.
Preconditions: Admin will click on view general report option to view the records of
the users.
Post conditions: Records of the users will be shown.
Figure 2. 2 - Mobile user use case diagram
The table 2.8 is the ‘View mobile’s behavior information’ description table in which the
details of information shown to the mobile user are provided.
Table 2. 8 - Use Case 5 View mobile’s behavior information
Description: Mobile user can view the information of suspicious behavior of mobile
after a malware attack.
Preconditions: Mobile user will be shown the option of mobile’s behavior information.
Post conditions: Mobile user will be shown all the information regarding mobile’s
behavior.
Normal Flow:
1. User: Mobile user selects the option to view the mobile’s behavior information.
   System: The system displays all the information about how the mobile’s behavior becomes suspicious after a malware attack.
Alternative Flows: Cancel the ‘view mobile’s behavior information’ option.
Exceptions: 1. The view mobile’s behavior information page is not responding.
The table 2.9 is the ‘View information of types of malwares’ description table in which the
details of information shown to the mobile user are provided.
Normal Flow:
1. User: Mobile user selects the option to view information on the types of malware.
   System: The system displays all the common types of Android malware.
Alternative Flows: Cancel the ‘view information of types of malwares’ option.
Exceptions: 1. The view information of types of malwares page is not
responding.
The table 2.10 is the ‘View guidelines for mobile’s protection’ description table in which the
details of information shown to the mobile user are provided.
Description: Mobile user can view the guidelines for mobile’s protection
Trigger: Select the option to view the guidelines for mobile’s protection
Preconditions: Mobile user will be shown the option to view the guidelines for mobile’s
protection
Post conditions: Mobile user will be shown all the guidelines for mobile’s protection.
Normal Flow:
1. User: Mobile user selects the option to view the guidelines for mobile’s protection.
   System: The system displays all the guidelines for mobile’s protection.
Alternative Flows: Cancel the ‘view guidelines for mobile’s protection’ option.
Exceptions: 1. The view guidelines for mobile’s protection page is not
responding.
The table 2.11 is the ‘View cross-validation results’ description table. This description table
is regarding how the cross-validation results are shown to mobile user, researcher and
malware analyst.
Normal Flow:
1. User: The user selects the ‘cross validation’ option.
   System: The system displays a graph showing the results of static analysis, dynamic analysis, and the SVM, KNN, RF and Decision tree algorithms.
Alternative Flows: User logs out.
Exceptions: 1. The ‘cross validation’ option is not responding.
2.4.3. Malware Analyst Use-case Diagram
The use-case diagram for malware analyst is shown below in figure 2.3.
Figure 2. 3 - Malware Analyst use case diagram
The table 2.12 is the ‘View dataset’ description table. This description table is regarding how
malware analyst and researcher can view the complete dataset.
Table 2. 12 - Use Case 9 View dataset
The table 2.13 is the ‘View correlation of features’ description table. This description table is
regarding how malware analyst can view the correlation between different features of the
dataset.
The table 2.14 is the ‘Sort by correlation’ description table. This description table is regarding
how malware analyst can sort the features by correlation.
The table 2.15 is the ‘Prune dataset’ description table. This description table is regarding how
malware analyst can prune the features of the dataset.
The table 2.16 is the ‘Select rows’ description table. This description table is regarding how
malware analyst can select rows during pruning of the dataset.
Exceptions: 1. The system is not responding.
2. The database isn’t responding.
The table 2.17 is the ‘Select columns’ description table. This description table is regarding how
malware analyst can select columns during pruning of the dataset.
The table 2.18 is the ‘Save pruned dataset’ description table. This description table is
regarding how malware analyst can save the dataset after pruning.
click on save dataset. his/her system in the form of .csv.
Alternative Flows: Malware analyst logs out.
Exceptions: 1. The system is not responding.
2. The database is not responding.
The table 2.19 is the ‘Upload csv’ description table. This description table is regarding how
malware analyst can upload the csv to apply ML algorithms.
Normal Flow:
1. Malware analyst: The malware analyst clicks on ‘upload a csv’.
   System: A csv shall be uploaded.
The table 2.20 is the ‘Select classification algorithm’ description table. This description table
is regarding how malware analyst can select ML algorithms to apply them on the dataset.
Trigger: Select classification algorithm button
Preconditions: The user must be logged-in as malware analyst and select any classification
algorithm to be applied on the dataset.
Post conditions: The results of the ML algorithm on the dataset shall be shown.
Normal Flow:
1. Malware analyst: The malware analyst clicks on ‘select classification algorithm’.
   System: The names of all the algorithms provided by the system shall be displayed.
2. Malware analyst: The malware analyst clicks on any classification algorithm.
   System: The results of the applied machine learning algorithm shall be shown.
Alternative Flows: Malware analyst logs out.
Exceptions: 1. The system is not responding.
2. The database is not responding.
The table 2.21 is the ‘Tune ML algorithms’ description table. This description table is
regarding how malware analyst can tune the hyper parameters of the ML algorithms.
Description: Malware analyst can tune the ML algorithms. Tuning involves changing
the default value of a variable, changing the amount of testing and
training data etc.
Trigger: Tune dataset button
Preconditions: The user must be logged-in as malware analyst and must select a
classification algorithm first.
Post conditions: The ML algorithm shall be tuned.
Normal Flow:
1. Malware analyst: The malware analyst selects a classification algorithm.
   System: A classification algorithm shall be selected.
2. Malware analyst: The malware analyst selects the ‘tune the algorithm’ option.
   System: The algorithm will be tuned according to the requirements of the malware analyst.
Alternative Flows: Malware analyst logs out.
Exceptions: 1. The system is not responding.
2. The database is not responding.
The table 2.22 is the ‘View accuracies’ description table. This description table is regarding
how malware analyst and researcher can view the accuracies of implemented ML algorithms.
Actors: Malware analyst and researcher
Description: Malware analyst and researcher can view the accuracies of the
implemented ML algorithms.
Trigger: View accuracies button
Preconditions: The user must be logged-in as malware analyst or researcher
Post conditions: The accuracies for the implemented ML algorithms shall be shown.
Normal Flow:
1. User: The malware analyst or researcher clicks on the view accuracies option.
   System: The accuracies, with the complete results of the ML algorithms, shall be shown.
Alternative Flows: Malware analyst or researcher logs out.
Exceptions: 1. The system is not responding.
2. The database is not responding.
The table 2.23 is the ‘View visual results’ description table. This description table is
regarding how malware analyst and researcher can view the visual results in the form of
graphs.
The table 2.24 is the ‘Take guidance from manual’ description table. This description table is
regarding how malware analyst and researcher can view the manual for help.
Use Case Name: Take guidance from manual
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 20/10/2022
Last Revision Date: 20/10/2022
Normal Flow:
1. User: The user selects the ‘manual’ option.
   System: The system displays a manual, which differs for the two types of user.
The table 2.25 is the ‘View correlation graph’ description table. This description table is
regarding how malware analyst can view the correlation graph of all the features with the class label.
Use Case ID: Uc23
Description: Malware analyst can view the correlation graph of all the features with the
class label.
Trigger: Select the option ‘view correlation’.
Normal Flow:
1. User: The user selects the ‘dataset’ option.
   System: The system displays a drop-down with further options.
The table 2.26 is the ‘View dynamic accuracy graph’ description table. This description table
is regarding how malware analyst can view the graphs during tuning of the hyperparameters.
Description: Users can see the accuracy graphs while hyper parameter tuning.
2.4.4. Researcher Use-case Diagram
The table 2.27 is the ‘View dataset details’ description table. This description table is
regarding how researcher can view the details of CIC-Maldroid 2020 dataset.
Actors: Researcher
Description: Researcher can see the details of the dataset used.
Trigger: Select View Dataset Details option.
Preconditions: Researcher will click on the View Dataset Details option to get an overview
of the dataset used.
2.5 System Sequence Diagrams
The system sequence diagrams (SSDs) are shown below. Figure 2.5 is the SSD for viewing the
general report.
Figure 2. 6 - SSD request to view individual report
Figure 2. 8 - SSD admin request sign out
Figure 2.10 is the SSD for mobile user sign in.
Figure 2. 12: SSD to display malware types
Figure 2.14 is the SSD for the mobile user’s request to view protection guidelines.
Figure 2.15 is the SSD for the mobile user’s request to view the user manual.
Figure 2. 17: SSD for researcher sign in
Figure 2.19 is the SSD for the researcher’s request to view results of algorithms.
Figure 2.20 is the SSD for the researcher’s request to view the correlation graph.
Figure 2.21 is the SSD for the researcher’s request to display the dataset.
Figure 2.22 is the SSD for the researcher’s request to view cross validation results.
Figure 2. 22: SSD Researcher request to view cross validation results
Figure 2. 24: SSD Malware Analyst sign up
Figure 2. 25: SSD Malware Analyst request to sign in
Figure 2.27 is the SSD for the Malware Analyst’s request to display the pruned dataset.
Figure 2.28 is the SSD for Malware Analyst sign out.
Figure 2. 29: SSD Malware Analyst request to display results of algorithms
graph.
Figure 2. 30: SSD Malware Analyst request to view correlation graph
Figure 2.31 is the SSD for the Malware Analyst’s request to select a csv file.
Figure 2. 32: SSD Malware Analyst request to tune the selected algorithm
Figure 2.33 is the SSD for the Malware Analyst’s request to apply a machine learning
algorithm.
Figure 2. 33: SSD Malware Analyst request to apply machine learning algorithm
Figure 2.34 is the SSD for the Malware Analyst’s request to view cross validation
results.
Figure 2. 34: SSD Malware Analyst request to view cross validation results
Figure 2.35 is the SSD for the Malware Analyst’s request to view the user manual.
2.6. Domain Model
The domain model contains the main entities of the system. The entities of our system are
user, dataset, APK, and static & dynamic results. These entities have different relations
amongst them.
Chapter 3
System Design
In this chapter we define our system’s modules, processes, data, interfaces and software
architecture. We discuss the software's architecture, the communication of external entities
(users) with our system, and the flow of data between the database, processes and users.
The database design is also derived from the selected functional requirements.
Presentation layer
Data layer
This layer is concerned with interface display and presentation. It exposes the services
provided by the system, such as pruning the dataset and viewing the whole dataset.
This layer covers the implementation related to the user interface, such as page transitions
and control through buttons. It handles the page transitions for each type of user after
login and in response to other actions or queries.
This layer handles the logic behind the different actions. The business logic is implemented
by dedicated processes, and specific modules are written for this purpose.
3.1.4. Data layer
This layer deals with data storage, access and distribution among the different users. All
system activities use the data layer to retrieve or access data. For example, to view the
dataset, the data layer is used to fetch the data from the database.
Figure 3. 1: Software Architecture
3.2. Data Flow Diagrams
We use a modular approach to design the software. We have designed data flow
diagrams, which are used to show the flow of data among external users and system modules.
The data flow diagrams are designed up to three levels.
3.2.1. Level-0
At this level we treat our system as a black box and show user interaction with the system.
Our system can be used by four different types of user: malware analyst, researcher, mobile
user, and administrator. Their interaction with the system is shown in the diagram below.
3.2.2. Level-1
In this level of DFD we have opened the system to some extent and shown a brief view of the
system. Here we have explained how a specific user will interact with the system and how the
system will respond after performing different procedures and activities.
3.2.3. Level 2
At this level we have explained the working of the system in detail. The diagram shows the
communication of processes with users in depth, covering almost the whole procedure by
which a user obtains specific information. For example, the malware analyst first performs
an authentication procedure, then selects whether to prune or view the dataset, and is shown
the corresponding results.
Figure 3. 4: Level 2 Data flow diagram
3.3. Entity Relation Diagram
The entity relationship diagram (ERD) is the first step in designing the database schema and
constructing the database. In later stages this ERD is refined and normalized, and the final
database schema is generated. The entity-relationship model (or ER model) is shown in
figure 3.4. Our database has three tables. The first is maintained for users and contains all
the information about a user, such as username, id, password, and email. The second table
holds the feature vector, which contains the selected features. The third table contains the
results of static and dynamic analysis.
3.4. Database Schema
The final design of our database, the database schema, is shown in figure 3.5; it was
generated after normalization. We create three separate tables in our database. The User
table has a one-to-many relationship with the static & dynamic analysis table.
The same holds between the User table and the dataset table.
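The three-table layout and its one-to-many relationships can be sketched as follows. This is only a minimal illustration: it uses Python's built-in sqlite3 module as a stand-in for the project's MySQL database, and the table names and every column other than id, username, password and email are assumptions rather than the actual schema.

```python
import sqlite3

# Hypothetical schema: table names and all columns other than id, username,
# password and email are assumptions made for illustration.
SCHEMA = """
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    password TEXT NOT NULL,
    email    TEXT NOT NULL
);
CREATE TABLE dataset (
    id       INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(id),
    features TEXT NOT NULL
);
CREATE TABLE analysis_results (
    id       INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(id),
    kind     TEXT NOT NULL CHECK (kind IN ('static', 'dynamic')),
    result   TEXT NOT NULL
);
"""


def create_schema(conn):
    """Turn on foreign-key enforcement and create the three tables."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)

# One user row can be referenced by many result rows (one-to-many).
conn.execute(
    "INSERT INTO users (id, username, password, email) VALUES (?, ?, ?, ?)",
    (1, "analyst1", "secret", "analyst1@example.com"),
)
conn.executemany(
    "INSERT INTO analysis_results (user_id, kind, result) VALUES (?, ?, ?)",
    [(1, "static", "benign"), (1, "dynamic", "malware")],
)
result_count = conn.execute(
    "SELECT COUNT(*) FROM analysis_results WHERE user_id = 1"
).fetchone()[0]
```

The foreign keys on user_id are what express the "many" side of each relationship: a result or dataset row cannot exist without a matching user.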
3.5. User Interface Design
The user interface is important because all user interaction with the system depends on it.
For this reason it is important to design a system that is simple and user-friendly. We have
designed a simple, user-friendly interface and provide proper guidelines to users for
interacting with the system.
Home Page
The home page explains how this site is useful for the different users.
In addition, the user can log in to his account, or sign up to create an account if he
is a new user.
Signup:
Every new user has to sign up to the system. Afterwards the user can log in and use the
system. The details required to sign up are username, password, email and role of
the user.
Figure 3. 8: Signup
Chapter 4
Software Development
In this chapter we explain the standards, protocols and modules we have used for the
development of our system. We explain how naming conventions, comments and
indentation have been used, describe the tools and database used for development,
and discuss all the modules of our system.
During the development of the software, coding standards were followed to make the
code more understandable and easier to edit in future. The indentation, declaration,
naming convention, and statement standards used while coding the project are described as
follows:
4.1.1. Indentation
Indentation refers to the whitespace (a single tab or four spaces) at the start of a line
that marks a block of code. Indentation is very important in Python as it serves more
purposes than just code readability. A group of statements with the same indentation level
(the same number of leading whitespace characters) is treated as one block by Python,
unlike languages such as C and C++, where curly brackets delimit a block of code.
Python uses colons along with indentation. A colon is placed where a new block of code is
introduced, and that block must be indented to the right. The interpreter raises an error if
we forget to indent the statements after the colon. The end of the block is specified by
unindenting the next line of code (which is not part of the block).
Example from code:
Figure 4. 1: Indentation example from code
When writing a function, a colon is used and the block of code, which can consist of single
or compound statements, is indented to the right, so Python doesn’t require curly braces.
The same applies to ‘if statements’: after the expression, which may be written in
parentheses, a colon is placed and the block of code is indented to the right.
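The rules above can be illustrated with a short sketch. The function name and the 0.5 threshold are hypothetical and are not taken from the project code; the second part compiles a deliberately unindented snippet to show the error the interpreter raises.

```python
def classify_score(score):
    # The colon after the signature opens the function body,
    # which is indented by four spaces.
    if (score >= 0.5):
        # Statements at this deeper level belong to the 'if' block.
        label = "malware"
    else:
        label = "benign"
    # Back at the function-body level: the if/else block has ended.
    return label


# A body that is not indented after the colon is rejected at compile time.
bad_source = "def f():\nreturn 1\n"
error_name = None
try:
    compile(bad_source, "<example>", "exec")
except IndentationError as err:
    error_name = type(err).__name__
```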
The part of our code shown below, which designs a push-button, uses compound statements.
Each line contains one statement. Note that the start and end of this design contain
opening and closing parentheses respectively.
Python Comments
Comments are considered good practice in coding because they are written in human-readable
language and so provide a better understanding of the code. Moreover, well-commented
code makes finding bugs easier and helps with editing. We have used comments
for easy understanding, editing, and debugging of the code. We have added comments above
each module and in other places. We have followed the pattern shown below.
4.1.3. Naming Convention
A naming convention is a set of rules for choosing the names of variables, methods and
classes. Naming conventions reduce the effort needed to understand the code. We used full
English descriptors that accurately describe the variable, method, or class, for example
view_dataset and user instead of names like a1 and b1. Mixed case is used to make names
readable: names are generally in lower case, with the first letter of class names and
interface names capitalized. The table below shows our naming style.
Push Button (btn_name): Names of all push buttons consist of the keyword btn, an underscore and then a descriptive name, e.g. btn_login, btn_logout.
Label (lb_name): Labels are used to label input text; their names look like lb_userName.
Input box (inp_name): Input boxes get input from the user. Sample input box names are inp_passw and inp_HyperParam1.
Combo box (combo_name): Combo boxes are used for different purposes; sample names are combo_selectAlgo and combo_role.
Table (tbl_name): Tables are used to view the dataset, reports etc. Sample names for tables are tbl_prune and tbl_indiviReport.
Tab Widget (T_name): A tab widget is used for multiple pages in the same window, so tab widgets are named like T_Admin, T_Reser etc.
Vertical layout (VerL_name): Vertical layouts are used to lay out items vertically and are named like VerL_Mob.
Grid layout (gridL_name): Grid layouts are used to lay out items in a tabular form. These layouts are named like gridL_Admin.
Frame (F_name): Frames are used to design the side bar and to display graphs. Frame names look like F_ReserManue, F_AdminManue.
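As a small illustration of the prefixes in the table, the sketch below checks whether a widget name follows the convention. The prefixes are taken from the sample names in the table; the dictionary and the helper function are hypothetical and are not part of the project code.

```python
# Prefixes taken from the sample names in the naming table; the dictionary
# and the checker itself are illustrative assumptions.
WIDGET_PREFIXES = {
    "push button": "btn_",
    "label": "lb_",
    "input box": "inp_",
    "combo box": "combo_",
    "table": "tbl_",
    "tab widget": "T_",
    "vertical layout": "VerL_",
    "grid layout": "gridL_",
    "frame": "F_",
}


def follows_convention(widget_type, name):
    """Return True if name starts with the prefix for widget_type
    and has a non-empty descriptive part after the prefix."""
    prefix = WIDGET_PREFIXES[widget_type]
    return name.startswith(prefix) and len(name) > len(prefix)
```

For example, follows_convention("push button", "btn_login") holds, while a label named userName would be rejected because it lacks the lb_ prefix.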
We have used the Spyder IDE for Python as our development environment. We found it easy
to understand and well suited to machine learning work. As most of our project is
based on machine learning and deep learning, Spyder suited our coding needs.
Moreover, Spyder is lightweight and comes with many libraries available. The
alternatives were PyCharm and Jupyter Notebook. We didn’t use PyCharm as it is quite
heavy, and we found Spyder better for implementing machine learning algorithms. Jupyter
Notebook is not a good option for software development even though it is good for machine
learning algorithms. For the interface design of our software, we have used the PyQt
library, version 6. It is widely used for interface design, which is why we preferred it
over other options such as Tkinter and Kivy.
Our system requires a database to store users’ details and datasets, so good database
management is needed. We have selected the MySQL database
management system.
MySQL:
4.4.1. Authentication module
Sign-up
A user needs to sign up if he/she is using the system for the first time.
Input:
To sign up, a user needs to enter his name, password and email, and select a role from the
drop-down menu. A username that is already used by another user cannot be reused, so
the user must enter a unique username. The email entered by a user should be valid; for
instance, it shouldn’t be like “ali@” or “ali.com”. Another constraint is
that the username, password and email cannot exceed 20, 16, and 30 characters
respectively.
Output:
If a user enters a username that is already registered in the database, a data server
error is shown on the screen, indicating that the username was not uploaded to the
server.
If a user enters an email that is not valid, for example one that does not contain “@” or
“.com”, a data server error is shown for it.
If the lengths of the username, password or email exceed 20, 16, and 30 characters
respectively, a data server error is shown. No error is shown if the username
is unique, the email is valid and the lengths of the username, password and email do not
exceed 20, 16, and 30 respectively; the system then shows the login page.
The image of the interface below shows the error output when an
already registered name is used as a username.
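The sign-up constraints described above can be sketched as a single validation function. This is a minimal sketch, not the project's implementation: the function name, the error messages and the simplified email pattern (something@something.something) are assumptions; only the length limits of 20, 16 and 30 and the uniqueness rule come from the text.

```python
import re

MAX_USERNAME_LEN = 20
MAX_PASSWORD_LEN = 16
MAX_EMAIL_LEN = 30

# Simplified validity check (an assumption): text, '@', text, '.', text.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(username, password, email, existing_usernames):
    """Return a list of error messages; an empty list means the input is valid."""
    errors = []
    if username in existing_usernames:
        errors.append("username is already registered")
    if len(username) > MAX_USERNAME_LEN:
        errors.append("username exceeds 20 characters")
    if len(password) > MAX_PASSWORD_LEN:
        errors.append("password exceeds 16 characters")
    if len(email) > MAX_EMAIL_LEN:
        errors.append("email exceeds 30 characters")
    if not EMAIL_PATTERN.match(email):
        errors.append("email is not valid")
    return errors
```

Returning a list rather than stopping at the first problem lets the interface report every failed constraint at once.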
LOGIN
If the user already has an account on the system, he/she has to log in to be authenticated
and access the system. For this purpose we have implemented a login option; the user is
also directed to the login page
after signing up.
Input:
To log in, the user has to enter his username and password. According to the
role of the user, he will be directed to the corresponding page.
Output:
If a user enters wrong credentials, such as a wrong password or a username that
does not exist, an error message will pop up showing “Invalid
credentials”. We do not tell the user whether the username or the password is
incorrect, because revealing this would be harmful from a security perspective.
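A minimal sketch of this behaviour follows. The in-memory accounts dictionary, the function name and the return shape are assumptions made for illustration; the point is that an unknown username and a wrong password produce exactly the same message, and a successful login returns the role used to pick the next page.

```python
import hmac

GENERIC_ERROR = "Invalid credentials"


def login(username, password, accounts):
    """accounts is a hypothetical mapping: username -> (password, role).

    Returns (role, None) on success and (None, GENERIC_ERROR) otherwise,
    without revealing which of the two fields was wrong.
    """
    record = accounts.get(username)
    stored_password = record[0] if record else ""
    # compare_digest avoids leaking information through comparison timing.
    password_ok = hmac.compare_digest(stored_password.encode(), password.encode())
    if record is None or not password_ok:
        return None, GENERIC_ERROR
    return record[1], None
```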
Prune dataset
We have provided the option of dataset pruning to malware analysts. A malware analyst can
apply machine learning algorithms to the original dataset or to a dataset he has pruned
himself. The pruning functionality lets the malware analyst prune the dataset by column
or by row.
Input:
The malware analyst enters the number of rows by which he/she wants to prune the
dataset; the number of rows is selected with a spin box. To prune the dataset by
column, the malware analyst selects column names, and the dataset is pruned to the
features of his/her own choice.
Output:
The output of this module is the dataset after pruning, displayed in the form of a
table. We have enforced constraints; for example,
the user can’t select more rows than the size of the dataset.
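Pruning by row and by column, together with the row-count constraint, can be sketched as below. The dataset is represented as a plain list of dictionaries, and the function name, the sample feature names and the choice to keep the first n rows are all assumptions; the project's own module may select rows differently.

```python
def prune_dataset(rows, n_rows=None, columns=None):
    """Return a pruned copy of rows (a list of dicts, one dict per sample).

    n_rows  -- keep only the first n_rows samples (assumed interpretation)
    columns -- keep only the named features
    """
    if n_rows is not None:
        # Constraint from the text: no more rows than the dataset holds.
        if not 0 <= n_rows <= len(rows):
            raise ValueError("number of rows must be between 0 and the dataset size")
        rows = rows[:n_rows]
    if columns is not None:
        rows = [{name: row[name] for name in columns} for row in rows]
    return rows


# Hypothetical sample data for illustration only.
sample = [
    {"feature_a": 1, "feature_b": 0, "label": "benign"},
    {"feature_a": 0, "feature_b": 1, "label": "malware"},
    {"feature_a": 1, "feature_b": 1, "label": "malware"},
]
pruned = prune_dataset(sample, n_rows=2, columns=["feature_a", "label"])
```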
View dataset
An option to view the dataset is provided to the malware analysts and the researchers, so
that they can view the dataset on which they will run machine learning algorithms.
Input:
After selecting the role, the user (Malware analyst/Researcher) will click on the view
dataset button.
Output:
The output of this module will be a dataset table opened in a new window after the
user has pressed the view dataset button. Dataset will be fetched from the database
and if the database is not linked in the backend, then an error message will appear.
In this module, the admin is provided with the option to generate reports to get an
idea of how many people are using the system. Two different types of reports can be
generated. Only the administrator can access this level of information.
Input:
The input required to generate a report varies: the admin first has to
select whether he requires a report of an individual user, of a specific group of
users, or of all users.
Figure 4. 8: Individual report generation
Output:
According to the selected option, a list of users with all their details will be displayed.
The figure shows the report of all users who are malware analysts.
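The three report options (one user, a group of users, all users) amount to filtering the user records. The sketch below assumes that a group is defined by role and that user records are dictionaries with username and role keys; both are assumptions, as is the function name.

```python
def generate_report(users, username=None, role=None):
    """Return the user records matching the admin's selection.

    username -- report for an individual user
    role     -- report for a group of users (assumed to be grouped by role)
    neither  -- report for all users
    """
    if username is not None:
        return [user for user in users if user["username"] == username]
    if role is not None:
        return [user for user in users if user["role"] == role]
    return list(users)


# Hypothetical records for illustration only.
users = [
    {"username": "ali", "role": "Malware analyst"},
    {"username": "sara", "role": "Researcher"},
    {"username": "omar", "role": "Malware analyst"},
]
analyst_report = generate_report(users, role="Malware analyst")
```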
Figure 4. 9: Report of selected users
4.4.4 Get dataset Module
The dataset is required for several tasks, such as viewing the dataset, implementing machine
learning algorithms, and pruning the dataset. Instead of repeating the same code in each of
these modules, a separate module has been created for this purpose. This module fetches the
dataset from the database and returns it.
Output:
4.4.5. Apply ML algorithm module
This module takes the user's choice of machine learning algorithm and calls the related
algorithm module with its parameters. When the user only selects an algorithm, the
corresponding function is called with default parameters.
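One possible way to organise such a dispatcher is sketched below. The function names (run_svm, run_knn, apply_algorithm), the default values and the returned strings are placeholders we introduce for illustration; in the real system each function would train a model and return its results.

```python
# Placeholder trainers: hypothetical names and defaults, for illustration only.
def run_svm(dataset, C=1.0, gamma="scale"):
    return f"SVM on {len(dataset)} rows (C={C}, gamma={gamma})"


def run_knn(dataset, n_neighbors=5):
    return f"KNN on {len(dataset)} rows (n_neighbors={n_neighbors})"


# Map the name chosen by the user to the module function that handles it.
ALGORITHMS = {"SVM": run_svm, "KNN": run_knn}


def apply_algorithm(name, dataset, **params):
    """Call the selected algorithm; omitted parameters fall back to defaults."""
    try:
        trainer = ALGORITHMS[name]
    except KeyError:
        raise ValueError(f"unsupported algorithm: {name}") from None
    return trainer(dataset, **params)


rows = [1, 2, 3]  # stand-in for the fetched dataset
print(apply_algorithm("KNN", rows))
print(apply_algorithm("SVM", rows, C=10))
```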
SVM module
The SVM module implements the support vector machine. It takes three parameters: dataset, C,
and gamma. Dataset is the data to train on; C and gamma are hyper parameters used for tuning
the algorithm.
The RF module implements random forest. It takes seven parameters: one of them is the
dataset and all the others are hyper parameters of random forest.
The DT module implements the decision tree. It takes seven parameters: dataset, max_depth,
min_split, max_leaf, min_leaf, n_estimators and max_feature. These parameters are passed for
tuning the decision tree algorithm.
This module gives the user the opportunity to tune the hyper parameters of a machine
learning algorithm. The user first selects a machine learning algorithm, then enters values
for the hyper parameters and clicks the apply button. The accuracy of the tuned model is
displayed along with a line graph. Tuning of the Decision Tree algorithm is shown in the
figure below.
Figure 4.18: Editing Hyper Parameters in Tuning
Chapter 5
Software Testing
It is important to check the performance and usability of a system. For this reason we have
performed software testing. Testing is performed for all modules of our system, such as
signup, sign in, and prune dataset.
to be aware of the source code or development method.
We are using black-box testing because it is well suited to checking the behavior of the
system: it is not concerned with the code logic or development method, only with what users
input to the system and what output is produced for that input. In black-box testing, we
test the system against the pre-defined requirements.
Inputs:
1- username= Fatima, password= fatima12, email= [email protected], role= researcher
2- username= Ahsan, password=12345, email= ahsan@, role= malware analyst
3- username= Fatima, password= fatima32, email= [email protected], role= mobile user
4- username= Alia, password= alia123Custuni1234567789, email= [email protected], role=
researcher
Expected Results:
1. Successfully signed up
2. Data server error as email is invalid
3. Data server error as name Fatima is not unique
4. Data server error as the password exceeds 16 characters
Actual Results:
1. Passed
2. Passed
3. Passed
4. Passed
Description:
The first input is valid: the name Fatima is unique, the password does not exceed 16
characters, and the email is in the correct format. Thus, the user is successfully signed
up. The record is stored in the database and the user is directed to the sign-in page.
Figure 5.1: Successful test case signup
In the 2nd input, the email is invalid, so a data server error is shown.
In the 3rd input, the name Fatima is already taken and therefore not unique, so a data
server error is shown.
In the 4th input, the password exceeds the length of 16, so a data server error is shown.
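The three rules exercised by this test case (unique username, password of at most 16 characters, well-formed email) can be sketched as a small validation function. This is our own illustration, not the system's code: the function name, the deliberately simple email pattern and the example.com addresses are assumptions.

```python
import re

MAX_PASSWORD_LENGTH = 16
# Deliberately simple pattern: something@something.tld (an assumption, not
# the pattern used by the real system).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(username, password, email, existing_usernames):
    """Return a list of error messages; an empty list means the input is valid."""
    errors = []
    if username in existing_usernames:
        errors.append("username is not unique")
    if len(password) > MAX_PASSWORD_LENGTH:
        errors.append("password exceeds 16 characters")
    if not EMAIL_PATTERN.match(email):
        errors.append("email is invalid")
    return errors


registered = set()
first = validate_signup("Fatima", "fatima12", "fatima@example.com", registered)
registered.add("Fatima")  # the first signup succeeded
second = validate_signup("Ahsan", "12345", "ahsan@", registered)
third = validate_signup("Fatima", "fatima32", "fatima@example.com", registered)
fourth = validate_signup("Alia", "alia123Custuni1234567789", "alia@example.com", registered)
print(first, second, third, fourth, sep="\n")
```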
Inputs:
1- User name= Nida, password=12345
2- User name= Fatima, password= fatma45
Expected Results:
Actual Results:
1- Passed
2- Passed
Description:
The user Nida has not signed up and is trying to sign in directly, which is why an error is
shown.
Figure 5.5: Test case Login invalid Credentials Error
The password for the user Fatima is actually “fatima12”, but a wrong password is entered
here, which is why an invalid credentials error is shown.
Inputs:
1- Button “Malware Types” pressed
2- Button “Mobile’s Behavior” pressed
3- Button “Mobile’s Protection” pressed
4- Button “Dataset details” pressed
Expected Results:
1- Information on types of malware will be displayed
2- Information on suspicious behavior of mobile will be displayed
3- Information on five simple ways to protect the phone will be displayed
4- Information on dataset details will be displayed
Actual Results:
1- Passed
2- Passed
3- Passed
4- Passed
Description:
A mobile user can view different types of information. When the user presses the button
“Malware Types”, the system should display the information on types of malware. If the
system correctly shows this information, the test is passed. The same applies to viewing
information on mobile behavior, mobile protection, and dataset details.
Figure 5.7: Test case View information 1
Figure 5.8: Test case View information 2
Figure 5.9: Test case View information 3
Input:
1- Button “View Dataset” pressed
Actual Result: passed
Description:
After pressing the view dataset button, the dataset will be displayed in a new window.
Inputs:
1- username = Ali
2- username = Ahmed
Expected Results:
1- Details of user “Ali” will be displayed
2- Error will occur as no such user exists
Actual Results:
1- Passed
2- Passed
Description:
Individual reports are generated by the admin. The admin provides the username of the
individual and that user's details are displayed. For the first input, the user’s details
are displayed as shown in the figure.
For the second input, an error pops up displaying “User Not Exist”, as no user with this
username exists in our system. The output displayed by our system is shown in the figure
below.
Figure 5.12: Test case individual report, user not exist error
Inputs:
1- option = Malware Analyst
2- option = Researcher
3- option = Mobile User
Expected Results:
1- List of all Malware analysts will be displayed
2- List of all Researchers will be displayed
3- List of all Mobile users will be displayed
Actual Results:
1- passed
2- passed
3- passed
Description:
According to the selected option, the relevant details of the individuals are displayed. For
the first input, all malware analysts are displayed as shown in the figure. For the second
input, the data of all researchers is displayed as shown in the figure. For the input Mobile
User, all records related to mobile users are displayed as shown in the figure.
Figure 5.14: Test case group report generation 2
Inputs:
1- Request to save csv by clicking prune button.
Expected Results:
1-Confirmation message that file has been saved
Actual Results:
1- passed
Description:
Malware analysts have the option to save a .csv file after completing the pruning so that
they can apply ML algorithms to it later. When the malware analyst requests to save the csv
file and the file is saved, the message “file saved successfully” is displayed as
confirmation.
System: Malware Analyser
Inputs:
1- browse file
Expected Results:
1-Confirmation message that file has been uploaded
Actual Results:
1- passed
Description:
Malware analysts have the option to upload a .csv file which they have already pruned and
saved. As soon as a user clicks the browse file option, a file explorer window appears for
selecting the csv file.
Table 5.9: Test Case Upload csv
Version: 1 Test Type: Unit testing
Inputs:
1- option = KNN
2- option = Naive Bayes
3- option = SVM
4- option = Decision Tree
5- option = Random Forest
Expected Results:
1- Confusion matrix for KNN
2- Confusion matrix for Naive Bayes
3- Confusion matrix for SVM
Actual Results:
1- passed
2- passed
3- passed
Description:
Malware analysts have the option to run machine learning algorithms on the dataset they have
pruned themselves. When they select an algorithm and click the button to apply it, the
confusion matrix is displayed.
Figure 5.17: Test case implement ML algorithms Random Forest
Chapter 6
In this chapter we discuss the tools and techniques used for static and dynamic analysis. We
have performed static and dynamic analysis on 500 APKs, which are a subset of the selected
dataset. The purpose of this analysis is to cross validate its results against the ones we
get from machine learning algorithms.
APK is the Android Package format used by the Android operating system and a number of other
Android-based operating systems for the distribution and installation of mobile applications,
mobile games, and middleware. These files have the .apk extension.
They are typically downloaded from the Google Play store and are packaged in ZIP format.
The contents of an APK file include the AndroidManifest.xml, classes.dex, and resources.arsc
files, as well as the META-INF and res folders.
AndroidManifest.xml – This XML file contains the metadata of the Android application. This
includes the package name, activity names, main activity (the entry point to the app),
Android version support, hardware features support, permissions, and other configurations.
The Android manifest file also describes how the components of the application interact. The
four main components are: activity, which handles user interaction with the smartphone
screen; service, which handles background processes associated with an application;
broadcast receiver, which handles communication between the OS and applications; and
content provider, which handles the application's data. These components communicate using
messages called intents.
classes.dex – This file contains the Java code, converted from Java Virtual
Machine-compatible .class files to the Dalvik-compatible .dex (Dalvik Executable) format
before installation on a device, so that it can be executed by the Android runtime.
resources.arsc – This binary file contains the list of the program’s compiled resources and
their IDs. These resources include layouts, images, strings, styles, etc.
assets – This directory contains application assets. Assets are the only way to access data
(text, music, XML, fonts) in raw form.
res – It contains all the resources that are not compiled into resources.arsc.
META-INF – It is a directory with APK metadata such as the signatures.
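Because an APK is a ZIP archive, its contents can be listed with any ZIP library. The sketch below uses Python's standard zipfile module; to stay runnable without a real APK it builds an empty stand-in archive in memory, so the entry names are only examples of the layout described above.

```python
import io
import zipfile


def list_apk_entries(apk_file):
    """Return the names of all entries inside an APK (a ZIP archive)."""
    with zipfile.ZipFile(apk_file) as apk:
        return apk.namelist()


# Build a stand-in archive in memory so the sketch runs without a real APK.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_apk:
    for name in ("AndroidManifest.xml", "classes.dex", "resources.arsc",
                 "META-INF/MANIFEST.MF", "res/layout/main.xml"):
        fake_apk.writestr(name, b"")
buffer.seek(0)

entries = list_apk_entries(buffer)
print(entries)
```

With a real file, list_apk_entries can be given the path to the .apk instead of the in-memory buffer.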
Following is the activity diagram for static analysis of Android, showing all the activities
performed.
Figure 6.2: Activity diagram for static analysis
1. APK Tool
When we try to analyze the AndroidManifest.xml file by double-clicking it, it either opens
in an unreadable format or does not open at all and an error message pops up, because the
manifest inside an APK is stored in a binary XML format. The APK file therefore needs to be
decoded by the APK tool.
The APK tool isn’t present in Kali Linux by default, so it has to be installed using the
command “apt-get install apktool”.
Once the tool is installed, write the command “apktool d nameofapkfile” on the terminal.
Figure 6.4: apk tool command
A new folder will be created with the same name as the APK file. By opening the
AndroidManifest.xml file in the newly created folder, we can view it in a readable format.
2. JD-GUI
JD-GUI is a standalone graphical utility that displays the Java source code of “.class”
files. The APK file has a classes.dex file which contains the compiled Java code. We need to
convert and decompile this file in order to view the code of the Java classes. For that,
write the command “d2j-dex2jar nameofapkfile” on the terminal.
A new file with a .jar extension will be created. Open this file in the JD-GUI tool (a Java
decompiler) to view all the classes in a human-readable format.
6.1.4. Techniques
We can search for dangerous keywords in the Java classes to check whether the application
may be malicious. Following is the list of keywords we treat as suspicious.
● admin
● camera
● GET
● POST
● https
● HTTP
● audio
● SQL
● address
● monitor
● send
● ACTION_CALL
● MMS
● ACTION_SEND
● ftp
● SMS
● socket
By opening the .jar file in JD-GUI, we can search the code for any of these keywords.
For instance, the above .jar file contains the keyword address along with other keywords
from the list.
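We did this search manually in JD-GUI, but the same check could be automated once the decompiled source is available as text. The sketch below is an assumption-laden illustration: the function name and the sample Java fragment are made up, and the match is a plain case-sensitive substring test, which is noisy (for example, "send" also matches "sendTextMessage").

```python
SUSPICIOUS_KEYWORDS = [
    "admin", "camera", "GET", "POST", "https", "HTTP", "audio", "SQL",
    "address", "monitor", "send", "ACTION_CALL", "MMS", "ACTION_SEND",
    "ftp", "SMS", "socket",
]


def find_keywords(source_text, keywords=SUSPICIOUS_KEYWORDS):
    """Return the keywords that occur in the text (case-sensitive substrings)."""
    return [keyword for keyword in keywords if keyword in source_text]


# Made-up fragment standing in for decompiled Java code.
sample_source = """
String address = "http://example.invalid";
SmsManager.getDefault().sendTextMessage(address, null, body, null, null);
"""
hits = find_keywords(sample_source)
print(hits)
```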
2. Dangerous Permissions search
Android permissions can pose a huge threat if they are granted to malicious applications.
Following is the list of permissions that can be used to perform malicious activities. During
static analysis, the following permissions were searched in the Manifest.xml file of an APK.
ACCESS_BACKGROUND_LOCATION
ACCESS_COARSE_LOCATION
ACCESS_FINE_LOCATION
ACCESS_MEDIA_LOCATION
ACTIVITY_RECOGNITION
ANSWER_PHONE_CALLS
BODY_SENSORS
CALL_PHONE
CAMERA
GET_ACCOUNTS
MODIFY_PHONE_STATE
INSTALL_PACKAGES
PROCESS_OUTGOING_CALLS
READ_CALENDAR
READ_CALL_LOG
READ_CONTACTS
READ_EXTERNAL_STORAGE
READ_PHONE_NUMBERS
READ_PHONE_STATE
READ_SMS
RECEIVE_MMS
RECEIVE_SMS
RECEIVE_WAP_PUSH
RECORD_AUDIO
SEND_SMS
USE_SIP
WRITE_CALENDAR
WRITE_CALL_LOG
WRITE_CONTACTS
WRITE_EXTERNAL_STORAGE
WRITE_APN_SETTINGS
WRITE_SETTINGS
The folder created by the APK tool contains a readable AndroidManifest.xml file. We can open
the manifest file and read the permissions it requests.
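Reading the permissions by eye can also be scripted once the manifest has been decoded to plain XML. The following is a minimal sketch using Python's standard XML parser; the function name, the shortened permission set and the sample manifest are our own assumptions, and the code does not work on the binary manifest inside an undecoded APK.

```python
import xml.etree.ElementTree as ET

# Attribute names in the manifest live in the Android XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
# Shortened subset of the permission list above, for illustration.
DANGEROUS = {"READ_SMS", "SEND_SMS", "CAMERA", "READ_CONTACTS", "RECORD_AUDIO"}


def dangerous_permissions(manifest_xml):
    """Return the listed permissions requested in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    found = []
    for node in root.iter("uses-permission"):
        full_name = node.get(ANDROID_NS + "name", "")
        short_name = full_name.rsplit(".", 1)[-1]
        if short_name in DANGEROUS:
            found.append(short_name)
    return found


# Made-up manifest standing in for apktool output.
sample_manifest = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.READ_SMS"/>
  <uses-permission android:name="android.permission.CAMERA"/>
</manifest>"""
requested = dangerous_permissions(sample_manifest)
print(requested)
```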
MobSF
MobSF (Mobile Security Framework) is a malware analysis and security assessment tool that
can perform dynamic analysis on Android applications. An advantage of MobSF is that it is
more recent than conventional dynamic analysis tools, and we consider it more powerful
because most dynamic analysis techniques can be applied through this one tool. The
developers of MobSF maintain documentation that covers every step from the installation
process to the dynamic analysis part.
Demonstration
After downloading all the required software mentioned in the documentation, we have to set
up a virtual machine to run Android. Many VMs are available for this purpose, but we chose
Genymotion as it is recommended by the MobSF developers.
First, we opened MobSF and uploaded the APK on which we wanted to perform dynamic analysis.
Figure 6.9: Dynamic Analyzer in MobSF
MobSF loaded the APK and then we clicked on the dynamic analyzer. After clicking, a new
window opens showing the live screen of the Android device. The main thing we are concerned
with in dynamic analysis is the activity section. We can choose an activity from the
activity option and then click on start activity.
After starting the main activity, nothing showed up on the screen, but when we clicked on
the second activity, a window appeared asking the user to enter credit card details. From
this we can conclude that it is a banking malware app.
Figure 6.11: Activity Running
Chapter 7
Heuristic techniques perform better than static and dynamic techniques for malware analysis,
so to implement heuristic-based techniques we need a malware dataset for training machine
learning algorithms. ML algorithms perform better on large, good-quality datasets. To obtain
such a dataset, we studied many datasets and, after comparing them, selected the
CICMalDroid-2020 dataset. All work on this dataset is discussed in this chapter.
We use machine learning algorithms for the analysis of Android-based malware. To obtain good
results from machine learning algorithms, a large, good-quality dataset is required so that
the algorithm learns well and improves its results [17].
The dataset selected for this work is CICMalDroid-2020 which has malware and benign
samples. The dataset is publicly available on the University of New Brunswick site [18].
The data in the selected dataset was generated by collecting APKs from several sources,
including the VirusTotal service, the Contagio security blog, AMD, and other datasets used
by recent research contributions, and then running them on a VMI-based dynamic analysis
system known as CopperDroid [19]. Initially 17,850 APKs were collected, but when these
samples were analyzed inside a virtual environment (i.e. virtual machines), around 5,000
samples were damaged or failed to run and could not be analyzed. For this reason the
remaining dataset has 13,077 samples, of which 9,803 are malware and 1,795 are benign.
Samples are categorized as follows:
1. Adware (1,253)
2. Banking (2,100)
3. SMS malware (3,904)
4. Riskware (2,546)
5. Benign (1,795)
The publishers have provided us with three different types of files. These files are as
follows:
Capturing logs
APK files
Csv files
The csv files were of three different types: static analysis records, dynamic analysis
records, and binder calls. We have used the binder calls csv file, which contains combined
static and dynamic records. It has 470 features and around 15,000 records.
In our project we have used all of these files. APK files were used for static and dynamic
analysis. We downloaded 500 APKs and performed static and dynamic analysis on them using
different analysis tools such as JD-GUI. This task was quite time-consuming as it is manual.
Capturing logs were used to generate the data again. The provided dataset has no reference
to the APKs, so we needed to regenerate it with the names of the APK files included in the
csv file, so that we can cross validate the results of static and dynamic analysis with the
machine learning results.
Some other recent datasets are also publicly available. Their comparison is given in table
2.1, which shows that the CIC-MalDroid dataset is the most recent dataset for which APKs are
also available. The Inves-AndMal dataset is a year older, and the latest one (CIC-AndMal)
doesn’t have APKs. Considering these reasons, we selected the CIC-MalDroid dataset.
[19]
Malware categories present in the three latest datasets are shown in table 7.2. Adware is
present in all datasets, while banking malware, SMS malware, riskware, ransomware,
scareware, and Trojans are each missing from at least one dataset. This suggests that adware
is the most common category on Android. Because adware takes the form of advertisements, it
is easy to target a large community with little risk.
Dataset             Adware  Banking  SMS malware  Riskware  Ransomware  Scareware  Trojan
Inves-AndMal [20]   ✔       -        ✔            -         ✔           ✔          -
CIC-MalDroid [19]   ✔       ✔        ✔            ✔         -           -          -
CIC-AndMal [21]     ✔       ✔        -            ✔         ✔           ✔          ✔
[22]
Figure 7.3: Extraction of Dataset
Figure 7.4: Extraction of Dataset
This extracted dataset will be provided to the malware analyst to perform experiments and to
get predictions for the data of new APKs. We will implement different machine learning
algorithms and give users the opportunity to run and tune them according to their
requirements.
7.3. ML algorithms
A machine learning algorithm is the technique by which an Artificial Intelligence system
performs its task, generally predicting output values from the data given to it. There are
mainly four types of ML algorithms:
Supervised ML algorithms - they have input data and class labels.
Unsupervised ML algorithms - they do not have class labels.
Transfer learning - uses knowledge from a previous task to complete a new but related task.
Reinforcement learning - rewards desired behaviors and punishes undesired ones to guide the
learning.
For the implementation of machine learning algorithms, the base paper [17] used a PC with a
3.60 GHz Core i7-4790 CPU and 32 GB RAM. Our system has a 3.20 GHz Core i3-2600 CPU and 4 GB
RAM, so we could not use it to run these algorithms. The ML algorithms are therefore
implemented in Google Colab notebooks, which are Jupyter notebooks that run in the cloud.
A decision tree is a tree-like structure that is used as a model for classifying data. A
decision tree consists of three types of nodes: the root node, decision (internal) nodes,
and leaf nodes.
Figure 7.5: Decision tree example
Step 1:
Step 2:
Step 3:
Just as we calculated E for Sunny, we calculate it for Overcast and Rainy and add the three
values, each weighted by the fraction of samples in that branch, to get the entropy for
Outlook.
After finding the entropy of all 4 attributes, find the information gain of each attribute
$A$ using the formula $Gain(S, A) = E(S) - \sum_{v} \frac{|S_v|}{|S|} E(S_v)$, where $S_v$
is the subset of $S$ for which $A$ has the value $v$.
Choose the attribute that gives the highest information gain after the split, i.e. Outlook
with a gain of 0.247.
Step 5:
Step 6:
The Sunny and Rainy branches need to be split further. They can be split using Temperature,
Humidity or Windy. Let’s consider Rainy first: Humidity produces a homogeneous group.
Step 7:
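The entropy and information gain computations used in these steps can be sketched in a few lines. We assume here that the worked example is the classic 14-day weather dataset (9 "Yes" and 5 "No" days); the per-branch counts below come from that assumption, not from the figures of this report.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - weighted


# Assumed counts: Sunny 2 Yes / 3 No, Overcast 4 Yes / 0 No, Rainy 3 Yes / 2 No.
data = (
    [{"Outlook": "Sunny", "Play": "Yes"}] * 2
    + [{"Outlook": "Sunny", "Play": "No"}] * 3
    + [{"Outlook": "Overcast", "Play": "Yes"}] * 4
    + [{"Outlook": "Rainy", "Play": "Yes"}] * 3
    + [{"Outlook": "Rainy", "Play": "No"}] * 2
)
gain = information_gain(data, "Outlook", "Play")
print(round(gain, 3))  # 0.247
```

Under these assumed counts the gain for Outlook matches the value 0.247 quoted above.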
We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:
Criterion:
It is the function to measure the quality of a split. Supported criteria are “gini” for the
Gini impurity and “entropy” for the information gain. Gini Impurity measures the
divergences between the probability distributions of the target attribute’s values and
splits a node such that it gives the least amount of impurity. Information gain uses the
entropy measure as the impurity measure and splits a node such that it gives the most
amount of information gain.
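To make the two criteria concrete, the sketch below computes the weighted impurity of two candidate splits with both measures. The label lists are made up for illustration; a lower weighted impurity means a better split under either criterion.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Made-up node with 5 malware and 5 benign samples and two candidate splits.
parent = ["malware"] * 5 + ["benign"] * 5
split_a = (["malware"] * 4 + ["benign"], ["malware"] + ["benign"] * 4)
split_b = (["malware"] * 3 + ["benign"] * 2, ["malware"] * 2 + ["benign"] * 3)

for name, split in (("A", split_a), ("B", split_b)):
    print(name,
          round(weighted_impurity(*split, gini), 3),
          round(weighted_impurity(*split, entropy), 3))
```

In this example both measures prefer split A, which is the common case; the two criteria only occasionally choose different splits.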
Figure 7.7: Criterion
min_samples_split:
Figure 7.2: minimum sample split
min_samples_leaf:
The minimum number of samples required to be at a leaf node. A split point at any depth will
only be considered if it leaves at least min_samples_leaf training samples in each of the
left and right branches. It is used to control over-fitting by requiring each leaf to hold
more than one sample, which ensures that the tree cannot overfit the training dataset by
creating many small branches, each exclusively for one sample.
Figure 7.8: minimum sample leaf
max_features:
It is the number of features to consider when looking for the best split. Every time there
is a split, the algorithm looks at this number of features, takes the one with the best
value of the split criterion (e.g. entropy), and creates two branches according to that
feature. Another use of max_features is to limit overfitting: by choosing a reduced number
of features, we can increase the stability of the tree and reduce variance (variability in
the model prediction).
Figure 7.9: max features
Results:
The accuracy we have obtained for the Decision Tree algorithm is 92.16% while that
mentioned in the research paper is 90.75%.
· KNN calculates the distance of the test data point to all points of the training data to
make a prediction.
· The test data point is assigned the class that the majority of its nearest training data
points have.
· Select K.
· Among the K nearest neighbors, count the number of data points in each category.
· Assign the new data point to the category that the largest number of neighbors belong to.
· Model is ready.
To choose K, plot the error rate against K over a chosen range and select the K with the
lowest error rate.
and 7 for the new tissue, respectively. We take k=3 and use the data of already existing
tissue papers (categorized as good or bad) to apply the KNN algorithm.
We compute the Euclidean distance between the X1 and X2 values of the new tissue paper and
those of every existing record, i.e. the distance to all the points.
Afterwards, we select the 3 neighbors with the minimum distance and look at their labels.
As two of them have the label ‘Good’, the new tissue is categorized as good.
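The procedure of this example can be sketched as follows. The table of existing tissues is not reproduced in this section, so the four training records and the query point (3, 7) below are illustrative values we assume; only k=3 and the Euclidean distance come from the text.

```python
from collections import Counter
from math import dist


def knn_predict(training, query, k):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Assumed (X1, X2) values and labels of existing tissue papers.
tissues = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
new_tissue = (3, 7)

prediction = knn_predict(tissues, new_tissue, k=3)
print(prediction)  # Good
```

With these values the three nearest neighbors are two 'Good' tissues and one 'Bad' tissue, so the vote gives 'Good', as in the description above.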
We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:
n_neighbors:
It is the number of neighbors to use for kneighbors queries. Its default value is 5.
leaf_size:
This parameter is passed to BallTree or KDTree (both are data structures that can be used
for the neighbor search in kNN). It can affect the speed of construction and querying, as
well as the memory required to store the tree. The optimal value depends on the nature of
the problem. Its default value is 30.
p:
p is the power parameter for the Minkowski metric. When p = 1, this is equivalent to using
manhattan_distance (l1), and when p = 2, to euclidean_distance (l2). Its default value is 2.
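The effect of p can be seen with a direct implementation of the Minkowski distance. The function below is our own small sketch, and the two points are arbitrary.

```python
def minkowski(a, b, p=2):
    """Minkowski distance between two points of equal dimension."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


point_a, point_b = (3, 7), (7, 4)
print(minkowski(point_a, point_b, p=1))  # 7.0, Manhattan: |3-7| + |7-4|
print(minkowski(point_a, point_b, p=2))  # 5.0, Euclidean: sqrt(16 + 9)
```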
Grid Search CV:
Rather than using a greedy approach, we use Grid Search CV to find the optimal values of the
hyper parameters.
We set the parameter refit to True. This parameter refits an estimator on the whole dataset
using the best parameters found. First, the search runs the same loop with cross-validation
to find the best parameter combination. Once it has the best combination, it runs fit again
on all data passed to fit (without cross-validation) to build a single new model using the
best parameter setting.
The verbose parameter controls the text output describing the process: the higher the
number, the more messages are printed.
When verbose > 1: Computation time for each fold and parameter candidate is displayed.
When verbose > 3: Fold and candidate parameter indexes are also displayed together with the
starting time of the computation.
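To show what the search does internally, the following is a simplified pure-Python sketch of a grid search with cross-validation and a final refit. It is not scikit-learn's implementation: TinyKNN, k_fold_indices and grid_search_cv are hypothetical helpers written for this illustration, the fold assignment is a simple modulo rule, and the data is made up.

```python
from collections import Counter
from itertools import product
from statistics import mean


class TinyKNN:
    """Minimal k-nearest-neighbour classifier, used only to exercise the search."""

    def __init__(self, n_neighbors=1, p=2):
        self.n_neighbors = n_neighbors
        self.p = p

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, point):
        def distance(other):
            return sum(abs(a - b) ** self.p for a, b in zip(other, point)) ** (1 / self.p)

        order = sorted(range(len(self.X)), key=lambda i: distance(self.X[i]))
        votes = Counter(self.y[i] for i in order[: self.n_neighbors])
        return votes.most_common(1)[0][0]


def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; fold f tests samples with i % n_folds == f."""
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


def grid_search_cv(make_model, param_grid, X, y, n_folds=3):
    """Score every parameter combination by mean fold accuracy, then refit the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, test in k_fold_indices(len(X), n_folds):
            model = make_model(**params).fit([X[i] for i in train], [y[i] for i in train])
            hits = sum(model.predict(X[i]) == y[i] for i in test)
            fold_scores.append(hits / len(test))
        score = mean(fold_scores)
        if score > best_score:
            best_params, best_score = params, score
    # Counterpart of refit=True: one final model trained on all the data.
    final_model = make_model(**best_params).fit(X, y)
    return final_model, best_params, best_score


# Made-up, well-separated two-class data.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2),
     (9, 9), (10, 9), (9, 10), (10, 10), (11, 10), (10, 11)]
y = ["benign"] * 6 + ["malware"] * 6
grid = {"n_neighbors": [1, 3], "p": [1, 2]}
model, best_params, best_score = grid_search_cv(TinyKNN, grid, X, y)
print(best_params, best_score)
```

Because every combination is evaluated on every fold, the cost grows with the product of the grid sizes, which is why the grids are usually kept small.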
Figure 7.4: Grid Search CV
Results:
Decision trees are highly sensitive to the training data: changing the data a little can
result in a completely different decision tree. This leads to high variance, so the model
might fail to generalize. A random forest is a collection of multiple randomized trees and
is hence less sensitive to the training data.
Working:
In Random Forest, we randomly draw subsets of the original data, keeping the number of rows
in each subset equal. We build a decision tree for each subset independently, and for each
tree we randomly select a subset of the features. For prediction, the test data point is
passed to each decision tree. The class/category predicted by the largest number of decision
trees is assigned to the test data point.
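The three ingredients of this procedure (row sampling, feature sampling and voting) can be sketched as below. We assume that the row subsets are bootstrap samples, i.e. drawn with replacement, as in the standard random forest; the helper names, row stand-ins and feature names are made up, and no trees are actually trained here.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw as many rows as the original has, sampling with replacement."""
    return [rng.choice(rows) for _ in range(len(rows))]


def random_feature_subset(features, n_features, rng):
    """Pick n_features distinct feature names at random."""
    return rng.sample(features, n_features)


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-ins for dataset rows
features = ["read", "write", "sendto", "open", "close"]  # made-up feature names

for tree_id in range(3):
    sample = bootstrap_sample(rows, rng)
    subset = random_feature_subset(features, 2, rng)
    print(tree_id, sample, subset)

# Pretend three trained trees returned these labels for one test point.
print(majority_vote(["malware", "benign", "malware"]))
```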
We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:
Grid Search CV is used for finding the optimal values of the hyper parameters.
Results:
The best found parameters are 'class_weight': 'balanced', 'criterion': 'gini', 'max_features':
'sqrt', 'n_estimators': 100. The accuracy we’ve obtained is 94.94% and that mentioned in
research paper is 93.44%.
7.3.4. Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm commonly used for classification problems.
● The data points in the SVM algorithm are plotted in n-dimensional space (n is the number
of features), where the value of each feature is the value of a particular coordinate.
● Afterwards, a hyperplane is found that separates the two classes well for the purpose of
classification.
● Support vectors are the coordinates of the individual observations that lie closest to the
hyperplane. The SVM classifier is the frontier (hyperplane/line) that best segregates the
two classes. [18]
Working
SVM identifies the right hyperplane by considering the data points. In this scenario,
hyperplane A does not differentiate the two classes well, and neither does hyperplane C.
Hyperplane B differentiates the two classes well, so here hyperplane B is selected.
Let’s consider another scenario in which all the hyperplanes differentiate the classes
well. Which hyperplane should we choose?
Figure 7.6: SVM Example 1.1
We identify the right hyperplane by maximizing the distance between the hyperplane and the
nearest data points of both classes; this distance is called the margin. In this scenario
the right hyperplane is C.
An important point is that we don’t need to select the hyperplane manually, as SVM finds the
maximum-margin hyperplane itself during training; its kernel technique additionally lets it
separate data that is not linearly separable.
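The margin comparison in the second scenario can be illustrated numerically. In the sketch below, the points and the three candidate lines are made-up 2D values chosen so that every line separates the classes; the code only compares given candidates, whereas a real SVM solves an optimisation problem to find the best hyperplane directly.

```python
from math import hypot


def margin(w, b, points):
    """Smallest distance from any point to the line w[0]*x + w[1]*y + b = 0."""
    norm = hypot(*w)
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)


# Made-up two-class data; all three candidate lines separate it.
class_one = [(4, 4), (5, 3)]
class_two = [(1, 1), (0, 2)]
candidates = {
    "A": ((1, 0), -1.5),  # vertical line x = 1.5
    "B": ((1, 0), -2.5),  # vertical line x = 2.5
    "C": ((1, 1), -5.0),  # diagonal line x + y = 5
}

points = class_one + class_two
margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Here the diagonal line C has the widest margin, so it is the one a maximum-margin classifier would prefer among these three.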
We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:
Grid Search CV has been used to find the optimal values of the parameters.
Results:
The optimal values of the hyperparameters found by Grid Search CV are C=100000 and
gamma=1e-08. The accuracy we obtained for SVM is 86.52%, while the accuracy reported in the
research paper is 78.1%.
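Grid Search CV evaluates every combination of values in a parameter grid and keeps the best-scoring one. The sketch below shows only that exhaustive-search step in plain Python. The grid values are hypothetical, since the report does not list the values that were searched, and `toy_score` is a made-up stand-in that is constructed to peak at the reported optimum purely for illustration. In scikit-learn, `GridSearchCV` would instead score each combination by cross-validating the SVM on the training data.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the values actually searched are not given in the report.
param_grid = {"C": [1, 100, 100000], "gamma": [1e-4, 1e-6, 1e-8]}

def toy_score(C, gamma):
    """Stand-in for a cross-validated accuracy; NOT a real SVM evaluation."""
    return 0.5 + 0.1 * (C == 100000) + 0.1 * (gamma == 1e-8)

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

The cost of this approach grows with the product of the grid sizes (here 3 x 3 = 9 evaluations), which is one reason tuning can be time consuming.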
The accuracies of the machine learning algorithms in our experiment are compared with those
of the research paper [19] in the table below.
7.4. Cross validation
Besides the other modules, one module of this project compares malware analysis techniques
to determine which technique is better. We aimed to cross-validate the static and dynamic
analysis techniques: we analyzed the difference between the accuracy of static and dynamic
analysis and the accuracies of the machine learning algorithms, and compared the results of
both types of technique.
For this purpose we selected 500 APKs, 100 from each category, as our time span allowed.
We first performed static and dynamic analysis, analyzing the five hundred APKs one by one
using different tools such as APK tool, JD-GUI, Mob-SF etc. The results of this analysis
were maintained in an Excel file. It contained each APK's name, the prediction from each tool
(malware or benign), the final class label, which was decided by the majority label of the
tools, and the original label of the APK. A sample of this Excel file is shown in the
accompanying figure.
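The majority-label rule described above can be sketched as follows. The APK names and per-tool predictions are hypothetical examples, not rows from the actual results file, and the sketch assumes three tools per APK.

```python
from collections import Counter

def majority_label(tool_predictions):
    """Return the label predicted by most tools.

    With an odd number of tools and two labels there can be no tie; on a tie,
    Counter.most_common keeps the label that was seen first.
    """
    counts = Counter(tool_predictions)
    return counts.most_common(1)[0][0]

# Hypothetical rows: APK name -> prediction from each tool.
tool_results = {
    "sample_a.apk": ["malware", "malware", "benign"],
    "sample_b.apk": ["benign", "benign", "benign"],
}

final_labels = {apk: majority_label(preds) for apk, preds in tool_results.items()}
print(final_labels)
```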
After static and dynamic analysis, we needed to apply the machine learning algorithms to a
dataset of the same 500 APKs. As the available dataset does not contain APK names, it was
impossible to select the records of the same APKs from it; we therefore extracted the data
from the .json files available for each APK. We performed the extraction with a self-written
Python script, which first unzips the folder and then exports the data to a .csv file. After
that we applied all four machine learning algorithms used throughout this research: KNN,
SVM, RF, and DT. A code snippet showing the ML algorithm implementation is given in the
accompanying image.
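The extraction step described above (unzip the folder, read the .json file of each APK, export the data to a .csv file) might look like the following sketch. It is not the project's actual script. It assumes that each .json file holds a flat object of feature names and values and that the file name identifies the APK; the function name and the `apk_name` column are hypothetical.

```python
import csv
import json
import zipfile
from pathlib import Path

def export_json_features(zip_path, extract_dir, csv_path):
    """Unzip the archive, read every .json file, and write one CSV row per file."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_dir)

    records = []
    for json_file in sorted(Path(extract_dir).rglob("*.json")):
        with open(json_file, encoding="utf-8") as handle:
            record = json.load(handle)        # assumed: a flat JSON object
        record["apk_name"] = json_file.stem   # assumed: file name identifies the APK
        records.append(record)

    # Use the union of all keys so files with differing fields fit one table;
    # fields missing from a file are left empty.
    fieldnames = sorted({key for record in records for key in record})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Keeping the APK name as a column is what makes it possible to match each machine learning record with the same APK in the static and dynamic analysis results.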
After obtaining the results of the machine learning algorithms in the form of accuracies, we
needed to calculate the accuracy of static and dynamic analysis from the maintained Excel
file. For this purpose we used the Python library Sklearn (scikit-learn). The library
function to calculate accuracy is as follows:
The compulsory parameters of this function are the actual labels and the predicted labels.
As we maintained both of these in our Excel file of static and dynamic analysis results, we
used it like this:
Figure 7. 11: use of compulsory parameters
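For reference, the accuracy computed from the actual and predicted labels is simply the fraction of matching labels. The sketch below is a plain-Python equivalent with hypothetical label columns; scikit-learn's `accuracy_score(y_true, y_pred)` in `sklearn.metrics` takes the same two arguments in the same order and, by default, returns the same fraction.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("no labels given")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical columns from the results sheet: original label vs. final class label.
actual = ["malware", "benign", "malware", "benign"]
predicted = ["malware", "benign", "benign", "benign"]

score = accuracy(actual, predicted)  # 3 of the 4 labels match
print(score)
```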
After obtaining all the accuracies, we drew a bar graph to compare the accuracies of all the
malware analysis techniques. The results are shown in the graph.
The above graph shows the accuracy of each machine learning algorithm along with the accuracy
of static and dynamic analysis. From the results it is quite clear that the machine learning
algorithms perform very well compared to static and dynamic analysis. Random Forest
performed best, with an error rate of 4.7%, while the error rate of static and dynamic
analysis is 24.2%. Although the machine learning algorithms also have some error, it is much
lower than that of static and dynamic analysis: Random Forest performs almost 20 percentage
points better. This shows that, for malware analysis, machine learning performs better than
static and dynamic analysis. The major reason for the poor performance of static and dynamic
analysis is that this technique involves human judgement to a great extent, so human error
affects its accuracy.
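The "almost 20" figure quoted above follows directly from the two error rates. The short sketch below only restates that arithmetic, using the error rates given in the text; the accuracies are derived as the complement of the error rate.

```python
# Error rates quoted in the text, in percent.
error_rate = {"Random Forest": 4.7, "Static and dynamic analysis": 24.2}

# Accuracy is the complement of the error rate.
accuracy_pct = {name: round(100 - err, 1) for name, err in error_rate.items()}

# Gap between the two approaches, in percentage points.
gap_points = round(
    error_rate["Static and dynamic analysis"] - error_rate["Random Forest"], 1
)
print(accuracy_pct, gap_points)
```

The gap is 19.5 percentage points, corresponding to accuracies of 95.3% for Random Forest and 75.8% for static and dynamic analysis.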
Chapter 8
Software Deployment
The software is Windows-based, and a setup for Windows will be provided.
We will provide the user with a setup file (.exe). The user right-clicks the .exe and selects
Run as administrator.
The installation completes smoothly by clicking Next in the setup window. To change the
installation location, click Browse; otherwise continue with the default. After installation,
locate the directory named Database inside the directory in which you installed the
software. If you did not change the directory during installation, it will be
C: /Users/”username”/Program Files(x86)/Malware Analyser/Database. The file named
‘malware.sql’ inside the Database folder will be used in the next steps.
The user needs to download and install XAMPP on their machine. After installation, start
XAMPP as shown in the figure below.
Figure 8. 2: XAMPP 1
Click the “Admin” button next to “Stop” in the MySQL service row
(highlighted in the image below).
Figure 8. 3: XAMPP 2
The localhost site will open in your browser, as shown below.
Figure 8. 4: phpMyAdmin
Create a new database named malware by following the steps shown below.
After creating the empty database, we need to import the malware.sql file into it. Follow
the steps below to import the database.
Figure 8. 6: Import Database
After the import, verify that the database has been imported by clicking on the
database in the left-side pane. It will look like this.
Now return to the initially extracted folder and double-click the exe file. After some
time a black screen will appear, as shown in the image below; do not close it until you
want to close the software.
After a short time the software will start and you can use it. Keep in mind that XAMPP
must remain started, as shown in step 1, while you are using the software.
Chapter 9
9. Project Evaluation
In the FYP-1 final evaluation of the project, some amendments were suggested. We have
applied the suggested changes accordingly. The following suggestions were given by the
respected teachers:
Table 9. 1: Project Evaluation
Sr.No. Suggestions
1 Add a progress bar while loading data, as loading takes a lot of time.
2 Add a progress bar for pruning, as it is time consuming, so that the user is aware of the
time required for this task.
5 Malware analysts can view dynamic accuracy graphs during hyperparameter tuning.
References
https://algorithmia.com/blog/the-importance-of-machine-learning-data.
[18] "maldroid-2020," [Online]. Available: https://www.unb.ca/cic/datasets/maldroid-2020.html.
[19] S. K. A. F. A. F. R. A. D. &. G. A. A. Mahdavifar, "Dynamic Android Malware Category
Classification using Semi-Supervised Deep Learning," 2020 IEEE Intl Conf on Dependable,
Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl
Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress,
pp. 515-522, 08 2020.
[20] L. K. A. F. A. &. L. A. H. Taheri, "Extensible android malware detection and family
classification using network-flows and API-calls," 2019 International Carnahan Conference
on Security Technology (ICCST), pp. 1-8, 2019.
[21] D. S. L. B. K. G. L. A. H. G. F. &. M. F. Keyes, "EntropLyzer: Android Malware
Classification and Characterization Using Entropy Analysis of Dynamic Characteristics,"
2021 Reconciling Data Analytics, Automation, Privacy, and Security: A Big Data Challenge
(RDAAPS), pp. 1-12, 2021.
[22] A. L. A. H. K. G. T. L. G. F. &. M. F. Rahali, "DIDroid: Android Malware Classification
and Characterization Using Deep Image Learning," 2020 the 10th International Conference on
Communication and Network Security, pp. 70-82, 2020.
[23] C. HOFFMAN, "the-case-against-root-why-android-devices-dont-come-rooted," 20 6 2017.
[Online]. Available:
https://www.howtogeek.com/132115/the-case-against-root-why-android-devices-dont-come-rooted/.
[24] bullguard.com, "android-rooting-risks," [Online]. Available:
https://www.bullguard.com/bullguard-security-center/mobile-security/mobile-threats/android-rooting-risks.aspx.
[25] K. Casey, "top-7-vulnerabilities-in-android-applications-2019," 20 09 2019. [Online].
Available: https://codersera.com/blog/top-7-vulnerabilities-in-android-applications-2019/.
[26] S. Srivatsa, "android_security," 15 12 2014. [Online]. Available:
https://www.cse.wustl.edu/~jain/cse571-14/ftp/android_security.pdf.
[27] tutorialspoint, "tutorialspoint.com," [Online]. Available:
https://www.tutorialspoint.com/android/android_architecture.htm.
[28] Z. Banach, "session-hijacking," 22 08 2019. [Online]. Available:
https://www.netsparker.com/blog/web-security/session-hijacking/.