
Final Report FYP 12 Aug

This document presents a project that aims to develop an Android malware analysis system using machine learning techniques. A team of 3 students from the Capital University of Science and Technology in Islamabad, Pakistan worked on the project under the supervision of Dr. Qamar Mehmood in the spring of 2022. The project involves performing static and dynamic analysis on Android APK files from the CIC-Maldroid 2020 dataset and implementing machine learning algorithms to detect malware. It also describes developing a desktop application for malware analysts, researchers and Android users with options to facilitate malware analysis.


Android Malware Analysis System using Machine Learning Techniques

Safia Mansoor (BCS183086)

Quratulain Tariq (BCS183097)

Abdullah Arif (BCS183120)

Spring 2022
Supervised By
Dr. Qamar Mehmood

Department of Computer Science


Capital University of Science & Technology, Islamabad
Capital University of Science and Technology, Islamabad Department of Computer Science
Submission Form for Final-Year PROJECT REPORT

Version: V 3.0
NUMBER OF MEMBERS: 3
TITLE: Android Malware analysis system using machine learning techniques
SUPERVISOR NAME: Dr. Qamar Mehmood

MEMBER NAME          REG. NO.     EMAIL ADDRESS
Quratulain Tariq     BCS183097    [email protected]
Safia Mansoor        BCS183086    [email protected]
Abdullah Arif        BCS183120    [email protected]

MEMBERS’ SIGNATURES

Supervisor’s Signature

APPROVAL CERTIFICATE

This project, entitled “Android Malware Analysis System using Machine Learning Techniques”,
has been approved for the award of

Bachelor of Science in Computer Science

Committee Signatures:

Supervisor:

(Dr. Qamar Mahmood)

Project Coordinator:

(Mr. Syed Abdul Basit)

Head of Department:

(Dr. Abdul Basit)

DECLARATION

We hereby declare that no portion of the work in this final-year project has been submitted in
support of an application for another degree or qualification of this or any other institute. We
further declare that this undergraduate project, neither as a whole nor in part, has been
copied from any source, except where references have been provided.

MEMBERS’ SIGNATURES

ACKNOWLEDGEMENT

We would like to thank Allah Almighty for enabling us to complete this project and its report.
We are highly obliged to Dr Qamar Mahmood and HoD Dr Abdul Basit for giving us the
opportunity to work on this project. We dedicate this acknowledgement to all our professors, who
shared their ideas and knowledge and guided us throughout the project. We would
also like to thank our seniors, who have been a source of encouragement and always extended their
help.

Executive Summary

Android, the mobile operating system developed by Google, has become the most popular of all
operating systems. Because this OS has vulnerabilities, attackers actively target it, which makes
detecting Android malware important. There are various techniques for detecting Android malware.
In this project we worked mainly with static and dynamic analysis along with machine learning
(ML) algorithms. We performed static and dynamic analysis on the Android APK files provided in
the CIC-Maldroid 2020 dataset, and separately implemented ML algorithms on the dataset. We then
compared the accuracies of the two detection approaches, i.e. static and dynamic analysis versus
ML algorithms. All four ML algorithms obtained better accuracy than static and dynamic analysis.
We contributed to research by obtaining higher accuracies for the ML algorithms than those
reported in the CIC-Maldroid 2020 research paper. We also built a desktop application with
various options to assist malware analysts, researchers and Android mobile users.

Table of Contents

DECLARATION ……… iii
ACKNOWLEDGEMENT ……… iv
Executive Summary ……… v

Chapter 1 ……… 1
Introduction ……… 1
1.1. Project Introduction ……… 1
1.2. Literature Review ……… 6
1.2.1 Vulnerabilities in Android Applications ……… 6
1.2.1.1 Binary protection ……… 6
1.2.1.2. Insufficient Transport Layer Protection ……… 7
1.2.1.3. Cryptography-Improper Certificate Validation ……… 8
1.2.1.4. Brute Force – User Enumeration ……… 9
1.2.2 Vulnerable Android Libraries ……… 10
1.2.2.1. Android package installer ……… 10
1.2.2.2. Android Browser AOSP ……… 11
1.3. Existing Examples / Solutions ……… 10
1.4. Stakeholders ……… 11
1.4.1. Malware analysts ……… 11
1.4.2. Android mobile users ……… 11
1.4.3. Researchers ……… 12
1.5. Business scope ……… 12
1.6. Useful Tools and Technologies ……… 13
1.7. Project work breakdown ……… 14
1.8. Timeline ……… 15

Chapter 2 ……… 16
Requirement Specification and Analysis ……… 16
2.1. Functional Requirements ……… 16
2.2. Non-Functional Requirements ……… 18
2.3. Selected Functional Requirements ……… 18
2.4. System Use Case Modeling ……… 19
2.4.1. Admin Use-case Diagram ……… 19
2.4.2. Mobile-user Use-case Diagram ……… 24
2.4.3. Malware Analyst Use-case Diagram ……… 29
2.4.4. Researcher Use-case Diagram ……… 45
2.5 System Sequence Diagrams ……… 47
2.6. Domain Model ……… 66

Chapter 3 ……… 67
System Design ……… 67
3.1. Software Architecture ……… 67
3.1.1. Presentation layer ……… 67
3.1.2. User interface layer ……… 67
3.1.3. Business logic layer ……… 67
3.1.4. Data layer ……… 68
3.2. Data Flow Diagrams ……… 69
3.2.1. Level-0 ……… 69
3.2.2. Level-1 ……… 70
3.2.3. Level 2 ……… 70
3.3. Entity Relation Diagram ……… 72
3.4. Database Schema ……… 73
3.5. User Interface Design ……… 74

Chapter 4 ……… 76
Software Development ……… 76
4.1. Coding Standards ……… 76
4.1.1. Indentation ……… 76
4.1.2. Statement Standards ……… 77
4.1.3. Naming Convention ……… 78
4.2. Development Environment ……… 79
4.3. Database management System ……… 79
4.4. Software Description ……… 80
4.4.1. Authentication module ……… 80
4.4.2. Exploratory Data Analysis Module ……… 83
4.4.3. Report Generation Module ……… 84
4.4.4 Get dataset Module ……… 86
4.4.5. Apply ML algorithm module ……… 87
4.4.6. Tune ML algorithm module ……… 89

Chapter 5 ……… 91
Software Testing ……… 91
5.1. Testing Methodology ……… 91
5.1.1. Black Box Testing ……… 91

Chapter 6 ……… 109
Static and Dynamic Analysis ……… 109
6.1. Static Analysis of Android Applications ……… 109
6.1.1. APK file and it’s Structure ……… 109
6.1.2. Activity Diagram ……… 110
6.1.3. Tools usage ……… 111
6.1.4. Techniques ……… 112
6.2. Dynamic analysis of android APKs ……… 115
6.2.1. Tools Used ……… 115

Chapter 7 ……… 118
ML Implementation and Cross Validation ……… 118
7.1. Selection of dataset ……… 118
7.2. Extraction of dataset ……… 121
7.3. ML algorithms ……… 122
7.3.1. Environment Specification ……… 123
7.3.2. Implementation of ML algorithms ……… 123
7.3.3. Decision Tree Algorithm ……… 123
7.3.2. K- Nearest Neighbor (KNN) ……… 129
7.3.3. Random Forest Algorithm ……… 133
7.4. Cross validation ……… 137

Chapter 8 ……… 141
Software Deployment ……… 141
8.1. Installation and Deployment Process Description ……… 141
8.1.1. Setup Dependency ……… 142

Chapter 9 ……… 146
9. Project Evaluation ……… 146

References ……… 147

List of Figures
Figure 1. 1: Architecture of Android OS [5]...................................................................................2
Figure 1. 2: Trojan attack on Android.............................................................................................3
Figure 1. 3: Gaining root access in Android....................................................................................6
Figure 1. 4: Session hijacking [26]..................................................................................................7
Figure 1. 5: Man in the Middle SSL attack.....................................................................................8
Figure 1. 6: Compromising authentication by user enumeration using a Brute force attack..........9
Figure 1. 7: Project work breakdown............................................................................................14
Figure 1. 8: Project timeline..........................................................................................................15

Figure 2. 1: Admin use case diagram............................................................................................19


Figure 2. 2: Mobile user use case diagram....................................................................................24
Figure 2. 3: Malware Analyst use case diagram............................................................................29
Figure 2. 4: Researcher Use case diagram.....................................................................................45
Figure 2. 5: SSD request to view general report............................................................................47
Figure 2. 6: SSD request to view individual report.......................................................................47
Figure 2. 7: SSD admin request sign in.........................................................................................48
Figure 2. 8: SSD admin request sign out.......................................................................................48
Figure 2. 9: SSD for mobile user sign up......................................................................................49
Figure 2. 10: SSD for mobile user sign in.....................................................................................49
Figure 2. 11: SSD for mobile behavior information......................................................................50
Figure 2. 12: SSD to display malware types..................................................................................50
Figure 2. 13: SSD mobile user Sign out........................................................................................50
Figure 2. 14: SSD mobile user request to view protection guidelines...........................................51
Figure 2. 15: SSD mobile user request to view user manual.........................................................51
Figure 2. 16: SSD for researcher signup........................................................................................52
Figure 2. 17: SSD for researcher sign in........................................................................................52
Figure 2. 18: SSD Researcher Sign out.........................................................................................53
Figure 2. 19: SSD Researcher request to view results of algorithms............................................53
Figure 2. 20: SSD Researcher request to view correlation graph..................................................54
Figure 2. 21: SSD Researcher request to display dataset..............................................................54
Figure 2. 22: SSD Researcher request to view cross validation results.........................................55
Figure 2. 23: SSD Researcher request to view user manual..........................................................55
Figure 2. 24: SSD Malware Analyst sign up.................................................................................56
Figure 2. 25: SSD Malware Analyst request to sign in................................................................57
Figure 2. 26: SSD Malware Analyst request to Display Dataset...................................................58
Figure 2. 27: SSD Malware Analyst request to display pruned dataset........................................59
Figure 2. 28: SSD Malware Analyst Sign out...............................................................................60
Figure 2. 29: SSD Malware Analyst request to display results of algorithms...............................60
Figure 2. 30: SSD Malware Analyst request to view correlation graph........................................61
Figure 2. 31: SSD Malware Analyst request to select csv file......................................................62
Figure 2. 32: SSD Malware Analyst request to tune the selected algorithm.................................63
Figure 2. 33: SSD Malware Analyst request to apply machine learning algorithm......................64
Figure 2. 34: SSD Malware Analyst request to view cross validation results...............................65
Figure 2. 35: SSD Malware Analyst request to view user manual................................................65
Figure 2. 36: Domain Model.........................................................................................................66

Figure 3. 1: Software Architecture................................................................................................68


Figure 3. 2: Level 0 Data flow diagram.........................................................................................69
Figure 3. 3: Level 1 Data flow diagram.........................................................................................70
Figure 3. 4: Level 2 Data flow diagram.........................................................................................71
Figure 3. 5: Entity Relationship Diagram......................................................................................72
Figure 3. 6: Database schema........................................................................................................73
Figure 3. 7: Home page.................................................................................................................74
Figure 3. 8: Signup........................................................................................................................75

Figure 4. 1: Indentation example from code.....................................................................................76


Figure 4. 2: Statement standard examples.....................................................................................77

Figure 4. 3: Authentication module...............................................................................................80
Figure 4. 4: Dataset Error..............................................................................................................81
Figure 4. 5: Login Function...........................................................................................................82
Figure 4. 6: Prune Dataset.............................................................................................................83
Figure 4. 7: view dataset................................................................................................................84
Figure 4. 8: Individual report generation.......................................................................................85
Figure 4. 9: Report of selected users.............................................................................................85
Figure 4. 10: Report of all users....................................................................................................86
Figure 4. 11: Get dataset module...................................................................................................86
Figure 4. 12: Dataset......................................................................................................................87
Figure 4. 13: Apply ML Algorithm module..................................................................................88
Figure 4. 14: SVM module............................................................................................................88
Figure 4. 15: Random Forest Module............................................................................................89
Figure 4. 16: Decision Trees module.............................................................................................89
Figure 4. 17: Selection of Algorithm in tuning..............................................................................90
Figure 4. 18: Editing Hyper Parameters in Tuning........................................................................90

Figure 5. 1: Successful test case signup.........................................................................................93


Figure 5. 2: Test case signup Invalid Email..................................................................................93
Figure 5. 3: Test case signup Username already taken..................................................................93
Figure 5. 4: Test case signup Password error................................................................................94
Figure 5. 5: Test case Login invalid Credentials Error..................................................................95
Figure 5. 6: Test case Invalid Credentials Error 2.........................................................................95
Figure 5. 7: Test case View information 1....................................................................................96
Figure 5. 8: Test case View information 2....................................................................................97
Figure 5. 9: Test case View information 3....................................................................................98
Figure 5. 10: Test case View Dataset............................................................................................99
Figure 5. 11: Test case individual report, user exist....................................................................100
Figure 5. 12: Test case individual report, user not exist error.....................................................101

Figure 5. 13: Test case group report generation 1.......................................................................103
Figure 5. 14: Test case group report generation 2.......................................................................103
Figure 5. 15: Test case group report generation 3.......................................................................104
Figure 5. 16: Test Case save csv..................................................................................................105
Figure 5. 17: Test case implement ML algorithms Random Forest............................................108

Figure 6. 1: APK structure files...................................................................................................109


Figure 6. 2: Activity diagram for static analysis..........................................................................110
Figure 6. 3: apktool installation...................................................................................................111
Figure 6. 4: apk tool command....................................................................................................111
Figure 6. 5: jdgui tool command..................................................................................................112
Figure 6. 6: Java file in JD-GUI..................................................................................................113
Figure 6. 7: Permissions in Manifest file.....................................................................................114
Figure 6. 8: Activity diagram for dynamic analysis....................................................................115
Figure 6. 9: Dynamic Analyzer in MobSF..................................................................................116
Figure 6. 10: Activities shown in apk..........................................................................................116
Figure 6. 11: Activity Running....................................................................................................117

Figure 7. 1: Classes of Dataset....................................................................................................119


Figure 7. 2: Extraction of Dataset................................................................................................122
Figure 7. 3: Decision tree example..............................................................................................124
Figure 7. 4: Complete Decision Tree...........................................................................................125
Figure 7. 5: Criterion...................................................................................................................126
Figure 7. 6: minimum sample split..............................................................................................127
Figure 7. 7: minimum sample leaf...............................................................................................128
Figure 7. 8: max features.............................................................................................................129
Figure 7. 9: KNN Example [17]..................................................................................................130
Figure 7. 10: KNN mathematical example..................................................................................131
Figure 7. 11: Grid Search CV......................................................................................................132

Figure 7. 12: SVM Example 1.....................................................................................................135
Figure 7. 13: SVM Example 2 ....................................................................................................135
Figure 7. 14: grid search cv.........................................................................................................136
Figure 7. 15: Excel file of apks results........................................................................................138
Figure 7. 16: Code snippet...........................................................................................................138
Figure 7. 17: Library Function.....................................................................................................139
Figure 7. 18: use of compulsory parameters................................................................................139
Figure 7. 19: Bar Graph...............................................................................................................140

Figure 8. 1: Run as Administrator...............................................................................................141


Figure 8. 2: XAMPP 1.................................................................................................................142
Figure 8. 3: XAMPP 2.................................................................................................................142
Figure 8. 4: php my admin...........................................................................................................143
Figure 8. 5: New Database...........................................................................................................143
Figure 8. 6: Import Database.......................................................................................................144
Figure 8. 7: verify import.............................................................................................................144
Figure 8. 8: Waiting Screen.........................................................................................................145

List of Tables

Table 2. 1: Functional Requirements.............................................................................................16


Table 2. 2: Non-Functional Requirements.....................................................................................18
Table 2. 3: Selected Functional Requirements..............................................................................18
Table 2. 4: Use case 1 - Signup.....................................................................................................20
Table 2. 5: Use case 2 - Sign in.....................................................................................................21
Table 2. 6: Use case 3 view individual report...............................................................................22
Table 2. 7: Use Case 4 - View general report................................................................................23
Table 2. 8: Use Case 5 View mobile’s behavior information........................................................25
Table 2. 9: Use case 6 View information of types of malwares....................................................26
Table 2. 10: Use Case 7 View guidelines for mobile’s protection................................................27
Table 2. 11: Use Case 8 View cross validation results..................................................................28
Table 2. 12: Use Case 9 View dataset...........................................................................................30
Table 2. 13: Use Case 11 View correlation of features.................................................................31
Table 2. 14: Use Case 12 Sort by correlation................................................................................32
Table 2. 15: Use Case 13 Prune dataset.........................................................................................33
Table 2. 16: Use case 14 Select rows............................................................................................34
Table 2. 17: Use Case 15 Select Columns.....................................................................................35
Table 2. 18: Use Case 16 save pruned dataset...............................................................................36
Table 2. 19: Use Case 17 Upload csv............................................................................................37
Table 2. 20: Use Case 18 Select classification algorithm..............................................................38
Table 2. 21: Use Case 19 Tune ML algorithms.............................................................................39
Table 2. 22: Use Case 20 View accuracies....................................................................................40
Table 2. 23: Use Case 21 View visual results...............................................................................41
Table 2. 24: Use Case 22 Take guidance from manual.................................................................42
Table 2. 25: Use Case 23 View correlation graph.........................................................................43
Table 2. 26: Use Case 24 view dynamic accuracy graph..............................................................44
Table 2. 27 - Use Case 22 View dataset details.............................................................................46
Table 4. 1: Naming Style
Table 5. 1: Test case sign up
Table 5. 2: Test Case Sign in
Table 5. 3: Test case view information
Table 5. 5: Test Case View Dataset
Table 5. 6: Test Case Individual Report Generation
Table 5. 7: Test Case Role based report generation
Table 5. 8: Test Case csv
Table 5. 9: Test Case Upload csv
Table 5. 10: Test Case Upload csv
Table 5. 11: Test Case Apply ml algorithms
Table 7. 1: Comparison of datasets
Table 7. 2: Comparison of features of benchmark datasets with CICMalDroid-2020
Table 7. 3: Comparison of Machine Learning Algorithms
Table 9. 1: Project Evaluation

Chapter 1
Introduction

The use of the Android OS has increased rapidly. Many of our laptops, mobiles, tablets, watches
and other devices run this OS. Being the most popular operating system, Android has become a
prominent target for attackers. Android malware is one of the most dangerous threats on the
internet and has been on the rise for several years. Malware can cause severe damage and
compromise the confidentiality and integrity of our data. Therefore, it is important to
analyze and detect any malware and eliminate it from our systems.

1.1. Project Introduction

In today’s world, the Android mobile operating system, developed by Google, has become the
most popular of all operating systems. The history of Android starts in October 2003, when
the system was created by a California-based company named Android Inc. for mobiles and
digital cameras. Android Inc. was acquired by Google in 2005 and, two years later, Google
released Android as a mobile operating system [1].

On 23rd September 2008, Android 1.0, the first commercial version, was released. Android 1.0
and 1.1 did not have specific code names [2]. Android versions 1.5 and onwards are named
after desserts, such as Cupcake (1.5), Donut (1.6), Eclair (2.0), Froyo (2.2) and
Gingerbread (2.3) [3]. At the time of writing, the current stable version of Android is
Red Velvet Cake (11), which was released on 8th September 2020 [4]. Google no longer supports
Nougat (7.0) and earlier versions.

The architecture of the Android operating system involves four layers. Android applications
are at the top. The Application Framework (second) layer offers numerous higher-level
services to applications in the form of Java classes. A set of libraries is present at the
third layer; these include libc, the SQLite database, SSL libraries etc. The Android runtime,
also on the third layer, includes some core libraries and the Dalvik virtual machine, a
virtual machine similar to the Java virtual machine that is specially designed and optimized
for Android. It relies on core features of Linux, such as memory management and
multi-threading, that are exposed through the Java language. Linux is at the bottom of all
layers. Linux provides privilege control: each app that runs on Android runs in its own
process, with an ID assigned by Linux.

Figure 1. 1: Architecture of Android OS [5]

The share of the Android mobile OS is 72.73% worldwide [5]. Android can run on multiple types
of devices such as TVs, tablets and mobiles [6]. The Open Handset Alliance (OHA) is a
consortium whose goal is to develop open standards for mobile devices, promote innovation
in mobile phones and provide a better experience for consumers at a lower cost [7]. Android
was therefore designed to run on devices from multiple manufacturers. Due to the popularity
and diversity of Android, attackers are targeting this operating system. With a reported
market share of 85% across the world, Android is the mobile operating system most heavily
targeted by malware [8]. Attackers target these devices to compromise the confidentiality and
integrity of users’ data.

Figure 1.2: Trojan attack on Android

Android-based malware detection is important because the number of Android users has grown to
1.6 billion [1]. This large number of users includes non-professionals (laypeople). A
layperson might know about the threats and risks, but most probably does not know about the
vulnerabilities of the OS and how those vulnerabilities can be exploited.

Different kinds of malware perform different malicious activities. Some of the popular types
of Android-based malware are:

● Trojans that run on the Android operating system are usually either specially crafted
programs designed to look like desirable software (e.g., games, system updates or
utilities), or copies of legitimate programs that have been repackaged to include harmful
components [9].
● Keyloggers record the keystrokes of the user.
● Ransomware takes significant user data such as photos, documents and videos and encrypts
it, then demands that a ransom be paid to the malware makers.
● Spyware enables attackers to access all the information on the phone, including contacts,
calls, texts and other sensitive information, and can also hijack the microphone and
camera [10].

So it is important to perform a security check-up and detect whether the system is under
attack from any malware.

As Android users are increasing rapidly and cybercriminals are increasingly interested in
Android-based devices, detection of Android-based malware is important to make Android
devices more secure. In this project we will focus on the analysis and detection of
Android-based malware. The malware detection process is done in three steps: malware
analysis, feature extraction/selection and classification/detection. For malware analysis,
the following four methods are used [11]:

● Static malware analysis
● Dynamic malware analysis
● Hybrid malware analysis
● Memory malware analysis

Static malware analysis is done without actually running the executable files. Static
analysis is basically signature identification: querying cryptographic hashes and strings.
It consumes fewer resources because the malware is not executed on the machine during
analysis. At the same time, static analysis cannot detect malware that uses code
obfuscation, because the known signatures are no longer found.
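
To make the idea of signature identification concrete, the following is a minimal Python sketch, not the tooling used in this project. It hashes a file's bytes with SHA-256, looks the hash up in a set of known malware hashes, and extracts printable strings. The function names, the sample bytes and the hash set are hypothetical and exist only for illustration.

```python
import hashlib
import re


def sha256_of(data):
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def printable_strings(data, min_len=4):
    """Extract runs of printable ASCII characters of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


def is_known_malware(data, known_hashes):
    """Signature check: exact match of the file hash against known hashes."""
    return sha256_of(data) in known_hashes


# Hypothetical sample bytes standing in for the contents of an APK file.
sample = b"\x00\x01http://evil.example/c2\x00\xffsendTextMessage\x00ab"

# Hypothetical signature database containing the hash of that sample.
known_hashes = {sha256_of(sample)}

print(is_known_malware(sample, known_hashes))
print(printable_strings(sample))
```

Changing a single byte of the sample changes its hash, which illustrates why obfuscated or repackaged malware can escape this kind of exact-match check.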

The second analysis technique is dynamic malware analysis, which is also known as behavior
analysis. In this technique the malware is executed in a controlled environment, such as a
virtual machine or emulator, and its behavior is then analyzed, for instance by examining
API calls and system calls. This technique can detect unknown and new malware as well as
obfuscated code, but it takes considerable resources to execute the malware and analyze its
behavior, and it cannot detect zero-day malware.

The hybrid malware analysis technique is the combination of static and dynamic analysis.
Both techniques have their own limitations: static analysis is cheap but cannot handle code
obfuscation, while dynamic analysis is resource-consuming but can detect new variants of
malware. By using a hybrid approach these limitations can be overcome, so malware can be
detected with more accuracy.

The fourth analysis technique is memory analysis, which is becoming popular for Android-based
malware analysis as it provides a more comprehensive analysis of malware by observing code
and memory images. This technique is based on memory forensics. The malware is executed and,
after execution, memory images are analyzed to get information about running programs.
Features extracted with this technique provide results with more accuracy, and it can detect
API hooking, DLL injection and hidden processes.

After analysis and feature selection, the next step is the detection of malware. Detection
techniques for Android-based malware [12] can be categorized into several types, but here we
will discuss the three main categories:

● Signature based
● Behavior based
● Heuristic based

In signature-based detection techniques, malware is detected by pattern (signature) matching,
which makes it a fast detection technique with a low error rate. However, it cannot detect
unknown variants of malware, and it requires a lot of manpower to keep the signatures up to
date.

Behavior-based detection techniques detect malware by observing the behavior of an executable
and analyzing its functionality. In this method, the behavior of the executable under
inspection is compared with the behavior of existing malware executables. Thus, this method
detects new variants of malware efficiently. However, it cannot detect zero-day malware
efficiently, and it also requires manual work.

In heuristic-based detection techniques, both behavior and signature features of the
executable are used to detect malware. Other hybrid features such as API calls and n-grams
are also used. In this technique, data mining and machine learning are used for detection and
classification. This technique is helpful in detecting new variants of malware as well as
zero-day malware.
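
As an illustration of the n-gram features mentioned above, the following Python sketch counts overlapping byte n-grams and turns them into a fixed-length feature vector that a classifier could consume. It is a simplified example under stated assumptions: the function names and the small vocabulary are hypothetical, and real systems usually build n-grams over opcodes or API call sequences rather than raw bytes.

```python
from collections import Counter


def byte_ngrams(data, n=2):
    """Count every overlapping window of n bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def feature_vector(data, vocabulary, n=2):
    """Map data to counts of the n-grams listed in a fixed vocabulary."""
    counts = byte_ngrams(data, n)
    return [counts[gram] for gram in vocabulary]


# Hypothetical vocabulary of 2-grams chosen during training.
vocabulary = [b"ab", b"ba", b"zz"]

print(feature_vector(b"abab", vocabulary))
```

Keeping the vocabulary fixed means every sample is mapped to a vector of the same length, which is what most classification algorithms expect.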

Several datasets are publicly available. The one we will use for training and testing
machine learning algorithms is the CIC-MalDroid-2020 dataset. This dataset is recent and
large, so it is well suited to malware analysis. In this project we will implement different
machine learning algorithms on the selected feature vector. We will calculate the accuracies
of these algorithms on the dataset and compare the results to see which ML algorithm
performs with better accuracy. We will do static and dynamic analysis of malware by
configuring a virtual environment. We will also analyze which technique is better among
heuristic-based, behavior-based and static analysis. We will use VirusTotal (a well-known
tool for malware detection) to cross-check whether our detection is correct.

1.2. Literature Review


Following are the topics that helped us in understanding our project.

1.2.1 Vulnerabilities in Android Applications

The vulnerabilities of the Android architecture are exploited by cyber attackers to target
smartphones, smartwatches, etc. In this section, we discuss the top vulnerabilities of
Android applications. The information is taken from recent blogs and articles, which show
that the vulnerabilities described below currently exist in Android applications.

1.2.1.1 Binary protection:

The process of removing the limitations or restrictions of Android running on a tablet or
mobile is called rooting. Rooting or jailbreaking a device bypasses the data safety and
encryption schemes of the system. On a standard Android configuration, no app can
access any other app’s data, no matter how many permissions the app asks for.

Figure 1. 3- Gaining root access in Android

This all changes when we run an application as root [23]. A person with access to the
root file system can disable critical system apps, delete critical system files, and thus
prevent the normal functioning of the device. A person who has a clear idea of how to use
a rooted device just needs to be more careful, but for a non-technical user the lack of
root access in Android is a protection.

Gaining root access also requires bypassing the security restrictions put in place by the
Android operating system [24]. For example, millions of gamers play PUBG Mobile. The
number of gamers using hacks to get an advantage over other gamers is increasing rapidly.
Hacks such as sharpshooter only work on rooted devices, and the hack APKs are downloaded
from untrusted sources, so the gamer does not know how malicious those APKs might be.

1.2.1.2. Insufficient Transport Layer Protection

Applications may use TLS/SSL during authentication but often fail to encrypt network
traffic elsewhere, even when it is vital to protect sensitive communications such as a
plain-text session ID [25]. Encryption should be used for all authenticated connections,
especially Internet-accessible web pages.

Figure 1. 4 - Session hijacking [26]

The above figure shows cross-site scripting (XSS) session hijacking, where an attacker
has exploited server vulnerabilities and injected a malicious client-side script into the
web pages. When a user is authenticated on the server, the server returns page code that
contains the injected script. When the compromised page is loaded, the malicious code
executes on the user's side. If the HttpOnly attribute is not set by the server on session
cookies, the injected script can read the session key and send the session cookies to
the attacker.
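
The server-side mitigation described above can be sketched in Python with the standard `http.cookies` module. This is a minimal illustration, not the code of any particular application: the cookie name `session_id`, the example value and the helper name are hypothetical. The `HttpOnly` flag asks the browser to hide the cookie from client-side scripts, and `Secure` restricts it to HTTPS connections.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id):
    """Build a Set-Cookie value for a session cookie hidden from scripts."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id        # hypothetical cookie name
    cookie["session_id"]["httponly"] = True  # not readable via document.cookie
    cookie["session_id"]["secure"] = True    # only sent over HTTPS
    return cookie["session_id"].OutputString()


header_value = build_session_cookie("abc123")
print("Set-Cookie:", header_value)
```

With this flag set, an injected script can still run in the page, but it can no longer read the session cookie directly, so fixing the injection itself remains necessary.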

1.2.1.3. Cryptography-Improper Certificate Validation

The application is either not validating SSL/TLS certificates, or is using an SSL/TLS
certificate validation scheme that cannot reliably verify that a trusted provider issued
the certificate. The client must be configured to drop the connection if the certificate
cannot be verified or is not provided [25]. Any data exchanged over a connection in which
the certificate is not validated can be exposed to unauthorized access or modification.
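
A client that enforces certificate validation can be sketched with Python's standard `ssl` module. This is an illustrative example rather than Android code, and the helper name is hypothetical. `ssl.create_default_context()` already enables certificate and hostname verification; the two assignments below only make that requirement explicit.

```python
import ssl


def strict_tls_context():
    """Return a client-side TLS context that rejects unverifiable certificates."""
    context = ssl.create_default_context()
    # Both settings are already the defaults of create_default_context();
    # they are repeated here so the intent is visible in the code.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


context = strict_tls_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

When such a context wraps a socket with `context.wrap_socket(sock, server_hostname=host)`, the handshake fails with an `ssl.SSLError` subclass if the certificate cannot be verified, so the connection is dropped instead of continuing unprotected.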

Figure 1. 5 - Man in the Middle SSL attack

A user requests a website and the request is intercepted by the attacker. The attacker
establishes an SSL session with the legitimate site using his own private key. The server
treats the attacker as an ordinary client, so the site responds with its SSL certificate.
The attacker then creates a fake certificate and sends it to the user. If the certificate
is accepted on the user’s side, the user will keep communicating with the attacker,
assuming it is the server.

1.2.1.4. Brute Force – User Enumeration

Inadequate authorization occurs when an application does not perform proper authorization
checks to ensure that a user is performing a function or accessing data in accordance with
a security policy. Authorization and authentication procedures must determine what a user,
service, or application is allowed to do.

Some applications do not perform multi-factor authentication. These applications are more
vulnerable to attacks. Brute forcing means trying multiple usernames and passwords to find
out whether a user’s account password can be guessed. User enumeration means obtaining
useful information about a user's account from the app, which makes the attacker's job
easier [25]. User enumeration therefore enables attacks that can compromise authentication
and authorization.

Figure 1. 6 - Compromising authentication by user enumeration using a Brute force attack

For example, an attacker enters different usernames and passwords into a banking app. For
one username, the application reports that the password is incorrect. The attacker now
knows that the username is valid and that he only needs to guess its password. User
enumeration has happened and his search space has been reduced. He will now try different
passwords using malware that generates different password combinations. If this brute
force attack succeeds, authentication/authorization will be compromised and he will gain
unauthorized access.
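
One common way to counter the scenario above is to return the same error for an unknown username and for a wrong password, and to limit repeated failures. The following Python sketch shows the idea under stated assumptions: the `LoginService` class, the messages and the threshold of five attempts are hypothetical choices for illustration, not part of the project.

```python
import hashlib
import hmac
import os

GENERIC_ERROR = "Invalid username or password."
LOCKED_ERROR = "Too many attempts. Try again later."
MAX_ATTEMPTS = 5  # hypothetical lockout threshold


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


class LoginService:
    def __init__(self):
        self._users = {}     # username -> (salt, password hash)
        self._failures = {}  # username -> consecutive failed attempts

    def register(self, username, password):
        salt = os.urandom(16)
        self._users[username] = (salt, hash_password(password, salt))

    def login(self, username, password):
        # Failures are counted for any username, existing or not, so the
        # lockout message does not reveal which accounts exist.
        if self._failures.get(username, 0) >= MAX_ATTEMPTS:
            return LOCKED_ERROR
        # Unknown users get a dummy salt so the same hashing work is done.
        salt, stored = self._users.get(username, (b"\x00" * 16, b""))
        candidate = hash_password(password, salt)
        if username in self._users and hmac.compare_digest(candidate, stored):
            self._failures.pop(username, None)
            return "OK"
        self._failures[username] = self._failures.get(username, 0) + 1
        return GENERIC_ERROR


service = LoginService()
service.register("alice", "s3cret")
print(service.login("alice", "wrong"))  # same message as for an unknown user
print(service.login("bob", "wrong"))
```

Because both failure cases look identical to the caller, the attacker no longer learns which usernames are valid, and the attempt limit slows down the brute force step that would follow.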

1.2.2 Vulnerable Android Libraries

The vulnerabilities of Android libraries can be used to perform malware attacks on systems.
Malware can cause severe damage once it enters a system, so its detection is important. The
two vulnerabilities in Android libraries described below were discovered in the past.

1.2.2.1. Android package installer

The Android package installer was unable to verify the validity of certificates because
certificate chain verification was not done properly. The application certificate is
verified before installation, but an identity could claim to be issued by another identity,
so a malicious certificate appeared to be a verified one.

All applications have a unique identity but, due to improper certificate validation, there
was a vulnerability that allowed an application to copy the identity of another
application. In this way, malicious applications were able to copy the identity of a
legitimate application. It was called the Fake ID vulnerability [25].

1.2.2.2. Android Browser AOSP

There was a vulnerability in the Android AOSP browser through which hackers bypassed the
Same Origin Policy (SOP). SOP is a security mechanism that allows scripts to access
information from the site they originated from, but not information from pages of another
site. A web application is thus prevented from getting information from another tab
currently opened by the user. Due to this vulnerability, hackers were able to get
sensitive information of the user present in other tabs opened by him/her. This was done
by sending a malformed javascript: URL handler with a null byte, which led to the SOP
not being enforced [26]. The AOSP browser is no longer part of Android devices, so the
problem has been resolved.
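
The origin comparison at the heart of the Same Origin Policy can be sketched in a few lines of Python. This is a simplified illustration, not the browser's implementation: it treats two URLs as same-origin when scheme, host and port match, and the helper names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Default ports used when a URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def same_origin(url_a, url_b):
    """True when both URLs share scheme, host and port."""
    return origin(url_a) == origin(url_b)


print(same_origin("https://bank.example/account", "https://bank.example:443/login"))
print(same_origin("https://bank.example/", "https://evil.example/"))
```

A script loaded from the second site in the last call would be denied access to data from the first, which is the check the vulnerable browser failed to enforce.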

1.3. Existing Examples / Solutions

Machine learning algorithms have been broadly used for malware detection, as they detect
malware with high accuracy [11]. Various machine learning algorithms have been applied to
different malware datasets. Mostly ML classification algorithms are used, as we need to
classify samples as malware or benign. Static and dynamic analysis have also been used for
malware detection. The most recent work on our selected dataset was done in 2020, in which
four machine learning algorithms (Random Forest, Decision Tree, Naïve Bayes and K-nearest
neighbors) were applied and their accuracies were reported.
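
To show what classifying samples and reporting accuracy means in practice, the following is a small, self-contained Python sketch of K-nearest neighbors, one of the algorithms named above. It is a toy illustration using only the standard library, not the implementation or data used in the cited work: the two-dimensional feature values and labels are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up training data: two numeric features per sample.
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["benign", "benign", "benign", "malware", "malware", "malware"]

# Made-up test data with known labels.
test_x = [(0.5, 0.5), (5.5, 5.5), (1, 1)]
test_y = ["benign", "malware", "benign"]

predictions = [knn_predict(train_x, train_y, q) for q in test_x]
print(predictions, accuracy(test_y, predictions))
```

The same `accuracy` function can be applied to the predictions of any classifier, which is what makes a side-by-side comparison of algorithms possible.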

Our contribution is that we are using a recent dataset, the MalDroid-2020 dataset. Its
advantage is that, since new malware keeps being introduced by attackers, this recent
dataset has records of the latest malware samples. The features that are strongly
correlated with malware detection will be chosen for the feature vector. We will apply
different ML algorithms to check their accuracies and compare them. We will do
signature-based and behavior-based analysis of malware by configuring different malware
detection tools. We will also analyze which technique is better among machine learning
techniques, dynamic analysis and static analysis. We will develop a desktop application
that lets users prune the dataset, apply machine learning algorithms and optimize or tune
the hyperparameters. This application will be useful because users can run their own
experiments on our selected dataset and get accuracy results and visual graphs too.
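
Choosing features by their correlation with the class label can be sketched as follows. This is a minimal Python example under stated assumptions, not the project's actual feature selection code: it uses the Pearson correlation coefficient, and the feature names and values are invented for illustration (label 1 stands for malware, 0 for benign).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, labels):
    """Sort features by the absolute value of their correlation with labels."""
    scores = {name: pearson(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented feature columns, one value per sample.
columns = {
    "sms_send_calls": [0, 1, 0, 5, 7, 6],
    "file_reads": [3, 4, 3, 4, 3, 4],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
labels = [0, 0, 0, 1, 1, 1]

ranked = rank_features(columns, labels)
for name, score in ranked:
    print(name, round(score, 3))
```

Features at the top of the ranking are candidates for the feature vector, while features with a correlation near zero can be pruned.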

1.4. Stakeholders

The stakeholders of this system can be observed as:

1.4.1. Malware analysts

A malware analyst is a person whose job is to identify, examine, and understand various
forms of malware and their delivery methods. Malware includes different types of adware,
bots, bugs, rootkits, spyware, ransomware, Trojan horses, viruses, and worms. Malware
analysts disassemble and reverse engineer the malicious code after the organization’s
incident response team has identified an attack. This product will help malware analysts
in analyzing malware applications [13]. If the feature vector is the same as the one on
which we train and test the ML algorithms, malware analysts can detect malware through the
ML model.

1.4.2. Android mobile users

Android mobile users can be categorized by age group or by whether they are professionals
or non-professionals. The category most vulnerable to malware apps is people who download
APKs from the mobile browser and have no idea that these APKs might be malicious. In
future, this product can be converted into a website which takes an APK and detects whether
it is malicious or benign, so that mobile users can verify whether a particular app is
malicious. Researchers from the University of Cambridge [14] found that 87% of all Android
smartphones are exposed to at least one critical vulnerability. So almost all Android
mobile users are stakeholders [15].

1.4.3. Researchers

Researchers in the field of malware analysis and detection from all over the world can
benefit from this research paper, as we will be applying different machine learning
algorithms for detecting malware and comparing their accuracies. Moreover, we will
perform static and dynamic analysis on malware APKs and compare the results of this
analysis with the machine learning models, so researchers can build on these findings too.

1.5. Business scope

The business scope of this software will be:

● This product will be useful to malware analysis labs in future, as it can be used for the
detection of malware applications, and those labs can buy the licensed product once it is
available in the market. They can also detect malware apps through their own techniques
and cross-check them with ours to verify whether they have detected malware correctly.

● This product will be useful for researchers, who can do research on our selected dataset
and compare the results of our experiments with their own. Moreover, we will perform
static and dynamic analysis on malware APKs and compare the results of this analysis with
the machine learning models, so researchers can build on these findings too.

● This project can be converted into a product in future which can be provided to clients
on a monthly subscription basis. A client can be any Android mobile user who wants to
check whether an app is malicious or not.

1.6. Useful Tools and Technologies

● Python provides machine learning libraries, many frameworks, and extensions which make
the implementation of machine learning algorithms easy. Python is also highly valued by
cyber security experts, as it is used in penetration testing, malware detection, etc. [16].
So, we will use the Python language for implementing machine learning algorithms for
malware detection.

● Jupyter Notebook is easy to use as it provides code, output and explanations (text) in a
single document. We will use Jupyter Notebook for the implementation of ML algorithms, as
it is well suited to this kind of code implementation.

● Kali Linux is a safe environment for testing, so we will perform static and dynamic
analysis of APKs in Kali Linux using Apktool, dex2jar, and JD-GUI.

● Spyder is an easy-to-use Python IDE, so we will use Spyder for the interface
development.

● Google Colab notebooks use cloud resources, so we will use them for the implementation of
machine learning algorithms.

● Implementation of machine learning algorithms requires a lot of processing power, so the
end product of this project, which will analyze Android-based malware, will run on the
Windows operating system.

1.7. Project work breakdown

Figure 1.7: Project work breakdown

1.8. Timeline

Figure 1.8: Project timeline

Chapter 2

Requirement Specification and Analysis

The process of determining user expectations for a new or modified product is known as
requirement analysis. These characteristics, referred to as criteria, must be quantitative,
relevant, and specific. Functional requirements are a term used in software engineering to
describe such requirements.

2.1. Functional Requirements


Functional requirements define the functionalities of a system or its components. Functional
requirements may be calculations, technical details, data manipulation and processing, and
other specific functionality that define what a system is supposed to accomplish. The list of
functional requirements of our project is as follows:

Table 2. 1: Functional Requirements

Sr. No. | Functional requirements | Type | Status

Administrator
1 | Administrator can view the reports of an individual user and the role-based reports of the users. | Core | Completed

Malware Analyst
2 | Malware analyst can view the dataset. | Core | Completed
3 | Malware analyst can prune the dataset. | Core | Completed
4 | Malware analyst can save the pruned dataset. | Core | Completed
5 | Malware analyst can upload CSV for prediction. | Intermediate | Completed
6 | Malware analyst can select classification algorithm. | Intermediate | Completed
7 | Malware analyst can view the accuracies of implemented ML models and their accuracy graphs. | Core | Completed
8 | Malware analyst can see the correlation of all features with the class label. | Core | Completed
9 | Malware analyst can view a correlation graph. | Core | Completed
10 | Malware analysts can view dynamic accuracy graphs during hyper parameter tuning. | Core | Completed
11 | Malware analysts can take guidance from the help manual. | Intermediate | Completed
12 | Malware analysts can view the visual results of cross validation. | Core | Completed

Researcher
13 | Researcher can view the dataset. | Core | Completed
14 | Researcher can view the details of the dataset. | Intermediate | Completed
15 | Researcher can see the accuracies of implemented ML algorithms. | Core | Completed
16 | Researcher can take guidance from the help manual. | Intermediate | Completed
17 | Researcher can view the visual results of cross validation. | Core | Completed

Mobile User
18 | Mobile users can view mobile protection guidelines. | Intermediate | Completed
19 | Mobile users can view information regarding malware types. | Intermediate | Completed
20 | Mobile users can view information regarding mobile suspicious behavior. | Intermediate | Completed
21 | Mobile user can view the visual results of cross validation. | Core | Completed

2.2. Non-Functional Requirements


A non-functional requirement is one that sets criteria that can be used to evaluate a system's
functioning rather than specific behavior. Functional requirements, on the other hand, define
precise behavior or functions. The following is the list of non-functional requirements:

Table 2. 2: Non-Functional Requirements

S. No. | Non-Functional Requirements | Category
1 | Degree of the authenticity of the data. | Reliability
2 | Results of implemented ML algorithms should be accurate. | Accuracy
3 | Selected dataset should be the latest. | Usability

2.3. Selected Functional Requirements


The selected functional requirements for 1st iteration were completed in FYP Final Part 1.
The following is the list of the functional requirements completed for the 1st iteration:

Table 2. 3: Selected Functional Requirements

S. No. | Selected Functional Requirement | Type | Status
1 | Malware analyst can view a correlation graph. | Core | Completed
2 | Malware analysts can view dynamic accuracy graphs during hyper parameter tuning. | Core | Completed
3 | Malware analyst can view the visual results of cross validation. | Core | Completed
4 | Mobile user can view the visual results of cross validation. | Core | Completed
5 | Researcher can view the visual results of cross validation. | Core | Completed
6 | Malware analyst can take guidance from the help manual. | Intermediate | Completed
7 | Researcher can take guidance from the help manual. | Intermediate | Completed

2.4. System Use Case Modeling


System use case modeling is used to demonstrate how various types of users interact with a
system. It describes the user’s goals, the interactions between the users and the system, and
the system's required behavior in order to achieve those goals. Use case diagram of our
system is shown as follows:

2.4.1. Admin Use-case Diagram

The use-case diagram in figure 2.1 for admin is shown below:

Figure 2. 1 - Admin use case diagram

Table 2.4 describes the ‘Signup’ use case.

Table 2. 4 - Use case 1 - Signup

Use Case ID: Uc1
Use Case Name: Signup
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 4/11/2021
Last Revision Date: 28/2/2022
Actors: Malware analyst, mobile user and researcher
Description: The user can sign up the first time he/she uses the system by providing a
username, password, email address and selecting a role.
Trigger: Signup button
Preconditions: The user provides a username, password and email address, chooses a role and
clicks on the sign-up button.
Post conditions: The user will be signed up to the system and will now be able to use the
system.
Normal Flow (User | System):
1: User will be signed up to the system and now he/she will be able to use the system. | The system provides a sign-up form for the user.
2: User fills in the form by providing a username, password, and address. | System signs up the user.
Alternative Flows: User cancels the current form.
Exceptions:
1. The system is not responding.
2. The database is not responding.
3. User has not filled the form correctly.

Table 2.5 describes the ‘Sign in’ use case.

Table 2. 5 - Use case 2 - Sign in

Use Case ID: Uc2
Use Case Name: Sign in
Created By: Abdullah Arif
Last Updated By: Quratulain Tariq
Date Created: 4/11/2021
Last Revision Date: 28/11/2021
Actors: Admin, malware analyst, mobile user and researcher
Description: User will sign into the system by providing username and password.
Trigger: Sign-in button
Preconditions: User provides username and password and then clicks on the sign-in button.
Post conditions: User will be signed in to the system.
Normal Flow (User | System):
1: User will click the sign-in button to request the sign-in form. | The system will provide the user with the sign-in form.
2: The user will fill out the form by providing a username and password. | The system will allow the user to log in to the system.
Alternative Flows: User will cancel the current form.
Exceptions:
1. The database is not responding.
2. User has not filled the form correctly.
3. System is not responding.

Table 2.6 describes the ‘View individual report’ use case, in which the report of a single user is shown. These details can be viewed by the system admin only.

Table 2.6 - Use Case 3 View individual report

Use Case ID: Uc3
Use Case Name: View individual report
Created By: Abdullah Arif
Last Updated By: Safia Mansoor
Date Created: 18/11/2021
Last Revision Date: 10/12/2021
Actors: Admin
Description: Admin can view the record of an individual user.
Trigger: View individual report option
Preconditions: Admin must be logged in to his/her account and select the view individual report option.
Post conditions: The record of the user is shown.
Normal Flow:
1. User: Admin clicks on the view individual report option.
   System: Shows an option to enter the name of the user whose record the admin wants to see.
2. User: Admin enters the name of the user.
   System: Shows the complete record of the user in tabular form.
Alternative Flows:
1. Cancel the view individual report option.
2. The user logs out of the account.
Exceptions:
1. View individual report option is not responding.
2. Database is not responding.

Table 2.7 describes the ‘View general report’ use case, in which the records of all users are shown. These details can be viewed by the system admin only.

Table 2.7 - Use Case 4 - View general report

Use Case ID: Uc4
Use Case Name: View general report
Created By: Abdullah Arif
Last Updated By: Safia Mansoor
Date Created: 18/11/2021
Last Revision Date: 10/12/2021
Actors: Admin
Description: Admin can view the records of all the users of the system.
Trigger: View general report option
Preconditions: Admin clicks on the view general report option to view the records of the users.
Post conditions: The records of the users are shown.
Normal Flow:
1. User: Admin clicks on the view general report option.
   System: Shows all user categories, i.e., malware analyst, mobile user and researcher.
2. User: Admin selects one of the categories, for instance malware analyst.
   System: Displays all the records for malware analyst accounts.
Alternative Flows:
1. Cancel the view records option.
2. The user logs out of the account.
Exceptions:
1. View records page is not responding.
2. Database is not responding.

2.4.2. Mobile-user Use-case Diagram

The use-case diagram for the mobile user is shown below in Figure 2.2.

Figure 2.2 - Mobile user use case diagram

Table 2.8 describes the ‘View mobile’s behavior information’ use case, which covers the information shown to the mobile user.

Table 2.8 - Use Case 5 View mobile’s behavior information

Use Case ID: Uc5
Use Case Name: View mobile’s behavior information
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 4/11/2021
Last Revision Date: 28/11/2021
Actors: Mobile user
Description: The mobile user can view information about the suspicious behavior of a mobile after a malware attack.
Trigger: Select the option for mobile’s behavior information
Preconditions: The mobile user is shown the option for mobile’s behavior information.
Post conditions: The mobile user is shown all the information regarding the mobile’s behavior.
Normal Flow:
1. User: The mobile user selects the option to view mobile’s behavior information.
   System: Displays all the information about how the mobile’s behavior becomes suspicious after a malware attack.
Alternative Flows: Cancel the view mobile’s behavior information option.
Exceptions:
1. The view mobile’s behavior information page is not responding.

Table 2.9 describes the ‘View information of types of malwares’ use case, which covers the information shown to the mobile user.

Table 2.9 - Use Case 6 View information of types of malwares

Use Case ID: Uc6
Use Case Name: View information of types of malwares
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 5/11/2021
Last Revision Date: 10/12/2021
Actors: Mobile user
Description: The mobile user can view information about the types of malware.
Trigger: Select the option for information of types of malwares
Preconditions: The mobile user is shown the option for information of types of malwares.
Post conditions: The mobile user is shown all the information about the types of malware.
Normal Flow:
1. User: The mobile user selects the option to view information of types of malwares.
   System: Displays all the common types of Android malware.
Alternative Flows: Cancel the view information of types of malwares option.
Exceptions:
1. The view information of types of malwares page is not responding.

Table 2.10 describes the ‘View guidelines for mobile’s protection’ use case, which covers the information shown to the mobile user.

Table 2.10 - Use Case 7 View guidelines for mobile’s protection

Use Case ID: Uc7
Use Case Name: View guidelines for mobile’s protection
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 10/11/2021
Last Revision Date: 10/12/2021
Actors: Mobile user
Description: The mobile user can view the guidelines for mobile’s protection.
Trigger: Select the option to view the guidelines for mobile’s protection
Preconditions: The mobile user is shown the option to view the guidelines for mobile’s protection.
Post conditions: The mobile user is shown all the guidelines for mobile’s protection.
Normal Flow:
1. User: The mobile user selects the option to view guidelines for mobile’s protection.
   System: Displays all the guidelines for mobile’s protection.
Alternative Flows: Cancel the view guidelines for mobile’s protection option.
Exceptions:
1. The view guidelines for mobile’s protection page is not responding.

Table 2.11 describes the ‘View cross-validation results’ use case, which covers how the cross-validation results are shown to the mobile user, researcher and malware analyst.

Table 2.11: Use Case 8 View cross-validation results

Use Case ID: Uc8
Use Case Name: View cross-validation results
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 20/10/2022
Last Revision Date: 20/10/2022
Actors: Mobile user, researcher and malware analyst
Description: Users can view the cross-validation results in visual form.
Trigger: Select the option ‘cross validation’.
Preconditions: The user must be logged in as mobile user, researcher or malware analyst.
Post conditions: The user is shown a graph.
Normal Flow:
1. User: Selects the option ‘cross validation’.
   System: Displays a graph showing the results of static analysis, dynamic analysis, and the SVM, KNN, RF and Decision tree algorithms.
Alternative Flows: The user logs out.
Exceptions: 1. The ‘cross validation’ option is not responding.
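
The use case above only states that cross-validation results are displayed; it does not describe how they are computed. As a rough illustration of what a k-fold cross-validation score behind such a graph could look like, the sketch below splits the sample indices into k folds and averages the accuracy over them. The function names and the `fit_predict` callback are hypothetical, and the folds are contiguous and unshuffled to keep the example short; a real run would normally shuffle the data first.

```python
from statistics import mean


def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(features, labels, k, fit_predict):
    """Average accuracy of `fit_predict` over k folds.

    `fit_predict(train_x, train_y, test_x)` is a caller-supplied function
    (an assumption of this sketch) that returns one predicted label per
    row of `test_x`.
    """
    scores = []
    for train, test in k_fold_indices(len(features), k):
        predicted = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        correct = sum(p == labels[i] for p, i in zip(predicted, test))
        scores.append(correct / len(test))
    return mean(scores)
```

Passing the classifier in as a callback keeps the splitting logic independent of the algorithm, so the same routine could score each of the algorithms named in the use case.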

2.4.3. Malware Analyst Use-case Diagram

The use-case diagram for the malware analyst is shown below in Figure 2.3.

Figure 2.3 - Malware Analyst use case diagram

Table 2.12 describes the ‘View dataset’ use case, which covers how the malware analyst and researcher can view the complete dataset.

Table 2.12 - Use Case 9 View dataset

Use Case ID: Uc9
Use Case Name: View dataset
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 2/11/2021
Last Revision Date: 28/11/2021
Actors: Malware analyst and researcher
Description: The malware analyst and researcher can view the dataset.
Trigger: Select the option to view dataset.
Preconditions: The user must be logged in as malware analyst or researcher.
Post conditions: The dataset is shown to the malware analyst or researcher.
Normal Flow:
1. User: The malware analyst or researcher selects the option to view dataset.
   System: Displays the complete dataset.
Alternative Flows: Cancel the view dataset option.
Exceptions:
1. The view dataset page is not responding.
2. Database is not responding.

Table 2.13 describes the ‘View correlation of features’ use case, which covers how the malware analyst can view the correlation between the features of the dataset and the class label.

Table 2.13 - Use Case 11 View correlation of features

Use Case ID: Uc11
Use Case Name: View correlation of features
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 4/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can see the correlation of all features with the class label.
Trigger: Prune dataset button
Preconditions: The user must be logged in as malware analyst and viewing the prune dataset option.
Post conditions: The malware analyst sees the correlation value of all features.
Normal Flow:
1. User: The malware analyst clicks on the prune dataset option.
   System: Shows the complete dataset along with the correlation values for the features.
Alternative Flows: The malware analyst logs out.
Exceptions: 1. The system is not responding.
2. The database is not responding.

Table 2.14 describes the ‘Sort by correlation’ use case, which covers how the malware analyst can sort the features by correlation.

Table 2.14 - Use Case 12 Sort by correlation

Use Case ID: Uc12
Use Case Name: Sort by correlation
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 5/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can sort all the features on the basis of their correlation value.
Trigger: Sort by correlation button
Preconditions: The user must be logged in as malware analyst and viewing the prune dataset option.
Post conditions: All the features are sorted according to their correlation value.
Normal Flow:
1. User: The malware analyst clicks on the sort by correlation button.
   System: Sorts all the features according to their correlation values.
Alternative Flows: The malware analyst logs out.
Exceptions: 1. The system is not responding.
2. The database is not responding.
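
Tables 2.13 and 2.14 describe viewing and sorting the correlation of each feature with the class label, but the report does not state here which correlation coefficient is used. The sketch below assumes the Pearson coefficient and ranks features by absolute value; the function names and the idea of rows stored as dictionaries are hypothetical choices made for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def correlations_with_label(rows, label):
    """Map each feature name to its correlation with the `label` column."""
    labels = [row[label] for row in rows]
    features = [name for name in rows[0] if name != label]
    return {
        name: pearson([row[name] for row in rows], labels) for name in features
    }


def sort_by_correlation(correlations):
    """Feature names ordered by absolute correlation, strongest first."""
    return sorted(
        correlations, key=lambda name: abs(correlations[name]), reverse=True
    )
```

Sorting by absolute value treats a strongly negative correlation as equally informative as a strongly positive one, which is one reasonable reading of "sort by correlation"; sorting by the signed value would be an equally valid alternative.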

Table 2.15 describes the ‘Prune dataset’ use case, which covers how the malware analyst can prune the features of the dataset.

Table 2.15 - Use Case 13 Prune dataset

Use Case ID: Uc13
Use Case Name: Prune dataset
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 9/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can prune the given dataset.
Trigger: Prune dataset button
Preconditions: The user must be logged in as malware analyst.
Post conditions: The dataset is pruned according to the choice of the malware analyst.
Normal Flow:
1. User: The malware analyst clicks on prune dataset.
   System: Shows the complete dataset and gives the malware analyst the option to select rows and columns of his/her choice.
2. User: The malware analyst selects the desired rows and columns.
   System: Shows the pruned dataset to the malware analyst.
Alternative Flows: The malware analyst does not perform pruning.
Exceptions:
1. The system is not responding.
2. The database is not responding.

Table 2.16 describes the ‘Select rows’ use case, which covers how the malware analyst can select rows while pruning the dataset.

Table 2.16 - Use Case 14 Select rows

Use Case ID: Uc14
Use Case Name: Select rows
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 9/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can select rows while pruning.
Trigger: Check box
Preconditions: The user must be logged in as malware analyst and must have selected the prune dataset option.
Post conditions: The dataset is pruned according to the selected rows.
Normal Flow:
1. User: The malware analyst clicks on prune dataset.
   System: Shows the complete dataset and gives the malware analyst the option to select rows and columns of his/her choice.
2. User: The malware analyst selects the desired rows.
   System: Shows the pruned dataset to the malware analyst.
Alternative Flows: The malware analyst does not perform pruning.
Exceptions:
1. The system is not responding.
2. The database is not responding.

Table 2.17 describes the ‘Select columns’ use case, which covers how the malware analyst can select columns while pruning the dataset.

Table 2.17 - Use Case 15 Select Columns

Use Case ID: Uc15
Use Case Name: Select columns
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 9/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can select columns while pruning.
Trigger: Check box
Preconditions: The user must be logged in as malware analyst and must have selected the prune dataset option.
Post conditions: The dataset is pruned according to the selected columns.
Normal Flow:
1. User: The malware analyst clicks on prune dataset.
   System: Shows the complete dataset and gives the malware analyst the option to select rows and columns of his/her choice.
2. User: The malware analyst selects the desired columns.
   System: Shows the pruned dataset to the malware analyst.
Alternative Flows: The malware analyst does not perform pruning.
Exceptions:
1. The system is not responding.
2. The database is not responding.

Table 2.18 describes the ‘Save pruned dataset’ use case, which covers how the malware analyst can save the dataset after pruning.

Table 2.18 - Use Case 16 Save pruned dataset

Use Case ID: Uc16
Use Case Name: Save pruned dataset
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 9/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can save/download the pruned dataset as a .csv file.
Trigger: Save dataset button
Preconditions: The user must be logged in as malware analyst and must have pruned the dataset using the prune dataset option.
Post conditions: The pruned dataset is saved.
Normal Flow:
1. User: The malware analyst clicks on save dataset.
   System: Saves the pruned dataset to his/her system as a .csv file.
Alternative Flows: The malware analyst logs out.
Exceptions: 1. The system is not responding.
2. Database is not responding
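
The pruning use cases (Tables 2.15 to 2.18) amount to keeping a chosen subset of rows and columns and writing the result to a .csv file. A minimal sketch of that idea, using only Python's standard `csv` module, is given below. The function names `prune` and `write_csv` and the list-of-dictionaries representation are assumptions of this sketch, not the project's actual implementation.

```python
import csv
import io


def prune(rows, keep_rows, keep_columns):
    """Keep only the chosen row indices and column names, in the given order."""
    return [
        {column: rows[index][column] for column in keep_columns}
        for index in keep_rows
    ]


def write_csv(rows, columns, handle):
    """Write `rows` to an open text handle as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

To save to disk, `write_csv` can be given a file opened with `open(path, "w", newline="")`; writing to an `io.StringIO` buffer instead is convenient for previewing the result before saving.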

Table 2.19 describes the ‘Upload csv’ use case, which covers how the malware analyst can upload a csv file to apply ML algorithms to it.

Table 2.19 - Use Case 17 Upload csv

Use Case ID: Uc17
Use Case Name: Upload csv
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 9/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can upload a csv file to apply ML algorithms to it.
Trigger: Upload csv button
Preconditions: The user must be logged in as malware analyst.
Post conditions: A csv file is uploaded.
Normal Flow:
1. Malware analyst: Clicks on upload a csv.
   System: Uploads the csv file.
2. Malware analyst: Selects a classification algorithm.
   System: Selects the classification algorithm.
3. Malware analyst: Selects the tune the algorithm option.
   System: Tunes the algorithm according to the requirements of the malware analyst.
4. Malware analyst: Selects the option to apply ML algorithms.
   System: Displays the results of the ML algorithms.
Alternative Flows: The malware analyst logs out.
Exceptions:
1. The system is not responding.
2. The csv file is not uploaded.
3. Database is not responding.

Table 2.20 describes the ‘Select classification algorithm’ use case, which covers how the malware analyst can select ML algorithms to apply to the dataset.

Table 2.20 - Use Case 18 Select classification algorithm

Use Case ID: Uc18
Use Case Name: Select classification algorithm
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 9/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can select the classification algorithm to be applied to the dataset.
Trigger: Select classification algorithm button
Preconditions: The user must be logged in as malware analyst and select a classification algorithm to be applied to the dataset.
Post conditions: The results of the ML algorithm on the dataset are shown.
Normal Flow:
1. Malware analyst: Clicks on select classification algorithm.
   System: Displays the names of all the algorithms provided by the system.
2. Malware analyst: Clicks on a classification algorithm.
   System: Shows the results of the applied machine learning algorithm.
Alternative Flows: The malware analyst logs out.
Exceptions:
1. The system is not responding.
2. Database is not responding.

Table 2.21 describes the ‘Tune ML algorithms’ use case, which covers how the malware analyst can tune the hyperparameters of the ML algorithms.

Table 2.21 - Use Case 19 Tune ML algorithms

Use Case ID: Uc19
Use Case Name: Tune ML algorithms
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 9/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst
Description: The malware analyst can tune the ML algorithms. Tuning involves changing the default value of a variable, changing the amount of testing and training data, etc.
Trigger: Tune dataset button
Preconditions: The user must be logged in as malware analyst and must select a classification algorithm first.
Post conditions: The ML algorithm is tuned.
Normal Flow:
1. Malware analyst: Selects a classification algorithm.
   System: Selects the classification algorithm.
2. Malware analyst: Selects the tune the algorithm option.
   System: Tunes the algorithm according to the requirements of the malware analyst.
Alternative Flows: The malware analyst logs out.
Exceptions: 1. The system is not responding.
2. Database is not responding
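
The description above names two kinds of tuning: changing a parameter from its default value and changing the amount of training and testing data. The sketch below illustrates both with a small hand-written k-nearest-neighbours (KNN) classifier, chosen only because it fits in a few lines without external libraries. All function names are hypothetical, the hold-out split is unshuffled for brevity, and nothing here is claimed to match the project's own tuning code.

```python
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Predict a label by majority vote among the k nearest training rows."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy_for(features, labels, k, test_fraction):
    """Hold out the last `test_fraction` of the rows and score KNN on them."""
    n_test = max(1, round(len(features) * test_fraction))
    split = len(features) - n_test
    train_x, train_y = features[:split], labels[:split]
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y
        for x, y in zip(features[split:], labels[split:])
    )
    return hits / n_test


def tune(features, labels, k_values, test_fraction=0.25):
    """Return {k: accuracy} so the values can be plotted or compared."""
    return {k: accuracy_for(features, labels, k, test_fraction) for k in k_values}
```

Returning one accuracy per candidate value gives exactly the data an accuracy-versus-parameter graph needs, and rerunning `tune` with a different `test_fraction` shows the effect of changing the train/test proportion.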

Table 2.22 describes the ‘View accuracies’ use case, which covers how the malware analyst and researcher can view the accuracies of the implemented ML algorithms.

Table 2.22 - Use Case 20 View accuracies

Use Case ID: Uc20
Use Case Name: View accuracies
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 19/3/2022
Last Revision Date: 28/3/2022
Actors: Malware analyst and researcher
Description: The malware analyst and researcher can view the accuracies of the implemented ML algorithms.
Trigger: View accuracies button
Preconditions: The user must be logged in as malware analyst or researcher.
Post conditions: The accuracies for the implemented ML algorithms are shown.
Normal Flow:
1. User: The malware analyst or researcher clicks on the view accuracies option.
   System: Shows the accuracies with complete results of the ML algorithms.
Alternative Flows: The malware analyst or researcher logs out.
Exceptions:
1. The system is not responding.
2. Database is not responding.

Table 2.23 describes the ‘View visual results’ use case, which covers how the malware analyst and researcher can view the results in the form of graphs.

Table 2.23 - Use Case 21 View visual results

Use Case ID: Uc21
Use Case Name: View visual results
Created By: Quratulain Tariq
Last Updated By: Quratulain Tariq
Date Created: 14/4/2022
Last Revision Date: 19/4/2022
Actors: Malware analyst and researcher
Description: The malware analyst and researcher can view the results of the implemented algorithms in visual form, such as graphs and charts.
Trigger: View visual results button
Preconditions: The user must be logged in as malware analyst or researcher.
Post conditions: The visual results for the implemented ML algorithms are shown.
Normal Flow:
1. User: The malware analyst or researcher clicks on the view visual results option.
   System: Shows the visual results of the ML algorithms.
Alternative Flows: The malware analyst or researcher logs out.
Exceptions:
1. The system is not responding.
2. Database is not responding.

Table 2.24 describes the ‘Take guidance from manual’ use case, which covers how the malware analyst and researcher can view the manual for help.

Table 2.24: Use Case 22 Take guidance from manual

Use Case ID: Uc22
Use Case Name: Take guidance from manual
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 20/10/2022
Last Revision Date: 20/10/2022
Actors: Researcher and malware analyst
Description: Users can view the manual for help.
Trigger: Select the option ‘manual’.
Preconditions: The user must be logged in as researcher or malware analyst.
Post conditions: The user is shown a help manual.
Normal Flow:
1. User: Selects the option ‘manual’.
   System: Displays a manual, which is different for each of the two user types.
Alternative Flows: The user logs out.
Exceptions:
1. The ‘manual’ option is not responding.

Table 2.25 describes the ‘View correlation graph’ use case, which covers how the malware analyst can view the correlation graph of all the features with the class label.

Table 2.25: Use Case 23 View correlation graph

Use Case ID: Uc23
Use Case Name: View correlation graph
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 20/10/2022
Last Revision Date: 21/2022
Actors: Malware analyst
Description: The malware analyst can view the correlation graph of all the features with the class label.
Trigger: Select the option ‘view correlation’.
Preconditions: The user must be logged in as malware analyst.
Post conditions: The user is shown a correlation graph.
Normal Flow:
1. User: Selects the option ‘dataset’.
   System: Displays a drop-down with further options.
2. User: Clicks on the ‘view correlation’ option.
   System: Shows a correlation graph in a new window.
Alternative Flows: The user logs out.
Exceptions:
1. The ‘view correlation’ option is not responding.
2. The ‘dataset’ option is not responding.

Table 2.26 describes the ‘View dynamic accuracy graph’ use case, which covers how the malware analyst can view the graphs while tuning the hyperparameters.

Table 2.26: Use Case 24 View dynamic accuracy graph

Use Case ID: Uc24
Use Case Name: View dynamic accuracy graph
Created By: Quratulain Tariq
Last Updated By: Safia Mansoor
Date Created: 22/10/2022
Last Revision Date: 25/10/2022
Actors: Malware analyst
Description: Users can see the accuracy graphs during hyperparameter tuning.
Trigger: Select the option ‘tuning’.
Preconditions: The user must be logged in as malware analyst.
Post conditions: The user is shown dynamic graphs during hyperparameter tuning.
Normal Flow:
1. User: Selects the option ‘tuning’.
   System: Displays an option to select an ML algorithm.
2. User: Selects an ML algorithm.
   System: Displays the tuning option.
3. User: Enters values of the hyperparameters.
   System: Displays an accuracy graph for those hyperparameters.
Alternative Flows: The user logs out.
Exceptions:
1. The ‘tuning’ option is not responding.

2.4.4. Researcher Use-case Diagram

The use-case diagram for the researcher is shown below in Figure 2.4.

Figure 2.4 - Researcher use case diagram

Table 2.27 describes the ‘View dataset details’ use case, which covers how the researcher can view the details of the CIC-Maldroid 2020 dataset.

Table 2.27 - Use Case 25 View dataset details

Use Case ID: Uc25
Use Case Name: View dataset details
Created By: Abdullah Arif
Last Updated By: Safia Mansoor
Date Created: 17/11/2021
Last Revision Date: 11/12/2021
Actors: Researcher
Description: The researcher can see the details of the dataset used.
Trigger: Select View Dataset Details option.
Preconditions: The researcher clicks on the View Dataset Details option to get an overview of the dataset used.
Post conditions: The dataset details are shown to the researcher.
Normal Flow:
1. User: The researcher opens the software and then clicks on the overview dataset button.
   System: Shows the details of the dataset.
Alternative Flows: Cancel the View Dataset Details option.
Exceptions: The page is not responding.

2.5. System Sequence Diagrams
The system sequence diagrams (SSDs) are shown below. Figure 2.5 is the SSD for the request to view the general report.

Figure 2.5 - SSD request to view general report

Figure 2.6 is the SSD for the request to view an individual report.

Figure 2.6 - SSD request to view individual report

Figure 2.7 is the SSD for the admin sign-in request.

Figure 2.7 - SSD admin request sign in

Figure 2.8 is the SSD for the admin sign-out request.

Figure 2.8 - SSD admin request sign out

Figure 2.9 is the SSD for mobile user sign up.

Figure 2.9: SSD for mobile user sign up

Figure 2.10 is the SSD for mobile user sign in.

Figure 2.10: SSD for mobile user sign in

Figure 2.11 is the SSD for mobile behavior information.

Figure 2.11: SSD for mobile behavior information

Figure 2.12 is the SSD for displaying malware types.

Figure 2.12: SSD to display malware types

Figure 2.13 is the SSD for mobile user sign out.

Figure 2.13: SSD mobile user Sign out

Figure 2.14 is the SSD for the mobile user request to view protection guidelines.

Figure 2.14 - SSD mobile user request to view protection guidelines

Figure 2.15 is the SSD for the mobile user request to view the user manual.

Figure 2.15: SSD mobile user request to view user manual

Figure 2.16 is the SSD for researcher signup.

Figure 2.16: SSD for researcher signup

Figure 2.17 is the SSD for researcher sign in.

Figure 2.17: SSD for researcher sign in

Figure 2.18 is the SSD for researcher sign out.

Figure 2.18: SSD Researcher Sign out

Figure 2.19 is the SSD for the researcher request to view results of algorithms.

Figure 2.19: SSD Researcher request to view results of algorithms

Figure 2.20 is the SSD for the researcher request to view the correlation graph.

Figure 2.20: SSD Researcher request to view correlation graph

Figure 2.21 is the SSD for the researcher request to display the dataset.

Figure 2.21: SSD Researcher request to display dataset

Figure 2.22 is the SSD for the researcher request to view cross-validation results.

Figure 2.22: SSD Researcher request to view cross validation results

Figure 2.23 is the SSD for the researcher request to view the user manual.

Figure 2.23: SSD Researcher request to view user manual

Figure 2.24 is the SSD for Malware Analyst sign up.

Figure 2.24: SSD Malware Analyst sign up

Figure 2.25 is the SSD for Malware Analyst sign in.

Figure 2.25: SSD Malware Analyst request to sign in

Figure 2.26 is the SSD for the Malware Analyst request to display the dataset.

Figure 2.26: SSD Malware Analyst request to Display Dataset

Figure 2.27 is the SSD for the Malware Analyst request to display the pruned dataset.

Figure 2.27: SSD Malware Analyst request to display pruned dataset

Figure 2.28 is the SSD for Malware Analyst sign out.

Figure 2.28: SSD Malware Analyst Sign out

Figure 2.29 is ssd regarding Malware Analyst request to display results of


algorithms.

67
Capital University of Science and Technology, Islamabad Department of Computer Science
Figure 2. 29: SSD Malware Analyst request to display results of algorithms

Figure 2.30 is ssd regarding Malware Analyst request to view correlation

graph.

68
Capital University of Science and Technology, Islamabad Department of Computer Science
Figure 2. 30: SSD Malware Analyst request to view correlation graph

Figure 2.31 is ssd regarding Malware Analyst request to select csv file.

Figure 2. 31: SSD Malware Analyst request to select csv file

Figure 2.32 is ssd regarding Malware Analyst request tune.

69
Capital University of Science and Technology, Islamabad Department of Computer Science
Figure 2. 32: SSD Malware Analyst request to tune the selected algorithm

Figure 2.33 shows the SSD for the Malware Analyst's request to apply a machine learning algorithm.

Figure 2. 33: SSD Malware Analyst request to apply machine learning algorithm

Figure 2.34 shows the SSD for the Malware Analyst's request to view cross-validation results.

Figure 2. 34: SSD Malware Analyst request to view cross validation results

Figure 2.35 shows the SSD for the Malware Analyst's request to view the user manual.

Figure 2. 35: SSD Malware Analyst request to view user manual

2.6. Domain Model
The domain model contains the main entities of the system. The entities of our system are
user, dataset, APK, and static & dynamic results. These entities have different relations
amongst them.

Figure 2. 36: Domain Model

Chapter 3

System Design

In this chapter we define our system's modules, processes, data, interfaces, and software
architecture. We discuss the software's architecture, the communication of external entities
(users) with our system, and the flow of data between the database, processes, and users.
Moreover, the database design is decided according to the selected functional requirements.

3.1. Software Architecture


Our software architecture has four layers. Each layer performs its own specific task.
The four layers are as follows.

● Presentation layer
● User interface layer
● Business logic layer
● Data layer

3.1.1. Presentation layer

This layer covers the display and presentation of the interface. It shows the services
provided by the system, for example the facility to prune the dataset and the facility to
view the whole dataset.

3.1.2. User interface layer

This layer covers the implementation behind the user interface, such as page transitions and
control through different buttons. In this layer we handle the page transitions of different
users after login and in response to other actions or queries.

3.1.3. Business logic layer

This layer handles the logic behind different actions. Dedicated processes implement the
business logic, and specific modules are written for this purpose.

3.1.4. Data layer

This layer deals with data storage, access, and distribution among different users. All
system activities use the data layer to retrieve or access data. For example, to view the
dataset we have to use the data layer to access the data from the database.

Figure 3. 1: Software Architecture

3.2. Data Flow Diagrams
We will use a modular approach to design the software. We have designed data flow
diagrams, which are used to show the flow of data among external users and system modules.
We have designed data flow diagrams up to three levels.

3.2.1. Level-0

In this level we treat our system as a black box and show the users' interaction with it.
Our system can be used by four different types of user: malware analyst, researcher, mobile
user, and administrator. Their interaction with the system is shown in the diagram below.

Figure 3. 2: Level 0 Data flow diagram

3.2.2. Level-1

In this level of DFD we have opened the system to some extent and shown a brief view of the
system. Here we have explained how a specific user will interact with the system and how the
system will respond after performing different procedures and activities.

Figure 3. 3: Level 1 Data flow diagram

3.2.3. Level 2

In this level we have explained the working of the system in detail. In this diagram the
communication of processes with users is shown in depth. At this level we have shown almost
the whole procedure by which the user gets specific information. For example, the malware
analyst will first perform an authentication procedure, then select whether he wants to
prune or view the dataset, and will be shown the corresponding results.

Figure 3. 4: Level 2 Data flow diagram

3.3. Entity Relation Diagram

The entity relationship diagram is the first step in designing the database schema and
constructing the database. In later stages this ERD is refined and normalized, and the final
database schema is generated. The entity-relationship model (or ER model) is
shown in Figure 3.5. In our database, there will be three tables. The first will be maintained
for users and will contain all the information of the user, like username, id, password, and
email. The second table is the feature vector, which contains the selected features. Similarly,
the third table contains the results of static and dynamic analysis.

Figure 3. 5: Entity Relationship Diagram

3.4. Database Schema
The final design of our database, the database schema, is shown in Figure 3.6; it was
generated after normalization. We will create three separate tables in our database. The
User table has a one-to-many relationship with the static & dynamic analysis table.
The same holds between the User and dataset tables.

Figure 3. 6: Database schema

3.5. User Interface Design
The user interface is important in software because all user interaction with the system
depends on it. For this reason, it is important to design a system that is user-friendly
and simple. We will design a simple and user-friendly interface and make sure
to give users proper guidelines for interacting with the system.

Home Page
On the home page we describe how this system can be useful for different users.
In addition, the user can log in to his account, or sign up to create an account if he
is a new user.

Figure 3. 7: Home page

Signup:

Every new user has to sign up to the system. Afterwards the user can login to the system and
can use the system. The details required to sign up are username, password, email and role of
the user.

Figure 3. 8: Signup

Chapter 4

Software Development

In this chapter we explain the standards, protocols, and modules we have used for the
development of our system. We explain how naming conventions, comments, and
indentation have been used in our code. We also describe the tools and database used for
development, and discuss all the modules of our system.

4.1. Coding Standards

During the development of the software, different coding standards were followed to make
the code more understandable and easier to edit in the future. The indentation, declaration,
naming convention, and statement standards used while coding the project are described as
follows:

4.1.1. Indentation

Indentation refers to the whitespace (a single tab or four spaces) that signifies the
beginning of a block of code. Indentation is very important in Python as it serves more
purposes than just code readability. A group of statements with the same indentation level
(the same number of leading whitespace characters) is treated as one block by Python, unlike
languages such as C and C++, where curly brackets delimit a block of code.
Python uses colons along with indentation. A colon is placed where a new block of code is
introduced, and that block must be indented to the right. The interpreter raises an error if
we forget to indent the statements after the colon. The end of the block is specified by
unindenting the next line of code (which is not part of the block).
Example from code:

Figure 4. 1: Indentation example from code
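
The indentation rules described above can also be sketched in a few lines. This is only an illustrative example: the function name classify_score and its threshold parameter are hypothetical and are not taken from our project code.

```python
def classify_score(score, threshold=0.5):
    # The colon above opens the function body; every line indented
    # by four spaces below belongs to that body.
    if score >= threshold:
        # Indented one level further: runs only when the condition holds.
        label = "malware"
    else:
        label = "benign"
    # Back at the function-body level: runs in both cases.
    return label


# Unindented again, so this line is outside the function.
print(classify_score(0.8))
```

If the indentation of a line that follows a colon is removed, Python raises an IndentationError before running the program.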

4.1.2. Statement Standards

When writing a function, a colon is used and the block of code, which can consist of single
or compound statements, is indented to the right, so Python doesn’t require curly braces for
that. The same applies to ‘if statements’: after writing the condition expression, a
colon is placed and the block of code is indented to the right.
In the part of our code shown below, which designs a push button, compound statements
are used. Each line contains one statement. Note that the design starts with an
opening parenthesis and ends with a closing one.

Figure 4. 2: Statement standard examples

Python Comments

Comments are considered good practice in coding because they are written in human-readable
language and thus provide a better understanding of the code. Moreover, well-commented
code makes bugs easier to find and the code easier to edit. We have used comments
for easy understanding, editing, and debugging of the code. We have added comments above
each module and in other places. We have followed the pattern shown below.

4.1.3. Naming Convention

A naming convention is a set of rules for choosing the names of variables, methods, and
classes. It reduces the effort needed to understand the code. We used full English descriptors
that accurately describe the variable, method, or class. For example, we use view_dataset and
user instead of names like a1, b1. Mixed case is used to make names readable: names are
generally lower case, with the first letter of class names and interface names capitalized.
The table below shows our naming style.

Table 4. 1: Naming Style

Item | Name | Description
Push Button | btn_name | Push-button names start with the keyword btn, followed by an underscore and the name, e.g. btn_login, btn_logout.
Label | lb_name | Labels are used to label input text; their names look like lb_userName.
Input box | inp_name | Input boxes get input from the user. Sample input box names are inp_passw, inp_HyperParam1.
Combo box | combo_name | Combo boxes are used for different purposes; sample names are combo_selectAlgo, combo_role.
Table | tbl_name | Tables are used to view the dataset, reports, etc. Sample table names are tbl_prune, tbl_indiviReport.
Tab Widget | T_name | A tab widget is used for multiple pages in the same window, so tab widgets are named like T_Admin, T_Reser.
Horizontal layout | HorizonL_name | A layout keeps all selected items in the same shape independent of the window size. Horizontal layouts were named like HorizonL_malw.
Vertical layout | VerL_name | Vertical layouts were used to lay out items vertically and were named like VerL_Mob.
Grid layout | gridL_name | A grid layout is used to lay out all items in a tabular form. These layouts were named like gridL_Admin.
Frame | F_name | Frames are used to design the side bar and to display graphs. Names follow the pattern F_ReserManue, F_AdminManue.
Function names | UPPERCASE | All functions were named in uppercase, e.g. LOGIN(), TUNE_ALGO().

4.2. Development Environment

We have used the Spyder IDE for Python as our development environment. We found it easy
to understand and well suited to machine learning work. As most of our project is
based on machine learning and deep learning, Spyder was the most suitable choice for coding.
Moreover, Spyder is lightweight and gives easy access to a lot of libraries. The
alternatives were PyCharm and Jupyter Notebook. We didn’t use PyCharm as it is quite
heavy, and we found Spyder better for implementing machine learning algorithms. Jupyter
Notebook is not a good option for software development, even though it is good for machine
learning algorithms. For the interface design of our software, we have used the PyQt
library, version 6 (the latest at the time of development). It is widely used for interface
design, which is why we preferred it over other options like Tkinter, Kivy, etc.

4.3. Database Management System

In our system, a database is required to store users’ details and datasets, so a good
database management system is needed. We have selected the MySQL database
management system.

MySQL:

MySQL is an open-source relational database management system. According to DB-Engines,
MySQL is the world's second most popular relational database management system. Its
compatibility with a wide range of languages and its easy management made it preferable
for us. Its reliability and speed are further advantages. Because
of all these features, we have used the MySQL database management system. We have used
the WAMP server for the database as it is user-friendly and easy to manage. We have used
MySQL for managing users’ data and the dataset which we will use for training machine
learning algorithms.

4.4. Software Description

Main modules of our project are:


● Authentication Module
● Exploratory data analysis Module
● Report Generation Module
● Get Dataset Module
● Apply ML algorithm Module
● ML algorithm tuning Module

4.4.1. Authentication module

Figure 4. 3: Authentication module

Sign-up

A user needs to sign up if he/she is using the system for the first time.
Input:
To sign up, a user needs to enter his name, password, and email, and select a role from the
drop-down menu. A user cannot use a name that is already used by
another user, so he must enter a unique username. The email entered by a user should
be valid; for instance, it shouldn’t be like “ali@” or “ali.com”. Another constraint is
that the username, password, and email cannot exceed the lengths 20, 16, and 30
respectively.
Output:
If a user enters a username that is already registered in the database, a data server
error will be shown on the screen, indicating that the username was not uploaded to the
server.
If a user enters an email that is not valid, for example one that does not contain “@” or
“.com”, a data server error will be shown for it.
If the lengths of the username, password, or email exceed 20, 16, and 30
respectively, a data server error will be shown. No error will be shown if the username
is unique, the email is valid, and the lengths of the username, password, and email do not
exceed 20, 16, and 30 respectively. In that case, the system will show the login page
afterward. The image of the interface below shows the error produced as output when an
already registered name is used as a username.

Figure 4. 4: Dataset Error
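
The sign-up constraints above can be sketched as a small validation function. This is a minimal sketch under stated assumptions, not our actual sign-up module: the function name validate_signup, the simplified email pattern, the problem messages, and the example.com address are all hypothetical. Only the length limits (20, 16, 30) and the uniqueness rule come from the text.

```python
import re

# Length limits stated in the text: username 20, password 16, email 30.
MAX_USERNAME, MAX_PASSWORD, MAX_EMAIL = 20, 16, 30

# Hypothetical, simplified pattern of the form something@something.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(username, password, email, existing_usernames):
    """Return a list of problems; an empty list means the input is acceptable."""
    problems = []
    if username in existing_usernames:
        problems.append("username already taken")
    if len(username) > MAX_USERNAME:
        problems.append("username too long")
    if len(password) > MAX_PASSWORD:
        problems.append("password too long")
    if len(email) > MAX_EMAIL:
        problems.append("email too long")
    if not EMAIL_PATTERN.match(email):
        problems.append("email not valid")
    return problems


registered = {"Fatima"}
print(validate_signup("Ahsan", "12345", "ahsan@", registered))
```

In this sketch every violated rule is collected, so the interface could report all problems at once; the system described here instead shows a single data server error.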

LOGIN

If the user is not visiting the system for the first time and already has an account,
he/she has to log in to access the system. For this purpose,
we have implemented a login option; the user is also directed to the login page
after signing up.
Input:
To log in, the user has to enter his username and password. According to the
role of the user, he will be directed to the corresponding page.
Output:
If a user enters wrong credentials, such as a wrong password or a username that
does not exist, an error message will pop up showing “Invalid
credentials”. We will not tell the user whether the username or the
password is incorrect, because revealing this would be harmful from a security
perspective: it would let an attacker find out which usernames exist.

Figure 4. 5: Login Function
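
The login behaviour described above, with one generic message for both kinds of failure, can be sketched as follows. This is an assumption-laden illustration rather than our LOGIN() function: the in-memory USERS table and the login helper are hypothetical, and passwords are kept in plain text here only to keep the sketch short.

```python
import hmac

GENERIC_ERROR = "Invalid credentials"

# Hypothetical in-memory user table: username -> (password, role).
USERS = {
    "Fatima": ("fatima12", "researcher"),
}


def login(username, password, users=USERS):
    """Return (True, role) on success, otherwise (False, GENERIC_ERROR)."""
    record = users.get(username)
    stored_password = record[0] if record else ""
    # compare_digest avoids leaking where two strings first differ.
    password_ok = hmac.compare_digest(stored_password.encode(), password.encode())
    if record is None or not password_ok:
        # Same message for an unknown user and for a wrong password.
        return False, GENERIC_ERROR
    return True, record[1]


print(login("Fatima", "fatma45"))
print(login("Nida", "12345"))
```

Both calls return the same message, so the caller cannot tell whether the username or the password was wrong. The returned role can then be used to choose the page that is opened.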

4.4.2. Exploratory Data Analysis Module

Prune dataset

We have provided the option of dataset pruning to malware analysts. A malware analyst can
apply machine learning algorithms to the original dataset or to a dataset pruned by himself.
For this purpose, the pruning functionality lets malware analysts prune the dataset by
column or by row.

Input:
The malware analyst will enter the number of rows by which he/she wants to prune the
dataset. The number of rows is selected with a spin box. To prune the dataset by
column, the malware analyst selects column names, and the dataset is pruned to the
features of his own choice.

Figure 4. 6: Prune Dataset

Output:
The output of this module will be the dataset after pruning, displayed in the form of a
table. We have enforced constraints; for example, the user can’t select a number of rows
larger than the size of the dataset.
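
Row and column pruning with the size constraint can be sketched as below. This is a hedged illustration only: the function prune_dataset, the list-of-dictionaries representation, and the feature names are hypothetical, and the sketch assumes that pruning by rows means keeping the first n rows, which the text does not state explicitly.

```python
def prune_dataset(rows, n_rows=None, columns=None):
    """Keep the first n_rows rows and only the named columns.

    rows is a list of dicts, one dict per sample (a stand-in for the
    dataset table; the real project may use a different structure).
    """
    if n_rows is None:
        n_rows = len(rows)
    if not 0 <= n_rows <= len(rows):
        raise ValueError("number of rows exceeds the size of the dataset")
    known = list(rows[0].keys()) if rows else []
    if columns is None:
        columns = known
    missing = [c for c in columns if c not in known]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return [{c: row[c] for c in columns} for row in rows[:n_rows]]


# Made-up feature names, used only for the demonstration.
data = [
    {"perm_sms": 1, "perm_camera": 0, "label": "malware"},
    {"perm_sms": 0, "perm_camera": 1, "label": "benign"},
    {"perm_sms": 1, "perm_camera": 1, "label": "malware"},
]
print(prune_dataset(data, n_rows=2, columns=["perm_sms", "label"]))
```

Asking for more rows than the dataset contains raises a ValueError, which mirrors the constraint enforced through the spin box in the interface.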

View dataset

An option to view the dataset is provided to malware analysts and researchers, so that
they can view the dataset on which they will run machine learning algorithms.
Input:
After selecting the role, the user (Malware analyst/Researcher) will click on the view
dataset button.

Figure 4. 7: view dataset

Output:

The output of this module will be a dataset table opened in a new window after the
user has pressed the view dataset button. Dataset will be fetched from the database
and if the database is not linked in the backend, then an error message will appear.

4.4.3. Report Generation Module

In this module, the admin is given the option to generate reports to get an
idea of how many people are using the system. Different types of reports can be
generated. Only the administrator can access this level of information.
Input:
The input required depends on the type of report: the admin first has to
select whether he requires a report of an individual user, of a specific group of
users, or of all users.

Figure 4. 8: Individual report generation

Output:
According to the selected option, a list of users with all details will be displayed. The
figure shows the report of all users who are Malware analysts.

Figure 4. 9: Report of selected users

Figure 4. 10: Report of all users
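
The three report types (individual user, users of one role, all users) can be sketched as a single filter function. This is a minimal sketch, assuming user records are available as dictionaries; the function name generate_report, the record layout, and the example.com addresses are hypothetical.

```python
def generate_report(users, username=None, role=None):
    """Return the user records for one individual, one role, or everyone."""
    if username is not None:
        matches = [u for u in users if u["username"] == username]
        if not matches:
            # Same wording as the error message shown by the interface.
            raise LookupError("User Not Exist")
        return matches
    if role is not None:
        return [u for u in users if u["role"] == role]
    return list(users)


# Made-up records, used only for the demonstration.
users = [
    {"username": "Ali", "email": "ali@example.com", "role": "Malware Analyst"},
    {"username": "Fatima", "email": "fatima@example.com", "role": "Researcher"},
]
print(generate_report(users, role="Malware Analyst"))
```

Calling generate_report(users) with no filter returns every record, which corresponds to the report of all users.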

4.4.4. Get dataset Module

The dataset is required for several tasks, such as viewing the dataset, pruning it, and
implementing machine learning algorithms. Instead of repeating the same code in all these
modules, a separate module has been created for this purpose. This module returns the dataset
after fetching it from the database.

Figure 4. 11: Get dataset module

Output:

Figure 4. 12: Dataset
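
The idea of one shared function that fetches the dataset for every other module can be sketched as follows. Note the assumptions: the sketch uses Python's built-in sqlite3 module with an in-memory database as a stand-in for our MySQL database, and the table name dataset and the feature columns are made up for the demonstration.

```python
import sqlite3


def get_dataset(connection, table="dataset"):
    """Fetch every row of the dataset table as (column_names, rows)."""
    # A table name cannot be bound as a query parameter, so only a
    # plain identifier is accepted here.
    if not table.isidentifier():
        raise ValueError("invalid table name")
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# Demo with an in-memory database and made-up feature columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (perm_sms INTEGER, perm_camera INTEGER, label TEXT)"
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [(1, 0, "malware"), (0, 1, "benign")],
)
columns, rows = get_dataset(conn)
print(columns, len(rows))
```

The view, prune, and ML modules can then all call this one function instead of each opening its own query.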

4.4.5. Apply ML algorithm module

This module gets the user's choice of machine learning algorithm and calls the
corresponding algorithm module with its parameters. When the user only selects an
algorithm, the module calls the function with default parameters.

Figure 4. 13: Apply ML Algorithm module
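
The dispatch step described above, from the selected algorithm name to the matching function with default parameters, can be sketched with a dictionary. Everything here is illustrative: the function names, the placeholder bodies, and the default values (C=1.0, gamma="scale") are assumptions of this sketch, and no model is actually trained.

```python
def run_svm(dataset, C=1.0, gamma="scale"):
    # Placeholder body: a real module would train and evaluate a model here.
    return {"algorithm": "SVM", "C": C, "gamma": gamma, "samples": len(dataset)}


def run_decision_tree(dataset, max_depth=None):
    # Placeholder body, as above.
    return {"algorithm": "Decision Tree", "max_depth": max_depth,
            "samples": len(dataset)}


# Maps the name chosen in the interface to the function that handles it.
ALGORITHMS = {
    "SVM": run_svm,
    "Decision Tree": run_decision_tree,
}


def apply_ml_algorithm(name, dataset, **hyper_parameters):
    """Look up the selected algorithm and call it on the dataset."""
    try:
        algorithm = ALGORITHMS[name]
    except KeyError:
        raise ValueError(f"unknown algorithm: {name}") from None
    # With no hyper parameters given, the function's defaults are used.
    return algorithm(dataset, **hyper_parameters)


print(apply_ml_algorithm("SVM", [[0, 1], [1, 0]]))
```

The same entry point can serve the tuning module: passing keyword arguments such as C=10 overrides the defaults without changing the dispatch code.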

SVM module

The SVM module is created for the implementation of the support vector machine. This module
takes three parameters: dataset, C, and gamma. Dataset is the dataset to train on, while C
and gamma are hyper parameters used for tuning the algorithm.

Figure 4. 14: SVM module

Random Forest module

The RF module is created for the implementation of random forest. This module takes seven
parameters. One of them is the dataset and all the others are hyper parameters of random
forest.

Figure 4. 15: Random Forest Module

Decision Tree module

The DT module is created for the implementation of the decision tree. It takes seven
parameters: dataset, max_depth, min_split, max_leaf, min_leaf, n_estimators, and max_feature.
These parameters are passed for tuning the decision tree algorithm.

Figure 4. 16: Decision Trees module

4.4.6. Tune ML algorithm module

This module gives the user the opportunity to tune the hyper parameters of a machine learning
algorithm. The user first has to select a machine learning algorithm, then give values for
the hyper parameters and click the apply button. The accuracy of the tuned model will be
displayed along with a line graph. Tuning of the decision tree algorithm is shown in the
figures below.

Figure 4. 17: Selection of Algorithm in tuning

Figure 4. 18: Editing Hyper Parameters in Tuning

Chapter 5

Software Testing

It is important to check the performance and usability of a system. For this reason we
have performed software testing. Testing is performed for all modules of our system, such as
signup, sign in, prune dataset, etc.

5.1. Testing Methodology


Different techniques are available for testing systems; among these
methodologies we use black-box testing. In black-box testing, testers are not supposed
to be aware of the source code or development method.

5.1.1. Black Box Testing

We are using black-box testing because it is well suited to checking the behavior of the
system: it is not concerned with the code logic or development method. It is concerned with
what users input to the system and what output is produced for the given input. In black-
box testing, we test the system against the pre-defined requirements.

The parameters checked in black box testing are:

● The actions performed by the user are accurate


● On the given inputs, how the system interacts
● The time taken by the system to respond
● Usability issues and performance issues
● Failure of systems (if it is unable to start or finish a task)

Test Case1: Sign Up


Table 5. 1: Test case sign up

Date: 27 February 2022

System: Menu Drive

Objective: Sign Up Test ID: 1

Version: 1 Test Type: Unit testing

Inputs:

1- username= Fatima, password= fatima12, email= [email protected], role= researcher
2- username= Ahsan, password=12345, email= ahsan@, role= malware analyst
3- username= Fatima, password= fatima32, email= [email protected], role= mobile user
4- username= Alia, password= alia123Custuni1234567789, email= [email protected], role=
researcher

Expected Results:
1. Successfully signed up
2. Data server error as email is invalid
3. Data server error as name Fatima is not unique
4. Data server error as the password exceeds 16

Actual Results:
1. Passed
2. Passed
3. Passed
4. Passed

Description:
The first input is valid: it has a unique name, Fatima, the password does not exceed 16
characters in length, and the email is in the correct format. Thus, the user will be
successfully signed up. The record is stored in the database and the user will be
directed to the sign-in page.

Figure 5. 1: Successful test case signup

In the 2nd input, the email is invalid so a data server error will be shown.

Figure 5. 2: Test case signup Invalid Email

In the 3rd input, the name Fatima is already taken, so it is not unique and a data server error will be shown.

Figure 5. 3: Test case signup Username already taken

In the 4th input, the password exceeds the length of 16, so a data server error will be
shown.

Figure 5. 4: Test case signup Password error

Test Case: Sign-in


Table 5. 2: Test Case Sign in

Date: 27 February 2022

System: Malware Analyser

Objective: Sign In Test ID: 2

Version: 1 Test Type: Unit testing

Inputs:
1- User name= Nida, password=12345
2- User name= Fatima, password= fatma45

Expected Results:

1) Invalid credentials error as the user didn’t sign up


2) Invalid credentials as the password is incorrect

Actual Results:
1- Passed
2- Passed

Description:
The user Nida did not sign up and is trying to sign in directly, which is why an error is
shown.

Figure 5. 5: Test case Login invalid Credentials Error

The password for the user Fatima is actually “fatima12”, but here the wrong password is
entered, which is why an invalid credentials error is shown.

Figure 5. 6: Test case Invalid Credentials Error 2

Test Case: View Information


Table 5. 3: Test case view information

Date: 27 February 2022

System: Malware Analyser

Objective: View Information Test ID: 3

Version: 1 Test Type: Unit testing

Inputs:
1- Button “Malware Types” pressed
2- Button “Mobile’s Behavior” pressed
3- Button “Mobile’s Protection” pressed
4- Button “Dataset details” pressed

Expected Results:
1- Information on types of malware will be displayed
2- Information on suspicious behavior of the mobile will be displayed
3- Information on five simple ways to protect the phone will be displayed
4- Information on the dataset details will be displayed

Actual Results:
1- Passed
2- Passed
3- Passed
4- Passed

Description:

A Mobile user can view different types of information. When the button “Malware Types” is
pressed by him/her, the system should display the information on types of malware. If the
system correctly shows the information of types of malware, the test is passed. Similar is the
case for viewing information on mobile behavior, mobile protection, and dataset details.

Figure 5. 7: Test case View information 1

Figure 5. 8: Test case View information 2

Figure 5. 9: Test case View information 3

Test Case: View Dataset


Table 5. 4: Test Case View Dataset

Date: 27 February 2022

System: Malware Analyzer

Objective: View Dataset Test ID: 5

Version: 1 Test Type: Unit testing

Input:
1- Button “View Dataset” pressed

Expected Result: Dataset will be displayed on the new window.

Actual Result: passed

Description

After pressing the view dataset button, the dataset will be displayed in a new window.

Figure 5. 10: Test case View Dataset

Test Case: Individual report generation


Table 5. 5: Test Case Individual Report Generation

Date: 27 February 2022

System: Malware Analyser

Objective: Individual report generation Test ID: 6

Version: 1 Test Type: Unit testing

Inputs:
1- username = Ali
2- username = Ahmed

Expected Results:
1- Details of user “Ali” will be displayed
2- Error will occur as no such user exists

Actual Results:
1- Passed
2- Passed

Description:

Individual reports are generated by the admin. The admin provides the username of the
individual and the user's details are displayed. For the first input, the user’s details
will be displayed as shown in the figure.

Figure 5. 11: Test case individual report, user exist

For the second input, an error will pop up displaying “User Not Exist”, as no user with
this username exists in our system. The output displayed by our system is shown in
the figure below.

Figure 5. 12: Test case individual report, user not exist error

Test Case: Role based report generation


Table 5. 6: Test Case Role based report generation

Date: 27 February 2022

System: Malware Analyser

Objective: Role-based report generation Test ID: 7

Version: 1 Test Type: Unit testing

Inputs:
1- option = Malware Analyst
2- option = Researcher
3- option = Mobile User

Expected Results:
1- List of all Malware analysts will be displayed
2- List of all Researchers will be displayed
3- List of all Mobile users will be displayed

Actual Results:
1- passed
2- passed
3- passed

Description:

According to the selected option, the relevant details of individuals will be displayed. For
the first input, all malware analysts will be displayed as shown in the figure. For the
second input, all Researchers' data will be displayed as shown in the figure. For the input
Mobile User, all records related to mobile users will be displayed as shown in the figure.

Figure 5. 13: Test case group report generation 1

Figure 5. 14: Test case group report generation 2

Figure 5. 15: Test case group report generation 3

Test Case: Save csv


Table 5. 7: Test Case csv

Date: 10 May 2022

System: Malware Analyser


Objective: Save pruned feature vector Test ID: 7

Version: 1 Test Type: Unit testing

Inputs:
1- Request to save csv by clicking prune button.

Expected Results:
1-Confirmation message that file has been saved

Actual Results:
1- passed

Description:

Malware analysts have the option to save a .csv file after completing the pruning, so that
they can apply ML algorithms later. When a malware analyst requests to save the csv file and
the file is saved, the message "file saved successfully” is displayed as confirmation.

Figure 5. 16: Test Case save csv

Test Case: Upload csv


Table 5. 8: Test Case Upload csv

Date: 10 May 2022

System: Malware Analyser

Objective: Upload csv Test ID: 7

Version: 1 Test Type: Unit testing

Inputs:
1- browse file

Expected Results:
1-Confirmation message that file has been uploaded

Actual Results:
1- passed

Description:

Malware analysts have the option to upload a .csv file which they have already pruned and
saved. As soon as the user clicks the browse file option, a file explorer window
appears for selecting the csv file.

Table 5. 9: Test Case Upload csv

Test Case: Apply ML algorithms


Table 5. 10: Test Case Apply ml algorithms

Date: 10 May 2022

System: Malware Analyser

Objective: Implementation of ML algorithms Test ID: 7

Version: 1 Test Type: Unit testing

Inputs:
1- option = KNN
2- option = Naive Bayes
3- option = SVM
4- option = Decision Tree
5- option = Random Forest

Expected Results:
1- Confusion metrics for KNN
2- Confusion metrics for Naive Bayes
3- Confusion metrics for SVM

Actual Results:
1- passed
2- passed
3- passed

Description:

Malware analysts have the option to apply machine learning algorithms to a dataset they
have pruned themselves. When they select an algorithm and click the button to apply it, the
confusion matrix appears.

Figure 5. 17: Test case implement ML algorithms Random Forest

Chapter 6

6. Static and Dynamic Analysis

In this chapter we discuss the tools and techniques used for static and dynamic analysis.
We performed static and dynamic analysis on 500 APKs, which are a subset of the selected
dataset. The purpose of this analysis is to cross validate its results with the ones we
get from the machine learning algorithms.

6.1. Static Analysis of Android Applications


Static analysis of an Android APK means examining the source code of the APK without running it.
For that purpose, the APK is reverse engineered to look for excessive permissions, services,
intents, hardcoded passwords, weak cryptographic functions, etc.

6.1.1. APK file and its Structure

APK is the Android Package format used by the Android operating system and a number of other
Android-based operating systems for the distribution and installation of mobile applications,
mobile games, and middleware. These files have the .apk extension.

Figure 6. 1: APK structure files

APK files are typically downloaded from the Google Play Store and are packaged in ZIP format.
The contents of an APK file include the AndroidManifest.xml, classes.dex, and resources.arsc
files, as well as the META-INF and res folders.

● AndroidManifest.xml – This XML file contains the metadata of the Android
application. This includes the package name, activity names, main activity (the entry
point to the app), Android version support, hardware features support, permissions,
and other configurations.
The Android manifest file also describes how each component of the application interacts.
The four main components are: activity, which handles user interaction with the screen of
the smartphone; service, which handles background processes associated with an application;
broadcast receiver, which handles communications between the OS and applications; and
content provider, which handles the database. The communication between these components
is done using messages called intents.
● classes.dex – These files contain Java code that has been converted from Java
Virtual Machine-compatible .class files to Dalvik-compatible .dex (Dalvik Executable)
files before installation on a device, so that it can be executed by the Android runtime.
● resources.arsc – This binary file contains the list of the program’s compiled
resources and their IDs. These resources include layouts, images, strings, styles, etc.
● assets – This directory contains application assets. Assets are the only way to access
data (text, music, XML, fonts) in raw form.
● res – It contains all the resources that are not compiled into resources.arsc.
● META-INF – It is a directory with APK metadata such as the signatures.
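
Since an APK is a ZIP archive, the entries listed above can be checked with a few lines of Python. The following is a minimal sketch and not part of our tool chain; the function name summarize_apk and the two constant names are our own illustrative choices.

```python
import zipfile

# Entries that the list above names as typical APK contents.
EXPECTED_FILES = ("AndroidManifest.xml", "classes.dex", "resources.arsc")
EXPECTED_DIRS = ("META-INF/", "res/", "assets/")


def summarize_apk(apk_path):
    """Report which of the typical APK entries are present in the archive."""
    with zipfile.ZipFile(apk_path) as apk:  # an APK can be opened like any ZIP file
        names = apk.namelist()
    summary = {name: name in names for name in EXPECTED_FILES}
    for directory in EXPECTED_DIRS:
        # A directory counts as present if any entry lives under it.
        summary[directory] = any(n.startswith(directory) for n in names)
    return summary
```

Calling summarize_apk("sample.apk") (a hypothetical file name) returns a dictionary such as {"AndroidManifest.xml": True, ...}, which gives a quick sanity check before decoding the APK.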

6.1.2. Activity Diagram

Following is the activity diagram for static analysis of Android, showing all the activities
performed.

Figure 6. 2: Activity diagram for static analysis

6.1.3. Tools usage

1. APK Tool

It is a command-line tool used to reverse engineer Android applications by decoding them. It
allows us to make changes in the decoded files and rebuild the applications.

The AndroidManifest.xml file inside an APK is stored in a binary XML format. If we try to open
it directly by double clicking on it, it either opens in an unreadable form or does not open at
all and an error message pops up. So the APK file needs to be decoded by the APK tool first.

The APK tool isn’t present in Kali Linux by default, so it has to be installed using the command
“apt-get install apktool”.

Figure 6. 3: apktool installation

Once the tool is installed, run the command “apktool d nameofapkfile” in the terminal.

Figure 6. 4: apk tool command

A new folder is created with the same name as the APK file. By opening the
AndroidManifest.xml file present in the newly created folder, we can view it in a readable format.
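
Decoding many APKs by hand is slow, so the same apktool command can be scripted. This is a minimal sketch under stated assumptions: apktool is installed and on the PATH, the output folder name "decoded" is hypothetical, and the helper names are our own. It uses apktool's "d" (decode) command with the "-o" output option.

```python
import subprocess
from pathlib import Path


def apktool_decode_command(apk_path, out_root="decoded"):
    """Build the `apktool d` command for one APK (out_root is a hypothetical folder)."""
    apk = Path(apk_path)
    out_dir = Path(out_root) / apk.stem  # e.g. decoded/sample for sample.apk
    return ["apktool", "d", str(apk), "-o", str(out_dir)]


def decode_all(apk_dir, out_root="decoded"):
    """Decode every .apk file in apk_dir; requires apktool to be installed."""
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        # check=True raises an error if apktool fails on an APK.
        subprocess.run(apktool_decode_command(apk, out_root), check=True)
```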

2. JD-GUI

JD-GUI is a standalone graphical utility that displays the Java source code of “.class” files.
The APK file has classes.dex, which contains the compiled Java code. We need to convert this
file to a .jar file in order to decompile it and view the code of the Java classes. For that,
run the command “d2j-dex2jar nameofapkfile” in the terminal.

Figure 6. 5: jdgui tool command

A new file will be created having a .jar extension. Open the file in the JD-GUI tool (java
decompiler) to view all the classes in a human-readable format.

6.1.4. Techniques

1. Dangerous Keywords search

We can search for dangerous keywords in the Java classes to see whether the application may be
malicious. Following is the list of keywords that were searched.
● admin
● camera
● GET
● POST
● https
● HTTP
● audio
● SQL
● address
● monitor
● send
● ACTION_CALL
● MMS
● ACTION_SEND
● ftp
● SMS
● socket
Opening the java file in JD-GUI, we can search if there are any malicious keywords in the
code.

Figure 6. 6: Java file in JD-GUI

For instance, the above .jar file contains the dangerous keyword “address” along with other
dangerous keywords.
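
The keyword search described above can also be sketched in Python for decompiled source that has been saved as text. This is only an illustration, not the procedure we followed (we searched manually in JD-GUI); the function name is our own, and the match is a plain case-sensitive substring count.

```python
# Keyword list copied from the list above.
DANGEROUS_KEYWORDS = [
    "admin", "camera", "GET", "POST", "https", "HTTP", "audio", "SQL",
    "address", "monitor", "send", "ACTION_CALL", "MMS", "ACTION_SEND",
    "ftp", "SMS", "socket",
]


def find_dangerous_keywords(source_text, keywords=DANGEROUS_KEYWORDS):
    """Return {keyword: number of occurrences} for keywords found in the text."""
    hits = {}
    for keyword in keywords:
        count = source_text.count(keyword)  # case-sensitive substring search
        if count:
            hits[keyword] = count
    return hits
```

A hit only shows that the word occurs in the code; as with the manual search, it is a hint for the analyst rather than proof of malicious behavior.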
2. Dangerous Permissions search

Android permissions can pose a huge threat if they are granted to malicious applications.
Following is the list of permissions that can be used to perform malicious activities. During
static analysis, the following permissions were searched for in the AndroidManifest.xml file of an APK.

● ACCESS_BACKGROUND_LOCATION
● ACCESS_COARSE_LOCATION
● ACCESS_FINE_LOCATION
● ACCESS_MEDIA_LOCATION
● ACTIVITY_RECOGNITION
● ANSWER_PHONE_CALLS
● BODY_SENSORS
● CALL_PHONE
● CAMERA
● GET_ACCOUNTS
● MODIFY_PHONE_STATE
● INSTALL_PACKAGES
● PROCESS_OUTGOING_CALLS
● READ_CALENDAR
● READ_CALL_LOG
● READ_CONTACTS
● READ_EXTERNAL_STORAGE
● READ_PHONE_NUMBERS
● READ_PHONE_STATE
● READ_SMS
● RECEIVE_MMS
● RECEIVE_SMS
● RECEIVE_WAP_PUSH
● RECORD_AUDIO
● SEND_SMS
● USE_SIP
● WRITE_CALENDAR
● WRITE_CALL_LOG
● WRITE_CONTACTS
● WRITE_EXTERNAL_STORAGE
● WRITE_APN_SETTINGS
● WRITE_SETTINGS

The folder created by the APK tool has a readable AndroidManifest.xml file. We can open the
manifest file and read the permissions.

Figure 6. 7: Permissions in Manifest file

In the above manifest file of an APK, we can see dangerous permissions such as
“READ_EXTERNAL_STORAGE”, “READ_LOGS”, and “CAMERA”, so the APK is classified as malware
according to the static analysis.
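
Reading the permissions out of a decoded manifest can be sketched with Python's standard XML parser. This is a minimal sketch, assuming the manifest has already been decoded to readable XML by the APK tool; the function names are our own and the watchlist shown is only a subset of the list above.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the android: attributes in a manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Subset of the permission list above, for illustration only.
WATCHLIST = {"CAMERA", "READ_SMS", "SEND_SMS", "RECORD_AUDIO",
             "READ_CONTACTS", "READ_EXTERNAL_STORAGE"}


def requested_permissions(manifest_xml):
    """Return the short names of all <uses-permission> entries."""
    root = ET.fromstring(manifest_xml)
    names = []
    for node in root.iter("uses-permission"):
        full_name = node.get("{%s}name" % ANDROID_NS, "")
        # android.permission.CAMERA -> CAMERA
        names.append(full_name.rsplit(".", 1)[-1])
    return names


def dangerous_permissions(manifest_xml, watchlist=WATCHLIST):
    """Return the requested permissions that appear in the watchlist."""
    return sorted(set(requested_permissions(manifest_xml)) & set(watchlist))
```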

6.2. Dynamic analysis of android APKs

Dynamic analysis is a technique that evaluates an application by running it in a real
environment. The main benefit of this technique is that it observes the behavior of the
application at runtime. Its disadvantages are that the application sometimes fails to execute
and that the technique is harder to implement than static analysis. Usually, we perform dynamic
analysis after static analysis to better understand the behavior of the application.

6.2.1. Tools Used

MobSF

MobSF is a malware analysis and security assessment tool that performs dynamic analysis on
Android applications. The advantage of MobSF is that it is more recent than the conventional
dynamic analysis tools and more powerful, because almost all dynamic analysis techniques can be
applied through this one tool. The developers of MobSF maintain documentation that covers all
steps from the installation process to the dynamic analysis part.

Figure 6. 8: Activity diagram for dynamic analysis

Demonstration

After installing all the software required by the documentation, we have to set up a virtual
machine to run Android. Many VMs are available for this purpose, but we chose Genymotion as it
is recommended by the MobSF developers.

First, we opened MobSF and uploaded the APK on which we wanted to perform dynamic analysis.

Figure 6. 9: Dynamic Analyzer in MobSF

MobSF loaded the APK and then we clicked on the dynamic analyzer. After clicking, a new
window opens showing the live screen of the Android device. The main thing we are concerned
with in dynamic analysis is the activity section. We can choose an activity from the activity
option and then click on start activity.

Figure 6. 10: Activities shown in apk

After starting the main activity nothing showed up on the screen, but when we started the
second activity a window appeared asking the user to enter credit card details. From this we
can conclude that it is a banking malware app.

Figure 6. 11: Activity Running

Chapter 7

7. ML Implementation and Cross Validation

Heuristic techniques perform better than static and dynamic techniques for malware analysis,
so we implement heuristic-based techniques, which need a malware dataset for training the
machine learning algorithms. ML algorithms perform better on a large, good-quality dataset.
To find a suitable dataset, we studied and compared several datasets and selected
CICMalDroid-2020. The whole work on it is discussed in this chapter.

7.1. Selection of dataset

We use machine learning algorithms for the analysis of Android-based malware. To implement
machine learning algorithms and obtain good results, a large, good-quality dataset is
required so that the algorithm learns fast and improves its results [17].

The dataset selected for this work is CICMalDroid-2020 which has malware and benign
samples. The dataset is publicly available on the University of New Brunswick site [18].

In the selected dataset, the data was generated by collecting APKs from several sources,
including the VirusTotal service, the Contagio security blog, AMD, and other datasets used by
recent research contributions, and then running them on a VMI-based dynamic analysis
system known as CopperDroid [19]. Initially 17850 APKs were collected, but when these samples
were analyzed inside a virtual environment, i.e. virtual machines, around 5000 samples were
damaged or failed to run and could not be analyzed. For this reason the remaining dataset has
13,077 samples. The categorized samples comprise 9803 malware samples and 1795 benign samples,
as follows:

1. Adware (1,253)
2. Banking (2,100)
3. SMS malware (3,904)
4. Riskware (2,546)
5. Benign (1,795)

Figure 7. 1: Classes of Dataset

The publishers have provided three different types of files:

● Capturing logs
● APK files
● CSV files

The csv files are of three different types: static analysis records, dynamic analysis records,
and binder calls. We have used the binder calls csv file, which contains combined static and
dynamic records. It has 470 features and around 15000 records.

In our project we have used all of these files. The APK files were used for static and dynamic
analysis. We downloaded 500 APKs and performed static and dynamic analysis on them using
different analysis tools such as JD-GUI. This was quite time consuming as it is a manual task.

The capturing logs were used to generate the data again. The provided dataset has no reference
to the APKs, so we needed to regenerate it with the name of each APK file added to the csv
file, so that we can cross validate the results of static and dynamic analysis with the
machine learning results.

Some other recent datasets are also publicly available. Their comparison is given in Table 7.1,
which shows that CIC-MalDroid is the most recent dataset for which APKs are also available.
The Inves-AndMal dataset is a year older, and the most recent one (CIC-AndMal) doesn’t have
APKs. Considering these reasons, we selected the CIC-MalDroid dataset.

Table 7. 1: Comparison of datasets

Dataset Name | Publish year | Year of samples | Size | Cited papers | APK | Malware categories | Malware samples | Benign samples
Inves-AndMal [20] | 2019 | 2017 | 5491 | 49 | ✔ | 4 | 426 | 5065
CIC-MalDroid [19] | 2020 | 2018 | 13,077 | 7 | ✔ | 4 | 9803 | 1795
CIC-AndMal [21] [22] | 2020 | - | 400K | 2 | - | 12 | 200K | 200K

The malware categories present in the three latest datasets are shown in Table 7.2. Adware is
present in all datasets, while banking malware, SMS malware, riskware, ransomware, scareware,
and Trojans are not present in all of them. This suggests that adware is the most common
category on Android. Because adware comes in the form of advertisements, it is easy to target
a large community with it without much risk.

Table 7. 2: Comparison of features of benchmark datasets with CICMalDroid-2020.

Dataset Name | Adware | Banking | SMS malware | Riskware | Ransomware | Scareware | Trojan
Inves-AndMal [20] | ✔ | - | ✔ | - | ✔ | ✔ | -
CIC-MalDroid [19] | ✔ | ✔ | ✔ | ✔ | - | - | -
CIC-AndMal [21] [22] | ✔ | ✔ | - | ✔ | ✔ | ✔ | ✔

7.2. Extraction of dataset


The selected dataset CICMalDroid-2020 provides the csv files of the dataset, the APKs, and the
data in raw form saved in .json files. The given csv file does not contain the APK’s name or
id, which makes it impossible to tell which record belongs to which APK. To get a dataset that
references the APK collection, we extracted the dataset from the json files using a Python
script and generated the csv file again.

Figure 7. 3: Extraction of Dataset

Figure 7. 4: Extraction of Dataset
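
The idea of the extraction can be sketched as follows. This is not our actual script: it is a simplified illustration that assumes each .json file is named after its APK and holds a flat mapping from feature name to count, which is a hypothetical layout. The function name and the column name apk_name are also our own choices.

```python
import csv
import json
from pathlib import Path


def json_logs_to_csv(json_dir, csv_path):
    """Write one csv row per .json file, with the APK name as the first column."""
    rows, features = [], set()
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)  # assumed layout: {"feature": count, ...}
        features.update(record)
        # The file name (without .json) is used as the APK reference.
        rows.append({"apk_name": path.stem, **record})
    fieldnames = ["apk_name"] + sorted(features)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # restval=0 fills features that an APK never produced.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval=0)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the APK name as a column is what later allows a csv record to be matched with the static and dynamic analysis result of the same APK.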

This extracted dataset will be provided to the malware analyst to perform experiments and to
get predictions for the data of new APKs. We will implement different machine learning
algorithms and give users the opportunity to apply and tune them according to their requirements.

7.3. ML algorithms
A machine learning algorithm is the technique by which Artificial Intelligence systems
perform their tasks, generally predicting output values from the data given to them. There are
mainly four types of ML algorithms:

● Supervised ML algorithms: they have input data and class labels.
● Unsupervised ML algorithms: they do not have class labels.
● Transfer learning: uses the data of a previous task to complete a new but related task.
● Reinforcement learning: rewards the desired behaviors and punishes the undesired ones to
direct the learning.

7.3.1. Environment Specification

For the implementation of machine learning algorithms, the base paper [17] used a PC with a
3.60 GHz Core i7-4790 CPU and 32 GB RAM. Our system has a 3.20 GHz Core i3-2600 CPU and 4 GB
RAM, so we could not use it for the implementation of these algorithms. The ML algorithms are
therefore implemented in Google Colab notebooks, which are Jupyter notebooks that run in the
cloud.

7.3.2. Implementation of ML algorithms

We have implemented various supervised machine learning algorithms on the CICMalDroid-2020
dataset. The selected algorithms are among those commonly used for Android malware detection
according to the research. 70% of the data is used for training the model and 30% for
testing it.

During the implementation of these algorithms, we performed hyperparameter tuning. We
reproduced the experiments of the CICMalDroid-2020 research paper and compared our results
with the ones it reports, aiming for better accuracies.

7.3.3. Decision Tree Algorithm

A decision tree is a tree-like structure that is used as a model for classifying data. A decision
tree consists of three types of nodes:

● Decision Nodes: These nodes have two or more branches.
● Leaf Nodes: The nodes at the lowest level, which represent a decision. They do not
have any children.
● Root Node: It is located at the topmost level and is also a decision node.

Let’s understand the working of decision trees with an example [18]:

Figure 7. 5: Decision tree example

Step 1:

● Determine the decision column.

● Determine the class that provides the basis for a decision. It is PlayGolf in this case.

Step 2:

● Calculate the entropy for the PlayGolf column.

$$E(\text{PlayGolf}) = E(5,9) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.94$$

Step 3:

● Calculate the entropy for the other attributes after the split.

To calculate E(PlayGolf, Outlook), we use the formula below:

$$E(\text{PlayGolf},\text{Outlook}) = P(\text{Sunny})\,E(\text{Sunny}) + P(\text{Overcast})\,E(\text{Overcast}) + P(\text{Rainy})\,E(\text{Rainy})$$

which is the same as:

$$E(\text{PlayGolf},\text{Outlook}) = \tfrac{5}{14}E(3,2) + \tfrac{4}{14}E(4,0) + \tfrac{5}{14}E(2,3)$$

We first calculate E for Sunny:

$$E(\text{Sunny}) = E(3,2) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$$

E for Overcast and Rainy is calculated in the same way, and the weighted terms are added
to get the entropy for Outlook:

$$E(\text{PlayGolf},\text{Outlook}) = \tfrac{5}{14}E(3,2) + \tfrac{4}{14}E(4,0) + \tfrac{5}{14}E(2,3) = 0.693$$

Step 4:

● After finding the entropy of all 4 attributes, find the information gain using this formula:

$$\text{Gain}(S,T) = \text{Entropy}(S) - \text{Entropy}(S,T)$$

● Choose the attribute that gives the highest information gain after the split, i.e.
Outlook with a gain of 0.94 - 0.693 = 0.247.
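
The entropy and information gain figures of this example can be checked with a short Python sketch. The function names are our own; the class counts (9 and 5 for PlayGolf, and 3/2, 4/0 and 2/3 for Sunny, Overcast and Rainy) are the ones used in the formulas above.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given the number of samples per class."""
    total = sum(counts)
    # Classes with zero samples contribute nothing to the sum.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def split_entropy(groups):
    """Weighted entropy after a split; groups holds the class counts per branch."""
    total = sum(sum(group) for group in groups)
    return sum((sum(group) / total) * entropy(group) for group in groups)


play_golf = (9, 5)                    # class counts of the decision column
outlook = [(3, 2), (4, 0), (2, 3)]    # Sunny, Overcast, Rainy
gain = entropy(play_golf) - split_entropy(outlook)
```

Rounded to three decimals, entropy(play_golf) is 0.940 and gain is 0.247, which agrees with the values obtained by hand.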

Step 5:

● Perform the first split.

● Overcast gives a homogeneous group, so it becomes a leaf node.

Step 6:

● Perform further splits.

The Sunny and Rainy branches need to be split further. They can be split using Temperature,
Humidity or Windy. Let’s consider Rainy first: Humidity produces homogeneous groups.

Step 7:

● Complete the tree.

● Split Sunny using Windy, as it gives homogeneous groups.

Figure 7. 6: Complete Decision Tree

Implementation of Decision Tree algorithm

We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:

Criterion:

It is the function to measure the quality of a split. Supported criteria are “gini” for the
Gini impurity and “entropy” for the information gain. Gini Impurity measures the
divergences between the probability distributions of the target attribute’s values and
splits a node such that it gives the least amount of impurity. Information gain uses the
entropy measure as the impurity measure and splits a node such that it gives the most
amount of information gain.
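
To make the two criteria concrete, the following sketch computes both impurity measures from the class counts of a node. It is an illustration with our own function names, not the library's internal implementation.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy: the impurity measure behind information gain."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)
```

Both measures are 0 for a pure node such as (4, 0) and largest for an evenly mixed node such as (7, 7), where the Gini impurity is 0.5 and the entropy is 1.0; a split is chosen to reduce them as much as possible.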

Figure 7. 7: Criterion

So criterion = entropy is the optimal value for this hyper parameter.

min_samples_split:

It denotes the minimum number of samples required to split an internal node.
min_samples_split is used to control over-fitting. Higher values prevent a model from
learning relations which might be highly specific to the particular sample selected for
a tree, but too high values can lead to under-fitting. Hence, depending on the level of
under-fitting or over-fitting, we can tune the value of min_samples_split.

Figure 7. 2: minimum sample split

So the optimal value for this parameter is ‘2’.

min_samples_leaf:

The minimum number of samples required to be at a leaf node. A split point at any
depth will only be considered if it leaves at least min_samples_leaf training samples
in each of the left and right branches. Values greater than one control over-fitting by
requiring each leaf to hold more than one sample, which ensures that the tree cannot
overfit the training dataset by creating a bunch of small branches exclusively for one
sample each.

Figure 7. 8: minimum sample leaf

The optimal value for this parameter is ‘1’.

max_features:

It is the number of features to consider when looking for the best split. Every time
there is a split, the algorithm looks at this number of features, takes the one with the
optimal value of the split criterion (e.g. entropy), and creates two branches according
to that feature. Another use of max_features is to limit over-fitting. By choosing a
reduced number of features, we can increase the stability of the tree and reduce variance
(variability in the model prediction) and over-fitting.

Figure 7. 9:max features

The optimal value for this parameter is ‘311’.

Results:

The accuracy we have obtained for the Decision Tree algorithm is 92.16% while that
mentioned in the research paper is 90.75%.

7.3.4. K-Nearest Neighbor (KNN):

· K-nearest neighbors (KNN) is a supervised learning algorithm used for classification.

· To make a prediction, KNN calculates the distance between the test data point and all the
points of the training data.

· It selects the k points with the minimum distance.

· The test data point is assigned the class held by the majority of these nearest training
data points.

· In KNN, k refers to the number of nearest neighbors.


Working:

Figure 7. 10: KNN Example [17]

· Select K.

· Calculate the Euclidean distance from the new data point to every training data point.

· Take the K nearest neighbors, i.e. those with the least Euclidean distance.

· Among these K neighbors, count the number of data points in each category.

· Assign the new data point to the category that the maximum number of neighbors
belong to.

· Model is ready.

How to select an optimal K value?

● There is no pre-defined way to find the optimal K.

● Plot the error rate against K over a defined range and select the K with the least error
rate.

Let’s understand the working of KNN through a mathematical example. A tissue making
company has produced a new kind of tissue paper. They want to test whether the quality of
the new tissue paper is good or bad so that they can set the price accordingly. The quality
can be tested based upon acid durability and strength, which have the values 3 and 7 for the
new tissue, respectively. We take k=3 and use the data of already existing tissue papers
(categorized as good or bad) to apply the KNN algorithm.

Figure 7. 3: KNN mathematical example

We compute the Euclidean distance between the X1 and X2 values of the new tissue paper and
those of every existing record, i.e. the distance to all the points.
Afterwards, we select the 3 neighbors with the minimum distance and look at their labels.
As two of them have the label ‘Good’, the new tissue is categorized as good.
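
The steps of this example can be sketched in plain Python. The four existing tissue records used here are assumed illustrative values, since the figure is not reproduced in the text; only the new tissue (3, 7) and k = 3 come from the example. The function names are our own.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=2 is the Euclidean and p=1 the Manhattan distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(training, query, k=3, p=2):
    """Label the query by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Assumed records: ((acid durability, strength), quality)
tissues = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
prediction = knn_predict(tissues, (3, 7), k=3)
```

With these assumed records the three nearest neighbors of (3, 7) are (3, 4), (1, 4) and (7, 7), two of which are labeled Good, so the prediction is "Good".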

Implementation of K- Nearest Neighbors:

We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:

● n_neighbors:
It represents the number of neighbors to use for kneighbors queries. Its default value is 5.
● leaf_size:
This parameter is passed to BallTree or KDTree (both are tree structures that can be used for
the neighbor search in kNN). It can affect the speed of the construction and query, as well as
the memory required to store the tree. The optimal value depends on the nature of the problem.
Its default value is 30.
● p:
Power parameter for the Minkowski metric. p = 1 is equivalent to using
manhattan_distance (l1), and p = 2 to euclidean_distance (l2). Its default value is 2.

Grid Search CV:

Rather than using a greedy approach, we use Grid Search CV to find the optimal values of the
hyperparameters.

GridSearchCV is a meta-estimator. It takes a dictionary that describes the parameters that
could be tried on a model to train it. The grid of parameters is defined as a dictionary, where
the keys are the parameters and the values are the settings to be tested.

GridSearchCV takes an estimator like KNeighborsClassifier (K-Nearest Neighbor Classifier) and
creates a new estimator that behaves exactly the same.

We added the parameter refit with the value 'True'. This parameter is used for refitting an
estimator using the best found parameters on the whole dataset. First, it runs the same loop
with cross-validation to find the best parameter combination. Once it has the best
combination, it runs fit again on all data passed to fit (without cross-validation) to build a
single new model using the best parameter setting.

verbose controls the text output describing the process. The higher the number, the more
messages are printed.

When verbose > 1: Computation time for each fold and parameter candidate is displayed.

When verbose > 2: Score is also displayed.

When verbose > 3: Fold and candidate parameter indexes are also displayed together with the
starting time of the computation.
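
How a parameter grid turns into candidate settings can be illustrated without the library. The sketch below is plain Python and only mimics how the grid is enumerated; it is not GridSearchCV itself. The grid values and the number of folds are hypothetical.

```python
from itertools import product

# Hypothetical grid: keys are parameters, values are the settings to be tested.
param_grid = {"n_neighbors": [1, 3, 5], "leaf_size": [1, 30], "p": [1, 2]}


def expand_grid(grid):
    """Return every combination of settings as one dictionary per candidate."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
n_folds = 5                               # assumed number of cross-validation folds
n_fits = len(candidates) * n_folds        # fits during the search itself
```

Here the grid yields 12 candidates, so a 5-fold search trains 60 models, plus one final fit on all the data when refit is enabled. This is why large grids quickly become expensive.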

Figure 7. 4: Grid Search CV

Results:

The best values of the hyperparameters found by GridSearchCV are 'leaf_size': 1,
'n_neighbors': 1, 'p': 1. The accuracy we have obtained is 88.65%, while that mentioned in
the research paper is 85.25%.

7.3.5. Random Forest Algorithm

It is a supervised ML algorithm that is widely used in classification problems. It takes a
majority vote over a number of decision trees to make a prediction in a classification problem.

Why Random Forest?

Decision trees are highly sensitive to the training data: changing the data a little can result
in a completely different decision tree. This high variance means our model might fail to
generalize. A random forest is a collection of multiple randomized trees and is hence less
sensitive to the training data.

Working:

In Random Forest, we randomly draw subsets of the original data, each with the same number of
rows as the original (rows are sampled with replacement). We build a decision tree for each
subset independently, and we randomly select a subset of features for building these decision
trees. For prediction, the test data point is passed to each decision tree. The class/category
given by the maximum number of decision trees is assigned to the test data point.
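
The three ingredients described above (row sampling, feature sampling and voting) can be sketched in plain Python. This is an illustration with our own function names and omits the training of the trees themselves; taking the square root of the number of features is one common choice for the feature subset size.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw as many rows as the original data, sampling with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) feature indices at random."""
    k = max(1, int(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, majority_vote(["adware", "benign", "adware"]) returns "adware", and with the 470 features of our csv file random_feature_subset selects 21 feature indices.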

Implementation of Random Forest Algorithm

We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:

● n_estimators: This hyperparameter controls the number of trees in the forest. Its default
value is 100.
● criterion: It is the function to measure the quality of a split. Supported criteria are “gini”
for the Gini impurity and “entropy” for the information gain. Gini impurity measures the
divergences between the probability distributions of the target attribute’s values and splits
a node such that it gives the least amount of impurity. Information gain uses
the entropy measure as the impurity measure and splits a node such that it gives the most
amount of information gain. The default function is gini.
● max_features: It is the number of features to consider when looking for the best split.
Every time there is a split, the algorithm looks at this number of features, takes the one
with the optimal value of the split criterion (e.g. entropy), and creates two branches
according to that feature. Another use of max_features is to limit over-fitting. By choosing a
reduced number of features, we can increase the stability of the tree and reduce variance
(variability in the model prediction) and over-fitting. Its default value is sqrt, which
means max_features = sqrt(n_features).
● class_weight: If this parameter is not given, all classes are supposed to have weight one
(i.e. equal weight). The balanced mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data as n_samples /
(n_classes * np.bincount(y)).
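
The balanced weighting formula quoted above can be reproduced in a few lines. This sketch uses a plain counter instead of np.bincount so that it also works with string labels; the function name is our own.

```python
from collections import Counter


def balanced_class_weights(y):
    """Compute n_samples / (n_classes * count) for every class label in y."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}
```

For a toy label list with three "benign" samples and one "banking" sample, the weights are about 0.67 for benign and 2.0 for banking, so the rare class counts more during training.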

Grid Search CV is used for finding the optimal values of the hyper parameters.

Results:

The best found parameters are 'class_weight': 'balanced', 'criterion': 'gini', 'max_features':
'sqrt', 'n_estimators': 100. The accuracy we’ve obtained is 94.94% and that mentioned in
research paper is 93.44%.

7.3.6. Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm commonly used for classification problems.

● The data points in the SVM algorithm are plotted in n-dimensional space (n is the
number of features), where the value of each feature is the value of a particular
coordinate.

● Afterwards, a hyperplane is defined that differentiates the two classes well for the
purpose of classification.

● Support vectors are simply the coordinates of individual observations. The SVM
classifier is the frontier (hyperplane/line) that best segregates the two classes. [18]

Working

SVM identifies the right hyperplane by considering the data points. Here, in this scenario
hyperplane A is not differentiating the two classes well. Similar is the case with
hyperplane C. We can see that the hyperplane B is differentiating the two classes well so
here hyperplane B shall be selected.

Figure 7. 5: SVM Example 1

Let’s consider another scenario in which all the hyperplanes are differentiating
between the classes well so which hyperplane shall we choose?

Figure 7. 6: SVM Example 1.1

By maximizing the distance between the hyperplane and the nearest data points of both classes,
we can identify the right hyperplane. This distance is called the margin. So the right
hyperplane shall be C.

An important point is that we don’t need to select the hyperplane manually, as SVM does that
itself through its kernel technique.
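
The margin idea can be illustrated by comparing a few candidate lines in two dimensions. This sketch is not how an SVM is trained (the library solves an optimization problem instead of comparing candidates); the points, the candidate lines and their names H1 to H3 are hypothetical.

```python
from math import hypot


def margin(w, b, points):
    """Smallest distance from any point to the line w[0]*x + w[1]*y + b = 0."""
    norm = hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)


def separates(w, b, positives, negatives):
    """True if all positives lie on one side of the line and all negatives on the other."""
    def side(point):
        return w[0] * point[0] + w[1] * point[1] + b
    return all(side(p) > 0 for p in positives) and all(side(p) < 0 for p in negatives)


def best_hyperplane(candidates, positives, negatives):
    """Among the separating candidates, return the name of the one with the largest margin."""
    valid = {name: wb for name, wb in candidates.items()
             if separates(wb[0], wb[1], positives, negatives)}
    return max(valid, key=lambda name: margin(valid[name][0], valid[name][1],
                                              positives + negatives))


positives = [(4, 4), (5, 5)]
negatives = [(1, 1), (0, 2)]
candidates = {"H1": ((1, 1), -3), "H2": ((1, 1), -5), "H3": ((1, 1), -7)}
chosen = best_hyperplane(candidates, positives, negatives)
```

All three candidate lines separate the two classes, but H2 lies midway between them and therefore has the largest margin, so it is the one chosen.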

Implementation of Support Vector Machine Algorithm

We have performed tuning of the hyper parameters by finding their optimal values. The
hyper parameters we’ve tuned are:

● C: Regularization parameter. The strength of the regularization is inversely
proportional to C. It must be strictly positive. Put simply, this hyperparameter is
used to control the tolerance for errors.
● kernel: Specifies the kernel type to be used in the algorithm. For example, when the
dataset is not linearly separable, it is recommended to use a kernel function other than
'linear'. The kind of kernel to use depends on the distribution of the data.
Oversimplified, a kernel is a mathematical function.
● gamma: Kernel coefficient for rbf, poly and sigmoid. It controls the curvature of the
decision boundary.
● class_weight: If this parameter is not given, all classes are supposed to have weight
one (i.e. equal weight). The balanced mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data as n_samples /
(n_classes * np.bincount(y)).

Grid Search CV has been used to find the optimal values of the parameters.

Figure 7. 7: grid search cv
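The figure is not reproduced here, so the following is only a hedged sketch of how such a search could look with scikit-learn's GridSearchCV. The synthetic data stands in for the project's features, and the parameter grid is hypothetical; the grid actually used in the project is not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the project's feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical search grid over the four hyperparameters described above.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-3, 1e-2, 1e-1],
    "kernel": ["rbf"],
    "class_weight": [None, "balanced"],
}

# Every combination is scored with 5-fold cross validation on the training set.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("held-out accuracy:", search.score(X_test, y_test))
```

Keeping a held-out test set separate from the search avoids reporting an accuracy that was tuned on the same data.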

Results:

The optimal hyperparameter values found by Grid Search CV are C=100000 and gamma=1e-08.
With these values SVM achieved an accuracy of 86.52%, compared with 78.1% reported in the
research paper.

The table below compares the accuracies of the machine learning algorithms in our experiment
with those reported in research paper [19].

Table 7. 3: Comparison of Machine Learning Algorithms

Sr. No.  Name                  Research Paper                   Our Experiment
                               Accuracy (%)  Error rate (%)     Accuracy (%)  Error rate (%)
1.       Random Forest         93.44         6.56               94.94         5.06
2.       Decision Tree         90.75         9.25               92.16         7.84
3.       SVM                   78.1          21.9               86.52         13.48
4.       K Nearest Neighbour   85.25         14.75              88.65         11.35

7.4. Cross validation

Besides the other modules, one module of this project compares malware analysis techniques
to determine which performs better. We aimed to cross validate the static and dynamic
analysis techniques against the machine learning algorithms: the same APKs were assessed
with both approaches and the resulting accuracies were compared.

For this purpose we selected 500 APKs, 100 from each category; the sample size was limited by
the time available. We first analyzed the 500 APKs one by one with static and dynamic
analysis tools such as Apktool, JD-GUI and MobSF. The results were recorded in an Excel file
containing each APK's name, the prediction of each tool (malware or benign), the final class
label, decided by the majority label of the tools, and the original label of the APK. The
Excel file of results looks like this:

Figure 7. 8: Excel file of APKs results
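The majority-label rule used for the final class label can be sketched as follows. The APK names and tool predictions are hypothetical, and the tie rule (a tie counts as malware) is an assumption, since the report does not say how ties were handled.

```python
from collections import Counter


def majority_label(tool_predictions):
    """Return the label predicted by most tools.

    Ties are resolved in favour of "malware"; this tie rule is an
    assumption, not something stated in the report.
    """
    counts = Counter(tool_predictions)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else "malware"


# Hypothetical rows: APK name -> predictions from three tools.
results = {
    "app_a.apk": ["malware", "malware", "benign"],
    "app_b.apk": ["benign", "benign", "benign"],
}

final_labels = {apk: majority_label(preds) for apk, preds in results.items()}
print(final_labels)
```

Using an odd number of tools avoids ties altogether, which would make the tie rule irrelevant.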

After the static and dynamic analysis, we applied the machine learning algorithms to the data
of the same 500 APKs. The available dataset does not contain APK names, so the records of
these APKs could not be selected from it directly. Instead, we extracted the data from the
.json files available for each APK, using a self-written Python script that first unzips the
folder and then exports the data to a .csv file. We then applied all four machine learning
algorithms used throughout this research: KNN, SVM, RF and DT. A code snippet of the ML
algorithm implementation is shown in the image below.

Figure 7. 9: Code snippet
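Since the snippet in the figure is not reproduced here, the following is a hedged sketch of how the four algorithms could be trained and compared with scikit-learn. The synthetic 500-row dataset, the five classes, the 80/20 split and the default hyperparameters are all assumptions standing in for the project's extracted .csv data and tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features extracted from the 500 APKs.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The four algorithms named in the text, with default hyperparameters.
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Train each model and record its accuracy on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2%}")
```

Evaluating all models on the same split keeps the comparison between them fair.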

After obtaining the accuracies of the machine learning algorithms, we calculated the accuracy
of static and dynamic analysis from the Excel file. For this purpose we used the Python
library scikit-learn (sklearn). Its library function for calculating accuracy is as follows:

Figure 7. 10: Library Function

The compulsory parameters of this function are the actual labels and the predicted labels.
Both are recorded in our Excel file of static and dynamic analysis results, so we used the
function like this:

Figure 7. 11: use of compulsory parameters
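A minimal sketch of this calculation, assuming the function in question is scikit-learn's accuracy_score (the figures are not reproduced here). The two label lists are a hypothetical five-row excerpt of the results sheet, not the project's data.

```python
from sklearn.metrics import accuracy_score

# Hypothetical excerpt of the results sheet:
# original label of each APK vs. the majority label of the tools.
actual_labels = ["malware", "malware", "benign", "malware", "benign"]
predicted_labels = ["malware", "benign", "benign", "malware", "benign"]

# accuracy_score returns the fraction of labels that match.
accuracy = accuracy_score(actual_labels, predicted_labels)
error_rate = 1.0 - accuracy

print(f"accuracy: {accuracy:.2%}, error rate: {error_rate:.2%}")
```

In practice the two columns could be loaded from the Excel file first, for example with pandas.read_excel, and passed to the function in the same way.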

After obtaining all the accuracies, we drew a bar graph to compare the accuracies of all
malware analysis techniques. The results are shown in the graph:

Figure 7. 12: Bar Graph
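A bar graph of this kind could be drawn with matplotlib as sketched below. Only the two error rates stated in this section are used (4.7% for Random Forest and 24.2% for static and dynamic analysis); the remaining bars of the original figure are omitted because their values are not given in the text. The output file name is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Error rates (%) stated in this section; accuracy = 100 - error rate.
error_rates = {"Random Forest": 4.7, "Static and dynamic analysis": 24.2}
accuracies = {name: 100.0 - err for name, err in error_rates.items()}

fig, ax = plt.subplots()
ax.bar(list(accuracies.keys()), list(accuracies.values()))
ax.set_ylabel("Accuracy (%)")
ax.set_ylim(0, 100)
ax.set_title("Accuracy of malware analysis techniques")

# Hypothetical output file name.
fig.savefig("accuracy_comparison.png")
```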

The graph above shows the accuracy of each machine learning algorithm alongside the accuracy
of static and dynamic analysis. The results clearly show that the machine learning algorithms
perform much better than static and dynamic analysis. Random Forest performed best, with an
error rate of 4.7%, while the error rate of static and dynamic analysis is 24.2%. The machine
learning algorithms also make some errors, but far fewer: Random Forest performs almost 20
percentage points better than static and dynamic analysis. This shows that, for malware
analysis, machine learning performs better than static and dynamic analysis. The major reason
for the poor performance of static and dynamic analysis is that it involves humans to a great
extent, and human error reduces its accuracy.

Chapter 8

Software Deployment

The software is Windows-based, and a setup file for Windows is provided.

8.1. Installation and Deployment Process Description


Before deploying the system, check the system requirements.

Recommended system requirements

• CPU: Intel Core i3 10100F 3.6GHz
• RAM: 8 GB
• HDD: 5 GB
• OS: Windows 8 or above

The user is provided with a setup file (.exe). Right-click the .exe and click Run as
administrator.

Figure 8. 1: Run as Administrator

The installation completes by clicking Next in the setup window. To change the installation
location, browse to the desired folder; otherwise continue with the default. After
installation, locate the directory named Database inside the installation directory. If you
did not change the directory during installation, it is
C: /Users/”username”/Program Files(x86)/Malware Analyser/Database. The file named
‘malware.sql’ inside the Database folder is used in the next steps.

8.1.1. Setup Dependency

The user needs to download and install XAMPP on the machine. After installation:

• Start XAMPP and start the MySQL and Apache services.

Figure 8. 2: XAMPP 1

• Click the “Admin” button next to Stop in the MySQL service row (highlighted in the
image below).

Figure 8. 3: XAMPP 2

• The localhost site will open in your browser, as shown below.

Figure 8. 4: php my admin

• Create a new database named malware by following the steps shown below.

Figure 8. 5: New Database

• After creating the empty database, import the database (the ‘malware.sql’ file from the
Database folder) into it by following the steps shown below.
Figure 8. 6: Import Database

• After the import, verify that the database has been imported by clicking on the
database in the left-side pane. It will look like this.

Figure 8. 7: verify import

• Now go back to the initially extracted folder, double-click the exe file and wait for
some time. A black screen will appear, as shown in the image below. Do not close it
until you want to close the software.

Figure 8. 8: Waiting Screen

• After some time the software will start and is ready to use. Keep in mind that XAMPP
must remain running, as shown in step 1, while the software is in use.

Chapter 9
9. Project Evaluation
In the final evaluation of FYP-1, some amendments to the project were suggested, and we
applied the suggested changes accordingly. The following suggestions were given by the
respected teachers:
Table 9. 1: Project Evaluation

Sr. No.  Suggestions
1        Add a progress bar while loading data, as loading takes a long time.
2        Add a progress bar for pruning, as it is time consuming, so that the user is aware
         of the time required for this task.
3        Provide a guide to the user about hyperparameters.
4        Maintain a history of the hyperparameters used in tuning.
5        Let malware analysts view dynamic accuracy graphs during hyperparameter tuning.

References

[1] S. O'Dea, 20 5 2021. [Online]. Available: https://www.statista.com/topics/876/android/.
[2] wikipedia, "Android_version_history," [Online]. Available:
https://en.wikipedia.org/wiki/Android_version_history.
[3] S. H. J. A. M. M. Q. a. T. S. K. Dunham, ANDROID MALWARE AND ANALYSIS, Boca Raton, CRC
Press Taylor & Francis Group, 2015.
[4] computer world, "www.computerworld.com" [Online]. Available:
https://www.computerworld.com/article/3235946/android-versions-a-living-history-from-1-0-to-today.html.
[5] statcounter.com, "statcounter.com," [Online]. Available:
https://gs.statcounter.com/os-market-share/mobile/worldwide.
[6] developer.android.com, "developer.android," [Online]. Available:
https://developer.android.com/guide/practices/compatibility.
[7] T. Contributor, "searchmobilecomputing.techtarget.com," 3 2008. [Online]. Available:
https://searchmobilecomputing.techtarget.com/definition/Open-Handset-Alliance.
[8] techxplore.com, "techxplore.com," 5 3 2021. [Online]. Available:
https://techxplore.com/news/2021-03-non-intrusive-method-cyber-android-devices.html.
[9] f-secure.com, "f-secure.com," [Online]. Available:
https://www.f-secure.com/v-descs/trojan_android.shtml.
[10] B. Soare, "android-malware," [Online]. Available:
https://heimdalsecurity.com/blog/android-malware/.
[11] K. O. a. K. A. Z. A. R. Sihwail, "A Survey on Malware Analysis Techniques: Static,
Dynamic, Hybrid and Memory Analysis," International Journal on Advanced Science, Engineering
and Information Technology, vol. 8, pp. (4-2), 2018.
[12] B. M. Mehtre, "Advances In Malware Detection-An Overview," 2021.
[13] S. Bowcut, "malware-analyst," [Online]. Available:
https://cybersecurityguide.org/careers/malware-analyst/.
[14] D. R. B. A. R. &. R. A. Thomas, "Security metrics for the android ecosystem," In
Proceedings of the 5th Annual ACM CCS Workshop on Security and Privacy in Smartphones and
Mobile Devices, pp. 87-98, 2015.
[15] kaspersky.com, "resource-center/threats/mobile," [Online]. Available:
https://www.kaspersky.com/resource-center/threats/mobile.
[16] A. Faizan, "best-programming-languages-cyber-security," 02 02 2021. [Online]. Available:
https://flatironschool.com/blog/best-programming-languages-cyber-security.
[17] algorithmia, "the-importance-of-machine-learning-data," 26 03 2020. [Online]. Available:
https://algorithmia.com/blog/the-importance-of-machine-learning-data.
[18] "maldroid-2020," [Online]. Available:
https://www.unb.ca/cic/datasets/maldroid-2020.html.
[19] S. K. A. F. A. F. R. A. D. &. G. A. A. Mahdavifar, "Dynamic Android Malware Category
Classification using Semi-Supervised Deep Learning," 2020 IEEE Intl Conf on Dependable,
Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf
on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress,
pp. 515-522, 08 2020.
[20] L. K. A. F. A. &. L. A. H. Taheri, "Extensible android malware detection and family
classification using network-flows and API-calls," 2019 International Carnahan Conference on
Security Technology (ICCST), pp. 1-8, 2019.
[21] D. S. L. B. K. G. L. A. H. G. F. &. M. F. Keyes, "EntropLyzer: Android Malware
Classification and Characterization Using Entropy Analysis of Dynamic Characteristics," 2021
Reconciling Data Analytics, Automation, Privacy, and Security: A Big Data Challenge (RDAAPS),
pp. 1-12, 2021.
[22] A. L. A. H. K. G. T. L. G. F. &. M. F. Rahali, "DIDroid: Android Malware Classification
and Characterization Using Deep Image Learning," 2020 the 10th International Conference on
Communication and Network Security, pp. 70-82, 2020.
[23] C. HOFFMAN, "the-case-against-root-why-android-devices-dont-come-rooted," 20 6 2017.
[Online]. Available:
https://www.howtogeek.com/132115/the-case-against-root-why-android-devices-dont-come-rooted/.
[24] bullguard.com, "android-rooting-risks," [Online]. Available:
https://www.bullguard.com/bullguard-security-center/mobile-security/mobile-threats/android-rooting-risks.aspx.
[25] K. Casey, "top-7-vulnerabilities-in-android-applications-2019," 20 09 2019. [Online].
Available: https://codersera.com/blog/top-7-vulnerabilities-in-android-applications-2019/.
[26] S. Srivatsa, "android_security," 15 12 2014. [Online]. Available:
https://www.cse.wustl.edu/~jain/cse571-14/ftp/android_security.pdf.
[27] tutorialspoint, "tutorialspoint.com," [Online]. Available:
https://www.tutorialspoint.com/android/android_architecture.htm.
[28] Z. Banach, "session-hijacking," 22 08 2019. [Online]. Available:
https://www.netsparker.com/blog/web-security/session-hijacking/.
