
Malware Detection Using Neural Networks

by

Syed Irfan Kayum


17101272
Humaira Hossain
17101395
Nafisa Tasnim
17101143
Arja Paul
17301006
Alim Aldin Rohan
17101202

A thesis submitted to the Department of Computer Science and Engineering


in partial fulfillment of the requirements for the degree of
B.Sc. in Computer Science and Engineering

Department of Computer Science and Engineering


Brac University
January 2021

© 2021. Brac University


All rights reserved.
Declaration
It is hereby declared that

1. The thesis submitted is our own original work while completing a degree at
Brac University.

2. The thesis does not contain material previously published or written by a
third party, except where this is appropriately cited through full and accurate
referencing.

3. The thesis does not contain material which has been accepted, or submitted,
for any other degree or diploma at a university or other institution.

4. We have acknowledged all main sources of help.

Student’s Full Name & Signature:

Syed Irfan Kayum Humaira Hossain


17101272 17101395

Nafisa Tasnim Arja Paul


17101143 17301006

Alim Aldin Rohan


17101202

i
Approval
The thesis titled “Malware Detection Using Neural Networks” submitted by

1. Syed Irfan Kayum (17101272)

2. Humaira Hossain (17101395)

3. Nafisa Tasnim (17101143)

4. Arja Paul (17301006)

5. Alim Aldin Rohan (17101202)

of Fall 2020 has been accepted as satisfactory in partial fulfillment of the requirement
for the degree of B.Sc. in Computer Science and Engineering on January 11, 2021.

Examining Committee:

Supervisor:
(Member)

Moin Mostakim
Lecturer
Department of Computer Science and Engineering
Brac University

Co-Supervisor:
(Member)

Muhammad Iqbal Hossain, PhD


Assistant Professor
Department of Computer Science and Engineering
Brac University

Thesis Coordinator:
(Member)

MD. Golam Rabiul Alam, PhD


Associate Professor
Department of Computer Science and Engineering
Brac University

ii
Head of Department:
(Chair)

Mahbubul Alam Majumdar, PhD


Professor and Dean, School of Data and Sciences
Department of Computer Science and Engineering
Brac University

iii
Abstract
One of the major challenges facing the Internet today is the large volume of data
and files that must be analyzed for potentially malicious content. Malicious
software, also referred to as malware, is often polymorphic and metamorphic in
design: it can modify its own code as it spreads. The growth of malware and
increasingly sophisticated cyber attacks is becoming a serious issue. Unknown
malware that has not yet been identified by security vendors is often used in these
attacks, making it difficult to protect terminals from infection. A great deal of
research is currently being carried out to identify and monitor malware, and since
the emergence of deep learning, several researchers have attempted to detect malware
using neural networks and other deep learning methods. This paper compares the
performance of three neural network models for malware detection: Convolutional
Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Gated Recurrent
Units (GRU). In addition, we used secondary data to gather information about
malware activity.

Keywords: Convolutional Neural Network, Long-Short Term Memory Network,
Gated Recurrent Unit, secondary data, malware, threats

iv
Dedication
We dedicate this thesis to the researchers before us whose excellent work clarified
our understanding of this topic and provided the documentation we needed to conduct
our research.

v
Acknowledgement
First of all, all gratitude to the Great Allah, by whose grace our work was
completed without any significant interruption.
We also acknowledge the research papers we consulted, which are listed in the
Bibliography. Moreover, we want to thank our thesis supervisor and co-supervisor,
who guided us through every step of conducting this research. Lastly, we are
grateful to our university, which provided us with everything we needed for our
thesis research.

vi
Table of Contents

Declaration i

Approval ii

Abstract iv

Dedication v

Acknowledgment vi

Table of Contents vii

List of Figures ix

List of Tables x

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives and Contributions . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background 4
2.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Neural Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Convolutional Neural Network (CNN) . . . . . . . . . . . . . 8
2.2.2 Long short-term memory (LSTM) . . . . . . . . . . . . . . . . 10
2.2.3 Gated Recurrent Unit (GRU) . . . . . . . . . . . . . . . . . . 13
2.3 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Proposed Model 17
3.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Model Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 CNN model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 LSTM model . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3 GRU model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Experimental Setup 23
4.1 Basic libraries and functions needed for model setup . . . . . . . . . . 23
4.2 Model setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

vii
5 Result Analysis 27
5.1 Basic terminologies needed for result analysis . . . . . . . . . . . . . . 27
5.2 Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2.1 Result analysis for CNN model . . . . . . . . . . . . . . . . . 29
5.2.2 Result analysis for GRU model . . . . . . . . . . . . . . . . . 31
5.2.3 Result analysis for LSTM model . . . . . . . . . . . . . . . . 33
5.3 Accuracy analysis for all three models . . . . . . . . . . . . . . . . . . 35

6 Conclusion and Future Plan 36

Bibliography 40

viii
List of Figures

2.1 Structure of CNN [38] . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.2 Structure of feed-forward neural network [31] . . . . . . . . . . . . . . 10
2.3 Structure of recurrent neural network [1] . . . . . . . . . . . . . . . . 11
2.4 Structure of LSTM [34] . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Structure of GRU [42] . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Diagram showing how GRU works[42] . . . . . . . . . . . . . . . . . . 14
2.7 Graph showing Sigmoid Activation function [25] . . . . . . . . . . . . 15
2.8 Graph showing ReLU Activation function [27] . . . . . . . . . . . . . 16

3.1 Images of Benign and Malware . . . . . . . . . . . . . . . . . . . . . 17


3.2 Augmented Benign images . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Flowchart for CNN model . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Flowchart for LSTM model . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Flowchart for GRU model . . . . . . . . . . . . . . . . . . . . . . . . 22

4.1 Brief working process of how CNN model works with training, vali-
dation and testing data . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Brief working process of how LSTM model works with training, vali-
dation and testing data . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Brief working process of how GRU model works with training, vali-
dation and testing data . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.1 Diagram showing confusion matrix [39] . . . . . . . . . . . . . . . . . 27


5.2 Graph showing CNN model result on training and validation datasets 29
5.3 Confusion Matrix of CNN model . . . . . . . . . . . . . . . . . . . . 30
5.4 Graph showing GRU model result on training and validation dataset 31
5.5 Confusion Matrix of GRU model . . . . . . . . . . . . . . . . . . . . 31
5.6 Graph showing LSTM model result on training and validation dataset 33
5.7 Confusion Matrix of LSTM model . . . . . . . . . . . . . . . . . . . . 33
5.8 Histogram showing accuracy percentage of all three models . . . . . 35

ix
List of Tables

5.1 A table showing performance measure for CNN model on the testing
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 A table showing performance measure for GRU model on the testing
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.3 A table showing performance measure for LSTM model on the testing
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

x
Chapter 1

Introduction

1.1 Motivation
Malware is one of the biggest security threats in the technology industry. It can
steal sensitive data from a computer, and attackers routinely exploit it to harvest
information from users. The number of malicious applications is growing rapidly day
by day, malware identification is becoming increasingly difficult, and malware is
now also found on smartphones. The open nature of the Android platform in particular
acts as a breeding ground for malware development and can be exploited for cyber
attacks. Moreover, defending small and medium-sized companies against cyber attacks
is of considerable significance and a demanding area, since attacks affect them both
financially and technically. With so much malware circulating on the Internet posing
a major threat to user safety, there is an urgent need to improve malware detection
approaches and protect users from security threats.

1.2 Problem Statement


Nowadays, malware has a huge effect on cyber operations. It causes harm to
computers, servers, clients, and computer networks. Malware attacks take many
forms, including banking trojans, ransomware, viruses, worms, adware, and more.
These kinds of attacks pose a risk to business operations as well as a threat to
customers.
Instant message attacks: Malware spreads through IM links much as it does through
email links, and file transfer via file-sharing programs is another way for malware
to attack. The key issue is that these malicious programs are designed to propagate
through the system. Writing malware is not a challenging job, and thousands of such
programs crawl through almost every device, according to [41].
Some of the other ways that malware spreads: Computer lab systems can be infected,
and when you transfer files from an affected system to your USB stick, the malware
enters your device as well, as claimed by [41]. According to [36], some popular
forms of malware are described below:
Virus: Viruses can sabotage your computer system by corrupting your files, wiping
your hard disk, or slowing down your computer. A computer virus needs human
intervention to spread to other machines and is commonly shared through email links
and Internet downloads.
Worms: Worms spread through web servers by exploiting operating system flaws and
are one of the most familiar forms of malware. Their payloads can delete host
system files, encrypt data for ransom, gain illegal access, destroy data, and
construct botnets.
Trojan Horse: You allow cybercriminals access to your machine as soon as you
install a Trojan. Trojan malware does not reproduce on its own, but the harm
Trojans cause to hosts and services can be permanent, especially when combined
with a worm.
Spyware: Downloaded and installed without your knowledge, spyware is designed to
capture your browsing and activities on the World Wide Web.
Ransomware: This is a kind of malware that captures your data and demands money
before the data is handed back to you. The only way to stay safe or remove a
malware infection is to use anti-malware or anti-virus software.
There is really no end to the channels through which malware can reach your system,
but once inside your machine, it spreads automatically and often disrupts Internet
traffic.

1.3 Objectives and Contributions


Our research primarily focuses on cybersecurity vulnerability in Bangladesh. In
modern times, Bangladesh has been among the least protected countries in the cyber
world, and cyber attacks have caused losses of infrastructure in recent years. With
the growth of Internet technology, the number of attacks is also increasing, and
Bangladesh ranks second among many countries in terms of the level of attack [29].
In addition, the total number of affected IPs in Bangladesh was 34,552 in a recent
two-hour trial run by the Bangladesh Computer Council, and the IPs of prominent
brands such as Grameen Phone, Banglalion and Link 3 were even included in that count.
Our research output successfully detects malware in our selected dataset using
different neural network models. By conducting this research, we hope to present
the characteristics of different neural networks and how they can contribute to
addressing the cyber vulnerability of Bangladesh.

1.4 Thesis Structure


The thesis begins with an introduction that describes our research, our motivation
for undertaking it, and the problem that drew our focus to this area. The next
chapter, Background, describes the different neural networks in detail, including
their working structures, characteristics, and uses, and introduces the related
terminology. The subsequent chapter describes the dataset used in the research,
how the data was structured and preprocessed, and how the models work with the
preprocessed dataset. The following chapter explains the experimental setup we used
to build our models and obtain the required output. The next chapter presents the
essence of our findings and the feasibility of the proposed models, comparing the
accuracy achieved by each model. Finally, the further work we plan to carry out on
this research is described in the conclusion.

3
Chapter 2

Background

2.1 Literature Review


We have gone through a number of research papers to find out how others have
detected malware using machine learning methods and neural networks. The first
research paper, by Kim, Kabanga and Kang, proposed a convolutional gated recurrent
neural network model that classifies malware into 9 different families. Their model
has four prominent layers: a convolutional neural network (CNN) layer, a layer of
gated recurrent units (GRUs), a deep neural network (DNN) layer, and a sigmoid
layer. The evaluation used data from Microsoft's 2015 malware classification
competition, and the model showed an accuracy of 92.6 percent using the proposed
method [15].
Another research paper, by Yeo et al., presents an automated malware detection
approach using CNN and other machine learning algorithms. Rather than just port
numbers and protocols, 35 distinct features derived from packet flows were
collected, and CNN, MLP, SVM, and random forest classifiers were applied for
categorization. However, the paper concentrated only on detecting malware as
quickly as possible with large quantities of data, and the result showed only
85 percent accuracy [17].
The article by Shukla, Kolhe, PD and Rafatirad introduces a two-way procedure that
can successfully detect both traditional and stealthy malware. Traditional malware
in their paper comprises backdoors, rootkits, trojans, viruses, and worms, while
stealthy malware is hidden malware produced by many kinds of mutation techniques.
They used machine learning to detect traditional and stealthy malware with high
performance, even amid sophisticated crafting techniques, combining both HPC-based
and localized feature-based techniques with LSTM and RNN models for malware
categorization. The accuracy of their model on typical malware samples was
94 percent, and on stealthy malware almost 90 percent, with an F1-score of
92 percent and a recall of 91 percent [22].
We have also reviewed several papers related specifically to the identification of
Android malware. Firstly, the research paper by Oulehla, Oplatkova and Malanik
addresses the most dangerous form of mobile malware, botnets. Unlike typical mobile
malware, botnets have complicated behavior patterns because they are not driven by
predictable algorithms but are managed through C&C servers and/or peer-to-peer
networks. The paper suggested a parallel architecture using neural networks for
effective identification of mobile botnets. The study also stated that the current
architecture of the Android operating system does not allow antivirus programs to
obtain the information required to detect malware, including botnets, on non-rooted
mobile devices. However, the authors noted that this was an early phase of research,
with the parameters and structures used still being optimized [6]. Another similar
paper, by Jin, Liu, Qu and Chi, proposed a new way to detect Android malware: a
machine learning classifier combining CNN with Adaptive Selection of Classifiers
(ASC) to improve the accuracy of malware categorization. At the same time, it
performs a quick static analysis that is not very resource-consuming. In terms of
accuracy, their suggested approach is 4.27 percent higher than a traditional CNN,
and the authors plan to fuse the two approaches in future work [14]. Besides, the
research paper by Hasegawa and Iyatomi established an accurate and lightweight
Android malware detection process based on a one-dimensional Convolutional Neural
Network (1-D CNN) whose purpose is to examine only a very small part of the raw
APK. The most significant benefit of the 1-D CNN is its close connection to numerous
successful and renowned methods that improve on the basic learning approach of
CNNs, and the CNN acquires the key features for categorization during training. The
authors could not confirm why the model failed to detect certain other features,
because the APK file is compressed and string interpretation is unworkable, and this
is expected to be studied further in the future [12]. Another research paper, by
Zhao, Li, Zheng and Shi, focused on malware detection on Android smartphones. Day
after day the number of malicious applications is growing at a rapid rate, making
it increasingly difficult to detect malware. Their paper presented a new malware
detection technique based on deep neural networks that uses an optimized
Convolutional Neural Network to learn from opcode sequences: the optimized CNN is
trained repeatedly on the raw operation code sequences retrieved from decompiled
Android files. The results of the conducted experiment show an accuracy of
99 percent, which is 2 to 11 percent higher than other ML algorithms on the same
dataset [18]. In addition, a research article by Li, Wang and Xue focused on the
privacy and security issues of Android smartphones. Due to its flexible nature, the
Android platform is prone to malware: third-party applications can be installed by
users, and there are too many uncertified and unauthorized apps that may threaten
user privacy. In this article a deep learning technique has been used to address
this privacy issue by recognizing Android malware, and an engine was created that
can detect batches of malicious files. The outcomes of the research show that the
engine detected 97 percent of the malware at a false positive rate of 0.1 percent
[16]. These papers, however, studied only Android malware using several different
techniques, and all of them plan to extend their research with more precise and
broader parameters in the near future.
Another type of research, proposed by Chamou et al., concerns the protection of
small and medium-sized enterprises from cyberattacks, a very critical and difficult
field, as attacks affect these enterprises both economically and operationally. The
scientific community has therefore shown an interest in implementing intrusion
detection systems and optimizing their performance. Here, with the help of a deep
neural learning method, an effort was made to build a system that can detect
malicious behavior in the context of DDoS and malware cyber-threats. The paper aims
to use flow-based statistical data to implement an anomaly-based intrusion detection
method and reap the benefits of a deep learning model. The results showed a high
accuracy rate of 99.97 percent for DDoS detection and 99.44 percent for malware
detection, with a low FPR when differentiating malicious from regular network
traffic [19].
The research article by Tobiyama, Yamaguchi, Shimada, Ikuse and Yagi focused on
countering attacks after infection: since malware traffic closely resembles benign
traffic, they suggested detecting malware processes on the infected terminals by
training an RNN and a CNN to classify feature images. In their proposal, the
recurrent network learns an LSTM language model over recorded API call sequences,
acting as a feature extractor for the behavior of individual process flows, and the
extracted features are converted into feature images that the CNN classifies. In
conclusion, they validated the classifier using 150 log files of process behavior,
compared validation results obtained under several conditions, and obtained a best
AUC of 0.96 when the feature image size was 30x30 [7].
Besides, the paper by Wang, Zhu, Zeng and Sheng introduced a new take on traffic
classification from an artificial intelligence point of view and implemented a
traffic classification system using a CNN, treating traffic flows as images; raw
traffic was used as input, and this was the first attempt to apply representation
learning to the problem. Traditional port-based classification methods and DPI-based
methods classify traffic by retrieving empirical data using sets of specific
features. The machine learning approach is better than a rule-based approach because
it solves many of these problems, and the paper also demonstrates the efficacy of
representation learning [9].
The research paper by Saxe and Berlin, however, established a deep neural network
system created by Invincea Labs that attains useful detection accuracy at an
extremely low error rate; scaled up to a real-world training set of more than
400,000 software binaries collected directly from their customers and internal
malware database systems, it achieves a 95 percent detection rate at a 0.1 percent
FPR. They also described a non-parametric method for adjusting classifier scores so
that they adequately reflect the expected accuracy in the deployment environment.
According to the paper, this deployable deep-network malware detector, based on
unvarying features, has the best prediction performance of any previously published
detection engine [5].
The paper by Alsulami and Mancoridis notes that the growing volume of malware, the
wide range of malware families, and the broad array of labelling strategies applied
to malware by anti-virus providers pose problems for machine-learning-based malware
classification models. They present a behavioural-pattern classifier that uses a
Convolutional Recurrent Neural Network and data from Microsoft Windows data
retrieval files. Their model minimizes training time and overhead and, as a result,
shows the capacity to continually learn the behavior of new malware families [11].
Another research paper, by Poonguzhali, Rajakamalam, Uma and Manju, notes that the
growth of the Internet has significantly increased malicious code attacks, with
malicious code variants being a primary threat to Internet security. To prevent
security breaches, data theft, and other hazards, it is essential that malicious
code variants can be detected. The paper offers a method for detecting malware
variants using deep learning with a CNN. Static detection and dynamic detection are
the two traditional methods used in the paper: static detection is performed by
disassembling and analyzing the malware code, and dynamic detection by executing it
in a safe virtual environment or sandbox to analyze the malicious code's behavior.
Once the features are extracted, samples are categorized according to their
variants [21].
The paper by Gonzalez and Vazquez focuses on a malware feature vector in which each
feature is the number of API calls made from a given Dynamic Link Library. Existing
API-related techniques for categorization depend either on the sequence of functions
used to identify malicious program behavior or on the frequency of certain API
calls; this work analyzes the potential for malware categorization based on the
number of API calls per dynamic link library. Examination of the data obtained from
the samples led to observations about the patterns of imported functions and Dynamic
Link Libraries. The benefit of this approach lies in the simplicity of the feature
vector, given that the function-counting method can be written in any programming
language [3].
In addition, the article by Zhang et al. addresses ransomware, a specific kind of
malware that leads to irrecoverable data loss and causes immense loss of information
and economic cost. Some ransomware can fingerprint the run-time environment and
thereby evade dynamic analysis. To detect this kind of ransomware quickly, they
recommend a static analysis method in contrast to dynamic studies. N-gram opcodes
are used for deep learning: since the opcode sequences obtained from executable
files carry rich insight and procedural information, they view the opcode sequence
from the point of view of natural-language sentences. They divide the N-gram
sequence into several patches and feed each patch to a self-attention-based
convolutional neural network called SA-CNN. To the best of their knowledge, they are
the first to exploit a self-attention mechanism over opcode sequences for ransomware
classification. The system gathers rich semantic knowledge from the extremely long
sequences by combining the partition strategy with the power of the self-attention
network to detect ransomware. Even after removal, the impact of ransomware is
irreversible without the assistance of the ransomware writers; such widespread
attacks result in enormous financial losses and adverse effects on the operations
of the affected companies [24].
Another paper, by Abdelsalam, Krishnan, Huang and Sandhu, focuses on the detection
of malware in cloud infrastructure using CNNs. Over the years, cloud infrastructure
has become more susceptible to malware attacks: an attacker usually injects malware
to manipulate the victim's virtual machine, and within a data center malware can
spread quickly and cause massive crises for cloud service providers and their
customers. As a result, the need to detect malware on virtual machines is very
important. The paper introduces and evaluates an effective cloud-infrastructure
malware detection strategy using the Convolutional Neural Network (CNN). At first,
a standard 2D CNN is trained on the metadata available for each of the virtual
machine (VM) processes, collected through a hypervisor. Then, by using a new 3D
CNN, the precision of the classifier is improved, which noticeably helps to reduce
misidentified samples during data collection and training. The malware used in the
experiments was selected randomly. Their results showed that the 2D CNN model has
an accuracy of 79 percent, while the 3D CNN model improves the accuracy to
90 percent [10].
Finally, the paper by Hsien-De Huang and Kao observes that common approaches for
detecting Android malware require ongoing learning based on pre-extracted features
to ensure proper malware detectability. A color-inspired CNN-based Android malware
detection model (R2-D2) is proposed to remove the burden of feature engineering,
since it does not depend on retrieving pre-selected features. Their research adopts
an in-depth, end-to-end learning approach to constructing an Android malware
detection method built on a color-inspired CNN [13].

2.2 Neural Network Models


2.2.1 Convolutional Neural Network (CNN)
The Convolutional Neural Network is one of the key model categories for image
classification. Object detection, face identification, and similar tasks are some
of the areas where CNNs are commonly used, though they work best with image
classification tasks. A CNN image classifier takes an input image, analyzes it, and
assigns it to one of several categories. Computers see an input image as an array
of pixels, which depends on the resolution of the image. The shape of the array is
h x w x d (h = Height, w = Width, d = Dimension) based on the image resolution. For
a normal RGB image there are 3 different channels for Red, Green, and Blue, and
each channel has corresponding pixel values. For example, a 6 x 6 x 3 image is an
RGB matrix array (3 refers to the RGB channels), and a 4 x 4 x 1 image is a matrix
array of a grayscale image. Basically, there are four layers in this model, which
are as follows [38]:
• Convolutional Layer

• ReLU Layer

• Pooling Layer

• Fully Connected Layer


Convolutional Layer: There are many filters in this layer, and each filter performs
filtering on the image passed to the model. Each image is passed as a matrix of
pixel values; for example, an image of size 26x26 has 676 pixels, each containing a
value. The images are converted to grayscale beforehand for the sake of easier and
faster calculations inside the CNN model. The filters extract different features of
the object in the image by being moved over all areas of the image, a process known
as "filtering". What happens here is that the pixel values of a filter are
multiplied by the corresponding pixel values of the image; the products are then
added together and divided by the number of pixels in the filter matrix. The
resulting values form a new matrix, and in this way different filters produce
different matrices. The window size of the filters is specified beforehand.
ReLU Layer: The job of this layer is quite simple: all the negative values in the
matrices obtained after traversing each filter over the image become zero. This
works exactly like the ReLU activation function, where the output increases linearly
with the input beyond a certain point:

f(x) = 0 if x < 0, and f(x) = x if x >= 0.

That is how the ReLU activation function triggers a neuron.

Pooling Layer: This layer shrinks the size of the image. In this layer, there is a
window size that is specified. This window is moved according to the stride size
throughout the whole matrix that we obtained from ReLU layer and the maximum
pixel value from this window is taken out and is placed in a new matrix. After this,
we get a new matrix full of pulled out maximum values. This layer does the same to
all the matrices we obtained for different filters. These pulled out maximum pixel
values are in matrix form.
Fully Connected Layer: This is the final layer, where the actual identification
occurs. First, we flatten the matrix into a vector and feed it into a fully
connected layer, just like a standard neural network. To be more precise, we take
our filtered and shrunken images and put them into a single list; this vector
contains the pooled maximum values from all the matrices obtained for all the
filters. With the fully connected layers we therefore combine the features of an
image to build the model. For images of different objects, certain values in this
vector are high, and the CNN model learns these patterns during training. Finally,
an activation function such as Softmax or Sigmoid is applied for classification.

Figure 2.1: Structure of CNN [38]

To sum up, to train and evaluate a CNN model in deep learning, each input image
travels through a sequence of convolution layers with filters, pooling layers, and
fully connected (FC) layers, and a Softmax or Sigmoid activation function is applied
to identify an object with probabilistic values between 0 and 1 [38].
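To make the filtering, ReLU, and pooling steps above concrete, here is a minimal
NumPy sketch (our own illustration, not code from the thesis models; it uses the
standard sum-of-products form of filtering rather than the averaged variant
described above). It applies one 3x3 filter to a small grayscale matrix, zeroes out
the negative responses, and max-pools the result with a 2x2 window:

import numpy as np

def conv2d_single(image, kernel):
    # Slide one filter over a 2D grayscale image (stride 1, no padding).
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiply the filter with the image patch, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    # ReLU layer: replace every negative value with zero
    return np.maximum(0, x)

def max_pool(x, size=2):
    # keep the maximum value inside each non-overlapping size x size window
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(6, 6)                      # toy 6x6 grayscale image
edge_filter = np.array([[1, 0, -1],               # hypothetical 3x3 filter
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

feature_map = max_pool(relu(conv2d_single(image, edge_filter)))
print(feature_map.shape)                          # (2, 2)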

9
2.2.2 Long short-term memory (LSTM)
To begin with, LSTM, which stands for Long Short-Term Memory, is one of the most
interesting breakthroughs in the field of deep learning and data science. LSTM is
an improved version of the generic recurrent neural network. If we compare LSTM
with the basic RNN, we see that, unlike the RNN, LSTM models are capable of learning
long-range dependencies in sequence data. This ability is very helpful in dynamic
problem areas such as translating between languages, captioning a given image, and
predicting the next portion of a text. Even today, they are used to deliver
world-class results. An LSTM model contains a cell state and three different gates:
the cell state keeps track of data over intervals of time, and the LSTM gates
control the transfer of information to and from the cell. LSTMs are somewhat
distinct from other deep learning methods, such as Multilayer Perceptrons (MLPs)
and Convolutional Neural Networks (CNNs), in that they are primarily designed for
sequence prediction problems. Among recurrent neural networks, LSTM is a better way
to address the vanishing or exploding gradient problem.
To understand LSTM we first need to be familiar with the basic feed-forward neural
network model, and to see why it is unsuitable for sequence processing. For example,
if we take a feed-forward neural network trained for image classification and feed
it some input x1, the network produces an output f(x1); if we feed it another input
x2, it produces an output f(x2). The previous output is not used and has no relation
to the new output, which is a significant disadvantage for problems such as text
generation, text-to-text conversion, and sequence forecasting.

Figure 2.2: Structure of feed-forward neural network [31]

In order to solve this problem, the Recurrent Neural Network model was proposed. It
has many nodes, and among them there are connections that form a directed graph
along a temporal sequence, which allows the network to exhibit dynamic behavior.
The recurrent neural network uses its own internal memory to process sequences of
inputs, which makes it applicable to the tasks discussed above, and it has one or
more hidden layers [1]. For an RNN with two hidden layers, each layer's computation
involves feedback: the first hidden layer, the second hidden layer, and the output
layer of the RNN. The RNN described here combines the single-hidden-layer multilayer
perceptron with a state-space model, resulting in a loop in which the previous
hidden layer output can be used to predict or generate the later hidden layer
outputs, producing a sequence of data that can be fed to a feed-forward neural
network for classification.

Figure 2.3: Structure of recurrent neural network [1]

However, there is a drawback in the generic RNN structure: due to the vanishing
gradient problem, the model suffers from what is commonly called "short-term
memory". The vanishing gradient problem also appears in other neural network models.
As a neural network trains and learns over further steps, it has difficulty
recalling earlier data; as a consequence, data from initial time steps may seem not
to exist at the final stage. This mainly shows up when the model performs
back-propagation. Back-propagation is an algorithm for training and optimizing
neural networks: first the model makes a forward pass, which gives a certain output;
that output is compared with the target output, and the loss function produces an
error value, which is an estimate of how badly the network performed. The error
value is then used for back-propagation, which computes a gradient for each node in
the network. Using these gradient values the model adjusts its internal weights and
learns over time. If the gradient value is high, the adjustments made in the network
are also larger, and vice versa. The problem arises because each node adjusts its
weights with respect to the gradients of the adjacent layer: if the changes made in
that layer are minimal, the adjustments in the next layer back are even smaller.
This causes the gradients to shrink exponentially. Since the weights of the earliest
layers are scarcely changed due to the extremely small gradients, those layers do no
learning, and that is the vanishing gradient problem. For this reason, the Recurrent
Neural Network model has difficulty learning long-range sequences over time because
of vanishing gradients [37]. The LSTM model was proposed to combat these problems.
LSTM has the same control flow as a recurrent neural network: it processes
information and passes data forward sequentially as it propagates. The main
difference between LSTM and RNN lies in the cell operations, which make it easier
for the LSTM to forget or retain data. The cell state and the gates are the core
concepts of LSTM. The cell state functions as a communication bridge that transfers
relevant information all the way through the sequence chain; it is also referred to
as the network's memory. The cell can carry data throughout the sequence processing,
even from the initial time step to the last, thus minimizing the effects of
short-term memory. The activation functions used in the LSTM are sigmoid and tanh,
which are among the most popular activation functions in neural network models. The
tanh function normalizes output values into the range (-1, 1), while the sigmoid
function limits values further to the range (0, 1). Lastly, the distinct gates in
the LSTM regulate the flow of data in the layer; those gates are:

• Forget gate,

• Input gate and

• Output gate

Figure 2.4: Structure of LSTM [34]

The forget gate determines what data should be discarded. The current input and the
immediately preceding hidden state are passed to the sigmoid activation function,
which normalizes the values between 0 and 1; that value is then used to update the
cell state. The same inputs are also passed to the input gate, where they go through
both a sigmoid function and a tanh function; the two outputs are then multiplied,
and the resulting value is the output of the input gate. The outputs of the forget
gate and the input gate are then used to update the cell state: first, the previous
cell state is multiplied by the forget gate output; afterwards, the output of the
input gate is added to it, which updates the cell state. Lastly, in the output gate,
the current input and the previous hidden state are again passed through a sigmoid
function, the new cell state is passed through a tanh function, and the two results
are multiplied, creating the new hidden state, which is then passed on to the next
time step.
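For reference, the walkthrough above corresponds to the standard LSTM update
equations (the usual textbook formulation, stated here for clarity rather than taken
from the thesis models), where σ is the sigmoid function, W and b are learned
weights and biases applied to the concatenation [h_{t-1}, x_t], and ⊙ is
element-wise multiplication:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)          (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)          (input gate)
C~_t = tanh(W_C · [h_{t-1}, x_t] + b_C)      (candidate cell state)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C~_t             (updated cell state)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)          (output gate)
h_t = o_t ⊙ tanh(C_t)                        (new hidden state)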

12
2.2.3 Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (GRU) is a type of recurrent neural network with
similarities to LSTM [35]. The GRU model is capable of representing very complicated
systems due to its specifically built structure.
A GRU has two gates, a reset gate and an update gate, but notably lacks an output
gate. These two gates decide which data should be passed on to the output, and they
can be trained to preserve information from earlier in the sequence. In general,
having fewer parameters means that GRUs are easier and faster to train than their
LSTM equivalents.
The gated recurrent unit allows each recurrent unit to adaptively capture
dependencies on different time scales by modulating the information flow within the
unit, but without using separate memory cells. The functions of the two gates of a
GRU are:

• Update Gate: Decides how much past information should be carried forward into
the future. In an LSTM, it plays a role similar to the combination of the input
gate and the forget gate.

• Reset Gate: Decides how much of the past information to forget [33].

Gated recurrent neural networks have shown success in a variety of applications
involving sequential or temporal data; they have been commonly used, for example,
in speech recognition and natural language processing. Their performance is mainly
due to the gating signals, which decide how the current input and the previous
memory are used to update the current activation and produce the current state. In
addition, these gates have their own weight sets that are adapted throughout the
learning process (that is, during training and evaluation) [8]. In a GRU, the gates
help determine which information should be passed on and which should be dropped.
The gate values lie between 0 and 1: if a value is very close to 0 the information
is dropped, and if it is close to 1 the information is retained.

Figure 2.5: Structure of GRU [42]

Here, the input and the hidden state together enter the GRU. The reset gate and the
update gate then work together to generate the output and the new hidden state.

Figure 2.6: Diagram showing how GRU works[42]

Let us see how the diagram above works: for the first input there may be no hidden
state yet, so the input alone enters the GRU cell, which generates an output and a
hidden state. For the next GRU cell, we have that hidden state as well as the second
input, from which the second output and a new hidden state are generated. This
process continues until the last state.
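For reference, this walkthrough corresponds to the standard GRU update equations
(the usual textbook formulation, not taken from the thesis models), where σ is the
sigmoid function and ⊙ is element-wise multiplication:

z_t = σ(W_z · [h_{t-1}, x_t])                (update gate)
r_t = σ(W_r · [h_{t-1}, x_t])                (reset gate)
h~_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])        (candidate hidden state)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h~_t       (new hidden state)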

14
2.3 Activation Functions
Activation functions are among the most important components of any neural network,
which has to handle very complicated tasks such as image recognition, language
translation, and object detection; without these functions, such tasks would be
incredibly difficult to execute. Activation functions essentially decide whether to
activate or deactivate neurons to obtain the required outcome. Without them, the
weights and biases would only perform a linear transformation; by using activation
functions, the neural network applies a non-linear transformation to the data,
allowing complex problems such as language translation and image recognition to be
solved. Besides, activation functions are differentiable, which makes
back-propagation possible and provides an efficient strategy for calculating
gradients of the loss function during back-propagation. We use two types of
activation functions for our three models: Sigmoid and ReLU.
1. Sigmoid: The sigmoid is used mainly because it does its job with good
performance. It is essentially used for probabilistic decision-making, and its
output ranges from 0 to 1, so we use this activation function when we have to
make a decision or estimate an output probability. In a model it introduces
non-linearity, since it determines which values to pass on as output and which
not to. It is also known as the logistic function.
The equation of the sigmoid activation is

S(x) = 1 / (1 + e^(-x)),

where x is the input and S(x) is the output.

Figure 2.7: Graph showing Sigmoid Activation function [25]

15
2. ReLU: The ReLU, also known as the rectified linear activation function, is a
piecewise linear function that outputs the input directly if it is positive and
outputs zero otherwise. It has become the default activation function for many
kinds of neural networks because models that use it are often easier to train
and achieve better results.
The equation of the ReLU function is

ReLU(x) = max(0, x).

Figure 2.8: Graph showing ReLU Activation function [27]
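As a quick illustration of these two activation functions (our own sketch in NumPy,
not code from the thesis):

import numpy as np

def sigmoid(x):
    # S(x) = 1 / (1 + e^(-x)): squashes values into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU(x) = max(0, x): passes positive values through, zeroes out negatives
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))   # approximately [0.12 0.38 0.5  0.62 0.88]
print(relu(x))      # [0.  0.  0.  0.5 2. ]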

16
Chapter 3

Proposed Model

3.1 Dataset Description


We have used a collected dataset that contains information from static analysis:
using the nearest-neighbor interpolation algorithm, the raw PE byte stream was
resized to a 32 x 32 grayscale image and then stretched into a vector of 1024 bytes.
The PE malware samples were retrieved from virusshare.com, and the PE goodware
samples were obtained from portableapps.com and Windows 7 x86 directories [20]. In
our dataset there is a column "hash" that contains the MD5 hash of each example.
There are also columns named pix0, pix1, . . . , pix1023 that contain the grayscale
pixel values of the malware and benign images. Finally, there is a class column
named "malware" that indicates whether the file is malware or benign
(0 → Goodware, 1 → Malware).
Data preprocessing technique: At first, we manually converted each row of our
collected dataset into images as shown below:

(a) Benign (b) Malware

Figure 3.1: Images of Benign and Malware
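A minimal sketch of this row-to-image conversion is given below (our own
illustration: the column names hash, pix0 . . . pix1023 and malware follow the
dataset description above, while the file name and output folders are hypothetical):

import os
import numpy as np
import pandas as pd
from PIL import Image

df = pd.read_csv("malware_dataset.csv")            # hypothetical path to the collected dataset
pixel_cols = [f"pix{i}" for i in range(1024)]      # pix0 ... pix1023

os.makedirs("malware", exist_ok=True)
os.makedirs("benign", exist_ok=True)

for idx, row in df.iterrows():
    # 1024 grayscale byte values reshaped back into a 32x32 image
    pixels = row[pixel_cols].to_numpy(dtype=np.uint8).reshape(32, 32)
    label_dir = "malware" if row["malware"] == 1 else "benign"
    Image.fromarray(pixels, mode="L").save(os.path.join(label_dir, f"{idx}.png"))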

17
Figure 3.2: Augmented Benign images

Since we had fewer benign rows in the collected dataset, we had to perform data
augmentation on the benign images only. Data augmentation is a technique that can
be used to expand an image-based dataset. In this augmentation we used several
properties to generate new benign images, namely rotation, height shift, width
shift, zoom, and horizontal flip, each within a fixed range. Some of the augmented
benign images are shown above in Figure 3.2, and a sketch of the augmentation setup
follows below.
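A sketch of this augmentation setup using Keras' ImageDataGenerator (our own
illustration; the thesis states that fixed ranges were used for these properties but
not their exact values, so the numbers and file paths here are assumptions):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,         # rotation
    width_shift_range=0.1,     # width shift
    height_shift_range=0.1,    # height shift
    zoom_range=0.1,            # zoom
    horizontal_flip=True,      # horizontal flip
)

benign = np.load("benign_images.npy")            # hypothetical array of shape (n, 32, 32, 1)
flow = augmenter.flow(benign, batch_size=32,
                      save_to_dir="benign",      # augmented copies written into the benign folder
                      save_prefix="aug", save_format="png")
for _ in range(10):                              # generate and save a few augmented batches
    next(flow)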
We made two directories i.e. ”malware”, and ”benign” and placed these images into
respective directories.
The images we now have are not all of the same size, but to normalize everything we
need them in the same shape, so we resized all the images in both directories to
50x50. We then read the images one by one as 2D arrays, where each value is a pixel
value, and passed them into our training set along with the correct label (0 for
benign, 1 for malware). To ensure proper learning, we shuffled the training dataset
so that it has a good balance of malware and goodware. To feed the data to our
neural networks we used two lists: one for the feature set and the other for the
labels. From the training set we stored the 2D arrays in the feature set and the
labels in the label set. Lastly, we converted the feature list into a NumPy array
and reshaped it to 50x50, because neural networks cannot work with Python lists, and
we normalized all the values by dividing each pixel value by 255. For testing
purposes we created two directories, "malware" and "benign", and preprocessed our
testing dataset the same way as the training dataset.
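A minimal sketch of this preprocessing pipeline follows (our own illustration,
assuming the images sit in train/benign and train/malware folders; the exact paths
and the use of OpenCV for loading and resizing are assumptions):

import os
import random
import cv2
import numpy as np

IMG_SIZE = 50
CATEGORIES = ["benign", "malware"]                   # label 0 for benign, 1 for malware

training_data = []
for label, category in enumerate(CATEGORIES):
    folder = os.path.join("train", category)
    for fname in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, fname), cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))  # force every image to 50x50
        training_data.append((img, label))

random.shuffle(training_data)                        # balance malware and goodware during learning

X = np.array([img for img, _ in training_data]).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
y = np.array([label for _, label in training_data])
X = X / 255.0                                        # normalize pixel values to [0, 1]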

18
3.2 Model Description
3.2.1 CNN model
Firstly, we added a convolutional layer with 64 filters, each with a 3x3 window.
The pixel matrix of each image is passed to this convolutional layer, where the
convolution occurs: different filters extract different features of the image.
Secondly, there is a ReLU layer that removes all the negative values from the
matrices obtained after the convolution operation by all the filters. Then we have
a pooling layer that pulls out the maximum value within a window from the matrices
produced by the different filters after passing through the ReLU layer; in our model
this pool size is 2x2. We then repeat the same three layers, i.e. a convolutional
layer, a ReLU layer, and a pooling layer, all with the same window sizes as before.
Next, we flatten the data before passing it to the fully connected feed-forward
network; by doing so we obtain a vector containing the maximum pixel values pulled
out by the pooling layer. We flatten here because a fully connected network works
only with a 1D array. Our fully connected feed-forward layer has 64 neurons and uses
the ReLU activation function. After that, there is another densely connected layer
with 128 neurons, also with ReLU activation to trigger the neurons. Finally, we have
a single-neuron output layer that uses the Sigmoid activation function. We used
"adam" as our optimizer and set "accuracy" as the metric. To train the model we used
a batch size of 64 and 20 epochs, and to monitor out-of-sample accuracy we declared
a validation split of 30 percent. A minimal Keras sketch of this architecture is
shown below.
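A minimal Keras sketch of the CNN architecture described above (layer sizes and
training settings follow the text; anything not stated there, such as the exact
input shape, is our assumption):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(64, (3, 3), activation="relu", input_shape=(50, 50, 1)),  # convolution + ReLU
    MaxPooling2D(pool_size=(2, 2)),                                  # 2x2 max pooling
    Conv2D(64, (3, 3), activation="relu"),                           # same three layers again
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),                                                       # 2D feature maps -> 1D vector
    Dense(64, activation="relu"),
    Dense(128, activation="relu"),
    Dense(1, activation="sigmoid"),                                  # 0 -> benign, 1 -> malware
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, batch_size=64, epochs=20, validation_split=0.3)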

Figure 3.3: Flowchart for CNN model

19
3.2.2 LSTM model
To begin with, we used the same data and preprocessing method in all our neural
network models. From our dataset, all the images in both directories were resized
to 50x50. We then read the images one by one as 2D arrays, where each value is a
pixel value, and passed them into our training set along with the correct label
(0 for benign, 1 for malware). Unlike the convolutional neural network (CNN), where
the input data is four-dimensional (number of samples, number of rows, number of
columns, and number of channels), an LSTM layer takes a three-dimensional input in
an S-T-F arrangement, where S is the number of samples, T is the number of time
steps to trace in order to generate sequences, and F is the number of features of
the data. So, to feed the data into our model, we reshaped the input accordingly
and then added an LSTM layer with 64 neurons. Here the 64 neurons represent how
many hidden units there are in this layer and also the output dimension, since we
get an output hidden state at the end of the LSTM. Moreover, since we resized our
images to 50x50, the LSTM layer is given sequences of 50 time steps, each
represented by a feature vector of length 50. Besides, we set the activation to
ReLU, a very popular activation function in different neural networks; the ReLU
eliminates all negative values obtained from the LSTM layer operation. Afterward,
the sequential output of the LSTM network is fed into a Dense layer containing
100 neurons with ReLU activation for further feature extraction. Lastly, the output
of the first Dense layer is fed into another dense layer with one neuron and the
Sigmoid activation function, which squashes the output into the range (0, 1); a
value close to 0 is recognized as benign and a value close to 1 as malware by the
LSTM model. Moreover, we used "adam" as our optimizer and set "accuracy" as the
metric, and to train the model we used a batch size of 64 and 20 epochs, just like
the CNN model. Also, to monitor out-of-sample accuracy we declared a validation
split of 40 percent. A minimal Keras sketch of this model is shown below.
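A minimal Keras sketch of this LSTM model (settings follow the text above; the
reshaping of X into 50 time steps of 50 features is our reading of the S-T-F
description):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

X_seq = X.reshape(-1, 50, 50)        # (samples, 50 time steps, 50 features)

model = Sequential([
    LSTM(64, activation="relu", input_shape=(50, 50)),   # 64 hidden units
    Dense(100, activation="relu"),                       # dense layer for further extraction
    Dense(1, activation="sigmoid"),                      # close to 0 -> benign, close to 1 -> malware
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_seq, y, batch_size=64, epochs=20, validation_split=0.4)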

20
Figure 3.4: Flowchart for LSTM model

3.2.3 GRU model


In our GRU model we used the same preprocessed data as in the other models. All the
images in both folders were resized to 50x50; we then read the images one by one as
2D arrays, where each value is a pixel value, and passed them into our training set
along with the correct label (0 for benign, 1 for malware).
The malware images are turned into machine-readable vectors and then processed as a
sequence of vectors, one by one. During processing, the previous hidden state is
transferred to the next step of the chain. The hidden state acts as the neural
network's memory: it stores knowledge about the data previously seen by the system.
GRU is a more recent, LSTM-like variant of the RNN. The GRU does away with the
separate cell state and uses the hidden state to pass data. It has only two gates,
a reset gate and an update gate. The update gate functions similarly to the
combination of the forget gate and the input gate of an LSTM: it decides what
content to discard and what new data to add. The reset gate is used to decide how
much of the past to forget. GRU involves fewer tensor operations, so it is a little
faster to train than LSTM. In short, our GRU model will learn to keep only the
necessary details and to forget non-relevant information.
First, the input and the previous hidden state are combined to form a vector. The
vector passes through the tanh activation function, and the result is the new
candidate hidden state; this is how a fresh hidden state is defined. The tanh
activation is used to regulate the values that circulate through the network,
squashing them into the range -1 to 1. Because of the various mathematical
operations, vectors undergo multiple transformations as they move through the
neural network.
The gates use the sigmoid function, whose role is similar to that of tanh, except
that it squashes values between 0 and 1 instead of between -1 and 1. This is useful
for retaining or deleting information: any value multiplied by 0 becomes 0, so that
information vanishes and is "discarded", while any value multiplied by 1 stays the
same and is "preserved". The network learns which information is not necessary and
can be ignored, and which data is vital to retain.
We processed the input data accordingly to integrate it into our model and then
added a GRU layer with 64 neurons. Here, the 64 neurons define the size of the
hidden state for this layer and therefore also the output dimension, since a hidden
output state is produced at each step of the GRU. In addition, because we resized
our images to 50x50, each sample is given to the GRU layer as a sequence of 50
timesteps, each a vector of 50 features. We also set the activation to "ReLU", which
is a very common activation function in neural networks; here it discards all negative
values produced by the GRU layer. The sequential output we receive from the GRU
network is then fed into a Dense layer containing 100 neurons with the "ReLU"
activation for further feature extraction, just like the LSTM.
Finally, the output obtained from the first Dense layer is fed to another Dense layer
with one neuron having the sigmoid activation function, which squashes the output
into the range (0, 1). A value close to 0 will therefore be recognized as a benign
image and a value close to 1 as a malware image by the GRU model. Moreover, we
used "adam" as our model optimizer, set "accuracy" as the metric, and trained the
model with a batch size of 64 for 20 epochs, just like the LSTM model. Also, to
monitor accuracy on held-out samples, we declared a validation split of 40 percent.
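The corresponding Keras layer stack differs from the LSTM sketch above only in the recurrent layer; a minimal sketch of the model described in this subsection:

    from keras.models import Sequential
    from keras.layers import GRU, Dense

    model = Sequential()
    model.add(GRU(64, input_shape=(50, 50), activation="relu"))  # 50 timesteps of 50 features each
    model.add(Dense(100, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])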

Figure 3.5: Flowchart for GRU model

Chapter 4

Experimental Setup

4.1 Basic libraries and functions needed for model setup
1. Jupyter notebook: We build our models in Jupyter Notebook which is a
very popular IPython notebook used for data analysis.

2. NumPy: We used the NumPy library for working with arrays; during data
pre-processing we converted the feature set list into a NumPy array, as we have
already described in the data description part.

3. Keras: To create our models, we used Keras, a free, open-source Python library
for developing and evaluating deep learning models. It wraps the efficient "Theano"
and "TensorFlow" numerical computing libraries and allows us to define and train
neural network models in a few lines of code [4].

4. Sequential(): We imported Sequential from Keras, which lets us conveniently
stack network layers in sequence from input to output [23].

5. Binary cross-entropy: For binary classification problems, binary cross-entropy
is the preferred loss function. It is designed to be used with a binary classification
model whose target values are 0 or 1. Binary cross-entropy is selected as the loss
function in Keras by specifying 'binary crossentropy' when the model is compiled [26].

6. Adam Optimizer: Adam is an optimization algorithm that can be used to
update network weights iteratively based on training data, as an alternative to the
classical stochastic gradient descent procedure [28].

7. Accuracy Metric: The accuracy metric measures the proportion of all outcomes
that the model predicts correctly.

8. Epoch: An epoch is a single pass of the full training set through the network.
Since we use a limited dataset, the full dataset needs to be passed through the same
neural network many times, so we use more than one epoch to optimize the learning
process [40].

9. Batch size: The batch size is the number of samples that will be transmitted
through the network at one time. However, batch size and epoch are not the
same things.

batches per epoch = training set size / batch size

Generally, the larger the batch size, the faster the model can complete each
epoch during training. This is because, based on the computing resources,
computers might be capable of processing more than one sample at a time
[32].
10. model.compile() : Keras platform gives a compile() method for compiling a
model where loss function, optimizer and metrics are passed.
11. model.fit() : It is used to transfer training and validation data where batch
size and epochs are also specified.
12. model.predict() : The predict() function takes an array of one or more data
instances and can be used to assess the performance of the models on the test dataset.
13. Scikit-learn (sklearn module): Scikit-learn is a Python library that offers
both unsupervised and supervised learning algorithms. It is based on some of
the technologies that we may already know, like NumPy, Pandas, and Mat-
plotlib [2].
14. Train-test split: Train-test split is a technique for assessing the performance
of a model; it divides the data into a training portion and a testing portion.

4.2 Model setup


We implemented our models in Jupyter Notebook and used Keras, a free open-source
Python library, to create them. Besides, we used the NumPy library to convert the
feature set list into a numpy array, as we have discussed earlier in the data description
section.
1. For CNN model: From Keras, we imported Sequential which allows us to
sequentially stack the layers and also imported Conv2D, MaxPooling2D, Flat-
ten, Activation, Dense, Dropout for building the Convolution Neural Network
(CNN) model and we created our CNN model as described earlier in the model
description part. After that, we compiled our model using the model.compile()
by defining loss function as “binary crossentropy”, optimizer as “adam” and
metrics as “accuracy”. Lastly, we used model.fit() to pass the training and
validation data along with batch size 64 and 20 epochs.
We pre-processed the testing dataset the same way the training dataset was
pre-processed. Then, we used the model.predict() function, passing it an array of
test samples to evaluate model performance on the testing dataset. To analyze our
model results we used a confusion matrix, which we imported from sklearn. The
obtained results are shown and described in the Result Analysis section of the next
chapter.
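A minimal sketch of this evaluation step is given below. It assumes that model is the compiled CNN from above, that X_test and y_test hold the pre-processed test images and their labels, and that the sigmoid output is thresholded at 0.5 (the threshold is our assumption, not stated in the text):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    probs = model.predict(X_test)                 # sigmoid outputs in (0, 1)
    y_pred = (probs > 0.5).astype(int).ravel()    # threshold into hard 0/1 labels
    print(confusion_matrix(y_test, y_pred))       # rows: actual classes, columns: predicted classes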

Figure 4.1: Brief working process of how CNN model works with training, validation
and testing data

2. For LSTM model: Unlike CNN, in this model we imported the train-test split
method from sklearn and used it to divide the data into two segments, a train set
and a test set. However, the test set here was passed as validation data when we
fitted our model. From Keras, we imported
Sequential which allows us to sequentially stack the layers, and also imported
LSTM and Dense for building the Long short-term memory (LSTM) model
and we created our LSTM model as described earlier in the model description
part. After that, we compiled our model using the model.compile() by defining
loss function as “binary crossentropy”, optimizer as “adam” and metrics as
“accuracy”. Lastly, we used model.fit() to pass the training and validation
data along with batch size 64 and 20 epochs.
We pre-processed the testing dataset the same way the training dataset was
pre-processed. Then, we used the model.predict() function, passing it an array of
test samples to evaluate model performance on the testing dataset. To analyze our
model results we used a confusion matrix, which we imported from sklearn. The
obtained results are also shown and described in the Result Analysis section of the
next chapter.
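A minimal sketch of this splitting step, assuming X and y are the full pre-processed feature and label arrays and model is the compiled LSTM from the previous chapter; the test_size value and random_state are illustrative assumptions, since the exact split fraction is not restated here:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

    # the held-out portion is passed as validation data when fitting, as described above
    model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=64, epochs=20)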

Figure 4.2: Brief working process of how LSTM model works with training, valida-
tion and testing data

3. For GRU model: Same as for the LSTM model, we again imported the train-test
split method from sklearn and used it to divide the data into two parts, a train set
and a test set. Again, just like for the LSTM, the test set here was passed as
validation data when we fitted our model. From Keras, we
imported Sequential which allows us to sequentially stack the layers, and also
imported GRU and Dense for building the Gated Recurrent Unit (GRU) model
and we created our GRU model as described earlier in the model description
part. After that, we compiled our model using the model.compile() by defining

loss function as “binary crossentropy”, optimizer as “adam” and metrics as
“accuracy”. Lastly, we used model.fit() to pass the training and validation
data along with batch size 64 and 20 epochs.
We pre-processed the testing dataset the same way the training dataset was
pre-processed. Then, we used the model.predict() function, passing it an array of
test samples to evaluate model performance on the testing dataset. To analyze our
model results we again used a confusion matrix imported from sklearn. The obtained
results are also shown and described in the Result Analysis section of the next
chapter.

Figure 4.3: Brief working process of how GRU model works with training, validation
and testing data

Chapter 5

Result Analysis

5.1 Basic terminologies needed for result analysis


1. Confusion Matrix: A confusion matrix is a method for illustrating the results
of a classification algorithm. It gives a clearer understanding of what the classification
model is getting right and what sorts of mistakes it is making. Since ours is a binary
classification problem, the matrix distinguishes observations of one class from
observations of the other, so there are two rows and two columns. Rows portray
actual classes and columns portray predicted classes, and each is divided into a
negative and a positive part. In the diagram below it is clearly labeled-

Figure 5.1: Diagram showing confusion matrix [39]

Here in the diagram above-


• True positive (TP) determines appropriately predicted positive values.
• False-positive (FP) determines inappropriately predicted positive values.
• True negative (TN) determines appropriately predicted negative values.
• False-negative (FN) determines inappropriately predicted negative val-
ues.
2. Accuracy: Accuracy is one way to monitor how well the model categorises the
dataset. It is the number of correctly classified data points out of all data points.
More formally, it is the sum of the true positives and true negatives divided by the
total number of true positives, true negatives, false positives, and false negatives [30].

accuracy = (TP + TN) / (TP + TN + FP + FN)

3. Precision: Precision is the ratio of correctly predicted positive cases to all
predicted positive cases, that is, to the correctly and the falsely predicted positives
together. In retrieval terms, it is the fraction of the retrieved records that are
relevant to the query [39].

precision = TP / (TP + FP)

4. Recall: Recall is the percentage of all relevant outcomes that the model classifies
correctly. It is the ratio of correctly identified positive cases to all actual positive
cases, that is, to the sum of the true positives and the false negatives [39].

recall = TP / (TP + FN)

5. F1-score: The F1-score expresses the balance between precision and recall. The
closer the score is to 1, or 100 percent, the better the model is performing [39]. A
short computational sketch of these four measures is given after this list.

F1 = 2 * precision * recall / (precision + recall)
6. Sample Loss: Sample loss is measured on the training dataset and it portrays
how the model is learning on the known dataset. Loss is not a percentage; it is an
aggregation of the errors made on each training example. The smaller the loss, the
better the model is learning.

7. Sample Accuracy: Sample accuracy is calculated on the training dataset and
it represents how the model is learning on the known dataset. Accuracy is measured
as a percentage, and the higher the percentage, the better the model is learning.

8. Validation Loss: Validation loss is calculated on a validation dataset that is
held back from the training dataset and it represents how the model is performing
on the unseen portion of the training data. Loss is not a percentage; it is an
aggregation of the errors made on each example during validation. The smaller the
loss, the better prepared the model is for classifying a real-world or testing dataset.

9. Validation Accuracy: Validation accuracy is calculated on the validation
dataset that is held back from the training dataset and it represents how the model
is performing on the unseen portion of the training data. Accuracy is measured as
a percentage, and the higher the percentage, the better prepared the model is for
classifying a real-world or testing dataset.
10. Overfitting: Overfitting occurs when a model learns the training dataset too
well: it performs effectively on the training dataset but does not perform effectively
on a holdout dataset, known as the validation dataset.

11. Underfitting: Underfitting occurs when a model fails to properly learn the
problem: it performs poorly on the training dataset and, as a result, does not perform
effectively on the holdout (validation) dataset either.
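A short sketch of how the four measures defined above can be computed from a model's predictions with scikit-learn; the arrays y_true and y_pred are illustrative names for the actual and predicted labels:

    from sklearn.metrics import confusion_matrix, classification_report

    # y_true: actual labels (0 = benign, 1 = malware); y_pred: predicted labels
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)

    # classification_report prints precision, recall and F1 for both classes at once
    print(classification_report(y_true, y_pred, target_names=["benign", "malware"]))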

5.2 Model Results


5.2.1 Result analysis for CNN model

Figure 5.2: Graph showing CNN model result on training and validation datasets

From the above graph, as the epochs proceed the validation loss decreases while the
validation accuracy increases. Therefore, it can be noted that the CNN model is
neither overfitting nor underfitting: the model performs as well on the validation
dataset as it does on the training dataset, and the learning process improves with
each epoch. Moreover, the model achieves consistent results over the final epochs.
The average validation accuracy for the CNN model is 84 percent.
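Curves such as those in Figure 5.2 can be plotted directly from the history object returned by model.fit(); a minimal sketch, assuming a Keras version whose metric keys are named "accuracy" and "val_accuracy" (older versions use "acc" and "val_acc"):

    import matplotlib.pyplot as plt

    # history is the object returned by model.fit(...)
    hist = history.history
    plt.plot(hist["loss"], label="training loss")
    plt.plot(hist["val_loss"], label="validation loss")
    plt.plot(hist["accuracy"], label="training accuracy")
    plt.plot(hist["val_accuracy"], label="validation accuracy")
    plt.xlabel("epoch")
    plt.legend()
    plt.show()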

Figure 5.3: Confusion Matrix of CNN model

In the above diagram, the confusion matrix of CNN is shown. We are classifying
malware as positive events (1) and benign as negative events (0). The rows represent
actual values and the columns represent predicted values.
Here, 2,418 malware samples are correctly predicted by the model, so this is the true
positive (TP) count, and 546 benign samples are correctly predicted, so this is the
true negative (TN) count. Also, 537 samples are actually benign (negative events)
but the model predicted them as malware (positive events), so this is the false
positive (FP) count, and 82 samples are malware (positive events) but the model
predicted them as benign (negative events), so this is the false negative (FN) count.
Thus, the left diagonal represents the values correctly determined by the CNN model
and the right diagonal the incorrectly determined ones.

Events Precision (percent) Recall (percent) F1-score (percent)


Malware 82 97 90
Benign 87 50 60

Table 5.1: A table showing performance measure for CNN model on the testing
dataset

We used a total of 3,583 images in the testing dataset (including malware and benign;
this is the unseen dataset) to evaluate our model performance and obtained the
performance measures (precision, recall, and F1-score) shown in the table above.
For malware (positive events), we got a precision of 82 percent, which means that
out of all samples predicted positive, 82 percent are actually positive. We also got a
recall of 97 percent, which means that out of all actual positive samples, 97 percent
are correctly identified, so the model is very good at catching malware. The resulting
F1-score of 90 percent, which represents the balance between precision and recall, is
close to 100 percent, which means that the CNN model performs well in determining
malware (positive events).
For benign (negative events), we got a precision of 87 percent, which means that out
of all samples predicted negative, 87 percent are actually negative. We also got a
recall of 50 percent, which means that out of all actual negative samples, 50 percent
are correctly identified. The resulting F1-score of 60 percent, which represents the
balance between precision and recall, shows that the CNN model performs moderately
well in determining benign (negative events) as well.
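As a quick check, the malware row of Table 5.1 follows directly from the confusion matrix in Figure 5.3 and the formulas in Section 5.1:

    precision (malware) = 2,418 / (2,418 + 537) ≈ 0.82
    recall (malware)    = 2,418 / (2,418 + 82)  ≈ 0.97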

5.2.2 Result analysis for GRU model

Figure 5.4: Graph showing GRU model result on training and validation dataset

From the above graph, from epoch 14 onwards the validation loss decreases while
the validation accuracy increases, and the model also does reasonably well over the
first half of the epochs. Therefore, it can be clearly observed that the GRU model
is neither overfitting nor underfitting either: the model performs as well on the
validation dataset as it does on the training dataset, and the learning process
improves with each epoch.
The average validation accuracy for the GRU model is 79 percent.

Figure 5.5: Confusion Matrix of GRU model

In the above diagram, the confusion matrix of GRU is shown. As we have already
mentioned, we are classifying malware as positive events (1) and benign as negative
events (0). The rows represent actual values and the columns represent predicted
values.
Here, 2,248 malware samples are correctly predicted by the model, so this is the true
positive (TP) count, and 460 benign samples are correctly predicted, so this is the
true negative (TN) count. Also, 623 samples are actually benign (negative events)
but the model predicted them as malware (positive events), so this is the false
positive (FP) count, and 252 samples are malware (positive events) but the model
predicted them as benign (negative events), so this is the false negative (FN) count.
As before, the left diagonal represents the values correctly determined by the GRU
model and the right diagonal the incorrectly determined ones.

Events Precision (percent) Recall (percent) F1-score (percent)


Malware 78 90 84
Benign 65 42 51

Table 5.2: A table showing performance measure for GRU model on the testing
dataset

As mentioned earlier, the same 3,583 images in the testing dataset (including malware
and benign; the unseen dataset) were also used to evaluate our GRU model
performance, and we got the performance measures (precision, recall, and F1-score)
shown in the table above.
For malware (positive events), we got a precision of 78 percent, which means that
out of all samples predicted positive, 78 percent are actually positive. We also got a
recall of 90 percent, which means that out of all actual positive samples, 90 percent
are correctly identified, which is quite good for the GRU model. The resulting
F1-score of 84 percent, which represents the balance between precision and recall,
shows that the GRU model performs fairly well in determining malware (positive
events).
For benign (negative events), we got a precision of 65 percent, which means that out
of all samples predicted negative, 65 percent are actually negative. We also got a
recall of 42 percent, which means that out of all actual negative samples, 42 percent
are correctly identified. The resulting F1-score of 51 percent, which represents the
balance between precision and recall, shows that the GRU model performs only
moderately in determining benign (negative events).

5.2.3 Result analysis for LSTM model

Figure 5.6: Graph showing LSTM model result on training and validation dataset

From the above graph, from epoch 12 onwards the validation loss decreases while
the validation accuracy improves. Therefore, it can be clearly seen that the LSTM
model, like the other models, is neither overfitting nor underfitting: the model
performs as well on the validation dataset as it does on the training dataset, and
the learning process improves with each epoch.
The average validation accuracy for the LSTM model is 74 percent.

Figure 5.7: Confusion Matrix of LSTM model

In the above diagram, the confusion matrix of LSTM is shown. As we have already
mentioned, we are classifying malware as positive events (1) and benign as negative
events (0). The rows represent actual values and the columns represent predicted
values.
Here, 2,197 malware samples are correctly predicted by the model, so this is the true
positive (TP) count, and 122 benign samples are correctly predicted, so this is the
true negative (TN) count. Both the true positive and true negative counts are the
lowest among the three models. Also, 961 samples are actually benign (negative
events) but the model predicted them as malware (positive events), so this is the
false positive (FP) count, and 303 samples are malware (positive events) but the
model predicted them as benign (negative events), so this is the false negative (FN)
count. Besides, the false positive and false negative counts are much higher than in
the other two models. To sum up, the left diagonal represents the values correctly
determined by the LSTM model and the right diagonal the incorrectly determined
ones.

Events Precision (percent) Recall (percent) F1-score (percent)


Malware 70 88 78
Benign 29 11 16

Table 5.3: A table showing performance measure for LSTM model on the testing
dataset

The same testing dataset was also used to evaluate our LSTM model performance,
and we got the performance measures (precision, recall, and F1-score) shown in the
table above.
For malware (positive events), we got a precision of 70 percent, which means that
out of all samples predicted positive, 70 percent are actually positive. We also got a
recall of 88 percent, which means that out of all actual positive samples, 88 percent
are correctly identified, which is not that bad for the LSTM model. The resulting
F1-score of 78 percent, which represents the balance between precision and recall,
shows that the LSTM model does an average job of determining malware (positive
events). However, all the measures for positive events are lower for the LSTM model
than for the other two models.
For benign (negative events), we got a precision of 29 percent, which means that out
of all samples predicted negative, only 29 percent are actually negative. We also got
a recall of 11 percent, which means that out of all actual negative samples, only 11
percent are correctly identified, which is far lower than for the other two models.
The resulting F1-score of 16 percent, which represents the balance between precision
and recall, is very low, which means that the LSTM performs worst in determining
benign (negative events).

5.3 Accuracy analysis for all three models

Figure 5.8: Histogram showing accuracy percentage of all three models

First of all, the CNN model gives an accuracy of 83 percent on the testing dataset,
which means the model correctly classifies the true positive (malware) and true
negative (benign) events 83 percent of the time, the best result among all the models.
This is because CNN performs better at recognizing images: its convolutional layers
contain several filters that scan the full feature matrix and reduce its spatial size,
which makes CNN a very convenient and appropriate network for classifying and
manipulating images.
Secondly, the GRU model gives an accuracy of 76 percent on the testing dataset,
which means the model correctly classifies the true positive (malware) and true
negative (benign) events 76 percent of the time, performing better than the LSTM
model as shown in the histogram above, since the GRU trains faster and performs
better than the LSTM because it involves fewer operations. Lastly, the LSTM model
gives an accuracy of 65 percent on the testing dataset, which means the model
correctly classifies the true positive (malware) and true negative (benign) events
65 percent of the time, the lowest among all the models, since the LSTM is hard
to train because it involves memory-bandwidth-bound computation, which limits the
efficiency of its implementations.
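These testing accuracies are consistent with the confusion matrices reported earlier; as a quick check using the accuracy formula from Section 5.1:

    CNN:  (2,418 + 546) / 3,583 ≈ 0.83
    GRU:  (2,248 + 460) / 3,583 ≈ 0.76
    LSTM: (2,197 + 122) / 3,583 ≈ 0.65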
To sum up, CNN performs best among all the models at detecting malware and also
shows the best performance measures (precision, recall, and F1-score) for both
positive and negative events, which again supports the view that CNN is best suited
for image-based classification problems. However, we are not getting close to 100
percent accuracy in our research due to the limitations of the dataset, and also
because we are using only one kind of feature, the static feature, as our dataset
contains static analysis data.

Chapter 6

Conclusion and Future Plan

In the current world, the amount of malware is growing each year and new forms
of threats are more disruptive and complicated than ever. Hackers continue to
accelerate the advancement of malware development by implementing methods such
as mutation at a startling pace. Evidently, automated detection using highly
accurate models could be the only option in the future to fix this problem. Therefore,
we proposed three kinds of neural network models for malware detection and
compared their performance measures as shown in the result analysis section. We
achieved good results with all the models, and none of the models behaved
abnormally, as there was no underfitting or overfitting at all. The Convolutional
Neural Network (CNN) provided the best results among the models, since CNN
performs better on image classification tasks. As a result, we hope that our models
can contribute to addressing the cyber vulnerabilities of Bangladesh. However, we
want to do further research on this topic: our future plan is to combine the neural
network models, for example concatenating CNN with LSTM or GRU, and at the
same time to apply various cross-validation techniques to push the accuracy results
higher than they are now.

Bibliography

[1] S. Haykin, Neural Networks and Learning Machines, ser. Neural networks and
learning machines v. 10. Prentice Hall, 2009, p. 794, isbn: 9780131471399.
[Online]. Available: https://books.google.com.bd/books?id=K7P36lKzI_QC.
[2] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.
Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D.
Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine
learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–
2830, 2011.
[3] L. E. Gonzalez and R. A. Vazquez, “Malware classification using euclidean
distance and artificial neural networks,” in 2013 12th Mexican International
Conference on Artificial Intelligence, IEEE, 2013, pp. 103–108.
[4] F. Chollet et al., Keras, https://keras.io, 2015.
[5] J. Saxe and K. Berlin, “Deep neural network based malware detection using
two dimensional binary program features,” in 2015 10th International Confer-
ence on Malicious and Unwanted Software (MALWARE), IEEE, 2015, pp. 11–
20.
[6] M. Oulehla, Z. K. Oplatková, and D. Malanik, “Detection of mobile botnets
using neural networks,” in 2016 Future Technologies Conference (FTC), IEEE,
2016, pp. 1324–1326.
[7] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, “Malware
detection with deep neural network using process behavior,” in 2016 IEEE
40th Annual Computer Software and Applications Conference (COMPSAC),
IEEE, vol. 2, 2016, pp. 577–582.
[8] R. Dey and F. M. Salemt, “Gate-variants of gated recurrent unit (gru) neural
networks,” in 2017 IEEE 60th international midwest symposium on circuits
and systems (MWSCAS), IEEE, 2017, pp. 1597–1600.
[9] W. Wang, M. Zhu, X. Zeng, X. Ye, and Y. Sheng, “Malware traffic classifica-
tion using convolutional neural network for representation learning,” in 2017
International Conference on Information Networking (ICOIN), IEEE, 2017,
pp. 712–717.
[10] M. Abdelsalam, R. Krishnan, Y. Huang, and R. Sandhu, “Malware detection
in cloud infrastructures using convolutional neural networks,” in 2018 IEEE
11th International Conference on Cloud Computing (CLOUD), IEEE, 2018,
pp. 162–169.

[11] B. Alsulami and S. Mancoridis, “Behavioral malware classification using con-
volutional recurrent neural networks,” in 2018 13th International Conference
on Malicious and Unwanted Software (MALWARE), IEEE, 2018, pp. 103–111.
[12] C. Hasegawa and H. Iyatomi, “One-dimensional convolutional neural networks
for android malware detection,” in 2018 IEEE 14th International Colloquium
on Signal Processing & Its Applications (CSPA), IEEE, 2018, pp. 99–102.
[13] T. Hsien-De Huang and H.-Y. Kao, “R2-d2: Color-inspired convolutional neu-
ral network (cnn)-based android malware detections,” in 2018 IEEE Interna-
tional Conference on Big Data (Big Data), IEEE, 2018, pp. 2633–2642.
[14] Y. Jin, T. Liu, A. He, Y. Qu, and J. Chi, “Android malware detector exploiting
convolutional neural network and adaptive classifier selection,” in 2018 IEEE
42nd Annual Computer Software and Applications Conference (COMPSAC),
IEEE, vol. 1, 2018, pp. 833–834.
[15] C. H. Kim, E. K. Kabanga, and S.-J. Kang, “Classifying malware using con-
volutional gated neural network,” in 2018 20th International Conference on
Advanced Communication Technology (ICACT), IEEE, 2018, pp. 40–44.
[16] D. Li, Z. Wang, and Y. Xue, “Fine-grained android malware detection based
on deep learning,” in 2018 IEEE Conference on Communications and Network
Security (CNS), IEEE, 2018, pp. 1–2.
[17] M. Yeo, Y. Koo, Y. Yoon, T. Hwang, J. Ryu, J. Song, and C. Park, “Flow-
based malware detection using convolutional neural network,” in 2018 Interna-
tional Conference on Information Networking (ICOIN), IEEE, 2018, pp. 910–
913.
[18] L. Zhao, D. Li, G. Zheng, and W. Shi, “Deep neural network based on android
mobile malware detection system using opcode sequences,” in 2018 IEEE 18th
International Conference on Communication Technology (ICCT), IEEE, 2018,
pp. 1141–1147.
[19] D. Chamou, P. Toupas, E. Ketzaki, S. Papadopoulos, K. M. Giannoutakis, A.
Drosou, and D. Tzovaras, “Intrusion detection system based on network traffic
using deep neural networks,” in 2019 IEEE 24th International Workshop on
Computer Aided Modeling and Design of Communication Links and Networks
(CAMAD), IEEE, 2019, pp. 1–6.
[20] A. Oliveira, Malware analysis datasets: Raw pe as image, 2019. doi: 10.21227/8brp-j220.
[Online]. Available: https://dx.doi.org/10.21227/8brp-j220.
[21] N. P. Poonguzhali, T. Rajakamalam, S. Uma, and R. Manju, “Identification
of malware using cnn and bio-inspired technique,” in 2019 IEEE International
Conference on System, Computation, Automation and Networking (ICSCAN),
IEEE, 2019, pp. 1–5.
[22] S. Shukla, G. Kolhe, S. M. PD, and S. Rafatirad, “Rnn-based classifier to
detect stealthy malware using localized features and complex symbolic se-
quence,” in 2019 18th IEEE International Conference On Machine Learning
And Applications (ICMLA), IEEE, 2019, pp. 406–409.
[23] F. Chollet et al., Keras, https://keras.io/guides/sequential_model/, 2020.

[24] B. Zhang, W. Xiao, X. Xiao, A. K. Sangaiah, W. Zhang, and J. Zhang, “Ran-
somware classification using patch-based cnn and self-attention network on em-
bedded n-grams of opcodes,” Future Generation Computer Systems, vol. 110,
pp. 708–720, 2020.
[25] K. Bolton, A quick introduction to artificial neural networks (part 2),
"http://krisbolton.com/a-quick-introduction-to-artificial-neural-networks-part-2", Accessed: June 5, 2018.
[26] J. Browlee, Machine learning mastery: A gentle introduction to cross-entropy for machine learning,
"https://machinelearningmastery.com/cross-entropy-for-machine-learning/", Accessed: October 21, 2019.
[27] J. Browlee, Machine learning mastery: A gentle introduction to the rectified linear unit (relu),
"https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/", Accessed: January 9, 2019.
[28] J. Browlee, Machine learning mastery: Gentle introduction to the adam optimization algorithm for deep learning,
"https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/", Accessed: July 3, 2017.
[29] Common vulnerabilities in cyber space of bangladesh,
"https://www.cirt.gov.bd/common-vulnerabilities-in-cyber-space-of-bangladesh".
[30] Deepai: Accuracy (error rate),
"https://deepai.org/machine-learning-glossary-and-terms/accuracy-error-rate".
[31] Deepai: Feed forward neural network,
"https://deepai.org/machine-learning-glossary-and-terms/feed-forward-neural-network".
[32] Deeplizard: Machine learning deep learning fundamentals,
"https://deeplizard.com/learn/video/U4WB9p6ODjM".
[33] Geeksforgeeks: Gated recurrent unit networks,
"https://www.geeksforgeeks.org/gated-recurrent-unit-networks", Accessed: July 14, 2019.
[34] I2tutorials: Long short-term memory: From zero to hero with pytorch,
"https://www.i2tutorials.com/long-short-term-memory-from-zero-to-hero-with-pytorch/", Accessed: June 20, 2019.
[35] S. Kostadinov, Towards data science: Understanding gru networks,
"https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be", Accessed: December 16, 2017.
[36] A. Mersch and E. Nealis, 6 common types of malware,
"https://blog.totalprosource.com/5-common-malware-types", Accessed: August 17, 2020.
[37] M. Phi, Towards data science: Illustrated guide to recurrent neural networks,
"https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9", Accessed: September 20, 2018.
[38] Prabhu, Understanding of convolutional neural network (cnn) deep learning,
"https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148", Accessed: March 4, 2018.
[39] Python machine learning tutorial,
"https://www.python-course.eu/metrics.php".
[40] S. Sharma, Towards data science: Epoch vs batch size vs iterations,
"https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9", Accessed: September 23, 2017.
[41] The daily swig: Cybersecurity and views,
"https://portswigger.net/daily-swig".
[42] S. Vasudevan, Gru explained (gated recurrent unit),
"https://www.youtube.com/watch?v=xLKSMaYp2oQ", Accessed: May 3, 2020.

