Malware - Detection - Using - Neural - Networks (Main Paper)
by
1. The thesis submitted is my/our own original work while completing degree at
Brac University.
3. The thesis does not contain material which has been accepted, or submitted,
for any other degree or diploma at a university or other institution.
Approval
The thesis titled “Malware Detection Using Neural Networks” submitted by
Of Fall, 2020 has been accepted as satisfactory in partial fulfillment of the require-
ment for the degree of B.Sc. in Computer Science and Engineering on January 11,
2021.
Examining Committee:
Supervisor:
(Member)
Moin Mostakim
Lecturer
Department of Computer Science and Engineering
Brac University
Co-Supervisor:
(Member)
Thesis Coordinator:
(Member)
Head of Department:
(Chair)
Abstract
One of the major issues facing the Internet today is the large amount of data and files that need to be analyzed for possible malicious intent. Malicious software, also referred to as malware, is often polymorphic or metamorphic in design: it can modify its own code as it spreads. The growing volume of malware and increasingly sophisticated cyber attacks are becoming a serious issue. Unknown malware that has not been identified by security vendors is often used in these attacks, making it difficult to protect terminals from infection. A great deal of research is currently being performed to identify and monitor malware. With the rise of deep learning, several researchers have tried to detect malware using neural networks and deep learning methods. This paper compares the performance of three different neural network models for malware detection: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) network, and Gated Recurrent Unit (GRU). In addition, we used secondary data to gather information about malware activity.
Dedication
We dedicate this thesis to the researchers before us, whose excellent work clarified our understanding of this topic and provided the documentation we needed to conduct our research.
Acknowledgement
First of all, all gratitude to the Great Allah, by whose grace our report has been completed without any significant interruption.
We acknowledge the research papers that we studied, which are listed in the Bibliography section. We also want to thank our thesis supervisor and co-supervisor, who helped us at every step of this thesis research. Lastly, we are grateful to our University, which provided us with everything necessary for our thesis research.
Table of Contents
Declaration
Approval
Abstract
Dedication
Acknowledgment
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Objectives and Contributions
1.4 Thesis Structure
2 Background
2.1 Literature Review
2.2 Neural Network Models
2.2.1 Convolutional Neural Network (CNN)
2.2.2 Long short-term memory (LSTM)
2.2.3 Gated Recurrent Unit (GRU)
2.3 Activation Functions
3 Proposed Model
3.1 Dataset Description
3.2 Model Description
3.2.1 CNN model
3.2.2 LSTM model
3.2.3 GRU model
4 Experimental Setup
4.1 Basic libraries and functions needed for model setup
4.2 Model setup
5 Result Analysis
5.1 Basic terminologies needed for result analysis
5.2 Model Results
5.2.1 Result analysis for CNN model
5.2.2 Result analysis for GRU model
5.2.3 Result analysis for LSTM model
5.3 Accuracy analysis for all three models
Bibliography
List of Figures
4.1 Brief working process of how CNN model works with training, validation and testing data
4.2 Brief working process of how LSTM model works with training, validation and testing data
4.3 Brief working process of how GRU model works with training, validation and testing data
List of Tables
5.1 A table showing performance measure for CNN model on the testing dataset
5.2 A table showing performance measure for GRU model on the testing dataset
5.3 A table showing performance measure for LSTM model on the testing dataset
Chapter 1
Introduction
1.1 Motivation
One of the biggest threats to security in today's technology industry is malware. Malware can steal sensitive data from a computer, and attackers take advantage of it to steal information from users. The number of malicious applications is growing rapidly, malware is becoming increasingly difficult to identify, and it is now also found on smartphones. Besides, the open nature of the Android platform makes it a breeding ground for malware development, which can be used for cyber attacks. Moreover, defense against cyber-attacks is of considerable significance for small and medium-sized companies and a demanding area, since attacks impact them financially and technically. The sheer amount of malware on the Internet is also a major threat to user safety. There is therefore an urgent need to enhance malware detection approaches and protect users from security threats.
Worms: Worms spread through networks by exploiting vulnerabilities in the operating system and are one of the most familiar forms of malware. Their payloads can delete files on the host system, encrypt data, gain illegal access, destroy data, and construct botnets.
Trojan Horse: As soon as you install a Trojan, you give cybercriminals access to your machine. Trojan malware does not reproduce on its own, but when combined with a worm, the harm Trojans do to hosts and services can be permanent.
Spyware: Downloaded and installed without your knowledge, spyware is designed to track your browsing and other activities on the World Wide Web.
Ransomware: This is a kind of malware that takes hold of your data and demands money for it to be handed back to you.
The main way to stay safe from, or remove, a malware infection is to use anti-malware or anti-virus software.
There is virtually no limit to the channels through which malware can reach your system, but once inside your machine, it spreads automatically and often disrupts internet traffic.
we decided to do some further works with our research which is described in the
conclusion part accordingly.
Chapter 2
Background
using neural networks for the effective identification of mobile botnets. In addition, the study stated that the current architecture of the Android operating system does not allow antivirus programs to obtain the information required to detect malware, including botnets, on non-rooted mobile devices. However, the paper also described its work as an early phase of research with respect to optimizing the parameters and structure used [6]. Another similar paper by Jin, Liu, Qu and Chi proposed a new way to detect Android malware: a machine learning classifier built on a CNN with Adaptive Selection of Classifiers (ASC) to improve the accuracy of malware categorization. It also performs a quick static analysis that consumes few resources. In terms of accuracy, their suggested approach is 4.27 percent higher than the traditional CNN. The paper plans to fuse these two approaches in future work [14]. Besides, the research paper by Hasegawa and Iyatomi established an accurate and lightweight Android malware detection process based on a one-dimensional Convolutional Neural Network (1-D CNN), which processes only a very small part of the raw APK file. The main benefit of the 1-D CNN is that it is compatible with many established and successful techniques for improving CNN training, and that the CNN acquires the key features for classification during learning. The authors could not confirm why it was unable to detect other features, because the APK file is compressed and interpreting its strings is impractical; they left this for future study [12]. Another research paper by Zhao, Li, Zheng and Shi focused on malware detection on Android smartphones. The number of malicious applications is growing rapidly, making malware increasingly difficult to detect. Their paper presented a new malware detection technique based on deep neural networks that uses an optimized Convolutional Neural Network to learn from opcode sequences. The network is trained several times on the raw opcode sequences retrieved from the decompiled Android file. The experiments achieved an accuracy of 99 percent, which is 2 to 11 percent higher than other machine learning algorithms on the same dataset [18]. In addition, a research article by Li, Wang and Xue focused on the privacy and security issues of Android smartphones. Due to its flexible nature, the Android platform is prone to malware. Users can install third-party applications, and many uncertified and unauthorized apps may threaten user privacy. The article used a deep learning technique to recognize Android malware and built an engine that can detect a large number of malicious files. The results show that the engine detected 97 percent of the malware at a false positive rate of 0.1 percent [16]. These papers, however, studied only Android malware using different techniques, and all of them planned to extend their research with more precise and more extensive parameters in the near future.
Another research paper, by Chamou et al., addresses the protection of small and medium-sized enterprises from cyberattacks, a critical and difficult field because such attacks affect them both economically and conceptually. The scientific community has therefore shown an interest in implementing and optimizing the performance of intrusion detection systems. With the help of a deep learning method, the authors built a system that can detect malicious behavior in the context of DDoS and malware cyber-threats. The paper uses flow-based statistical data to implement an anomaly-based intrusion prevention method and reap the benefits of a deep learning model. The results showed an accuracy of 99.97 percent for DDoS detection and 99.44 percent for malware detection, with a low false positive rate (FPR) when distinguishing malicious from regular network traffic [19].
A research article by Tobiyama, Yamaguchi, Shimada, Ikuse and Yagi focused on post-infection countermeasures. Because malware traffic closely resembles benign traffic, they proposed detecting malware processes on the infected terminals by training an RNN to extract features and a CNN to classify the resulting feature images. The RNN captures the characteristics of each process's behavior from recorded API call sequences, serving as a feature extractor trained as an LSTM language model. They validated the classifier using 150 log files of process behavior, compared the validation results obtained under several conditions, and obtained the best result, AUC = 0.96, when the feature image was 30x30 [7].
Besides, the paper by Wang, Zhu, Zeng and Sheng introduced a new taxonomy of traffic classification from an artificial intelligence point of view and implemented a traffic classification system using a CNN that treats traffic data as images. Raw traffic was used as input, and the work was a first attempt to apply representation learning to this task. The port-based classification method and the DPI-based method classify traffic by matching a set of specific features. The paper argues that the machine learning approach is better than a rule-based approach because it solves many of these problems, and it also shows the efficacy of representation learning [9].
The research paper by Saxe and Berlin presented a deep neural network system created by Invincea Labs that attains a useful detection rate at an extremely low false positive rate and scales to real-world training volumes, achieving a 95 percent detection rate at a 0.1 percent FPR on more than 400,000 software binaries collected from their customers and internal malware databases. They also described a non-parametric method for adjusting classifier scores so that they better reflect the expected accuracy in the deployment environment. According to the paper, the result is a deployable deep-network malware detector using static features that has the best prediction performance of any previously published detection engine [5].
The paper by Alsulami and Mancoridis observed that the growing volume of malware data, the wide range of malware families, and the broad array of labelling strategies applied to malware by anti-virus vendors pose problems for malware classification models. They present a behavioural classifier that uses a Convolutional Recurrent Neural Network and data from Microsoft Windows Prefetch files. Their model also minimizes training time and overhead and, as a result, can continually learn the behavior of new malware families [11].
Another research paper by Poonguzhali, Rajakamalam, Uma and Manju noted that, with the growth of the Internet, malicious code attacks have increased significantly, with malicious code variants being a primary threat to Internet security. To guard against security breaches, data theft, and other hazards, it is essential that malicious code variants can be detected. The paper offers a method for detecting malware variants using deep learning with a CNN. Static detection and dynamic detection are the two traditional methods used in the paper. Static detection disassembles the malware code and analyzes it, while dynamic detection runs the code in a safe virtual environment, or sandbox, to analyze its behavior. Once the features are extracted, the samples are categorized according to their variants [21].
The paper by Gonzalez and Vazquez focuses on a malware feature vector in which each feature is the number of API functions called from a given Dynamic Link Library (DLL). Existing API-based categorization techniques depend on the sequence of function calls or on the frequency of certain API calls to identify malicious program behavior; this paper instead analyzes how well malware can be categorized by the number of API calls per dynamic link library. Examining the data obtained from the samples made it possible to observe patterns in the imported functions and the Dynamic Link Libraries. The benefit of this approach lies in the simplicity of the feature vector, given that the function counting method can be written in any programming language [3].
In addition, the article by Zhang et al. noted that ransomware is a specific kind of malware that leads to unrecoverable data loss and causes immense loss of information and economic cost. Some ransomware, such as fingerprinting ransomware, can fingerprint the run-time environment and evade dynamic analysis. To detect this kind of ransomware quickly, they recommend a method based on static analysis rather than dynamic analysis. N-gram opcodes are used for deep learning, since the opcode sequences obtained from executable files carry rich semantic and procedural information; the authors treat the opcode sequence like a natural language sentence. They divide the N-gram sequence into several patches and feed each patch to a self-attention based convolutional neural network called SA-CNN. To the best of their knowledge, they are the first to apply a self-attention mechanism to opcode sequences for ransomware classification. With its partition strategy and self-attention network, the system captures rich semantic information from extremely long sequences in order to detect ransomware. Even after removal, the impact of ransomware is irreversible without the assistance of the ransomware authors. Such widespread attacks result in enormous financial losses and adverse effects on the operations of the affected company [24].
Another paper by Abdelsalam, Krishnan, Huang and Sandhu focuses on the detection of malware in cloud infrastructure using a CNN. Over the years, cloud infrastructure has become more susceptible to malware attacks. The attacker usually injects malware to manipulate the victim's virtual machine. Within the data center, malware can spread quickly and cause massive crises for cloud service providers and their customers. As a result, detecting malware on virtual machines is very important. The paper introduces an effective malware detection strategy for cloud infrastructure using the Convolutional Neural Network (CNN). First, a standard 2D CNN is trained on the metadata available for each process in a virtual machine (VM), collected through a hypervisor. Then a new 3D CNN is used to improve the accuracy of the classifier, which noticeably helps to reduce mislabeled samples during data collection and training. The malware used in the experiments was selected randomly. Their results showed that the 2D CNN model has an accuracy of 79 percent and that the 3D CNN model improves the accuracy significantly, to 90 percent [10].
Finally, the paper by Hsien-De Huang and Kao observed that common approaches for detecting Android malware need ongoing learning from pre-extracted features to maintain proper malware detectability. They proposed a color-inspired CNN-based Android malware detection model (R2-D2) that removes the burden of feature engineering, since it does not require extracting pre-selected features. Their research constructs an end-to-end learning-based Android malware detection method built on this color-inspired CNN [13].
• ReLU Layer
• Pooling Layer
If x < 0, f(x)=0 and,
if x >= 0, f(x)=x
Pooling Layer: This layer shrinks the size of the image. A window of a specified size is moved across the whole matrix obtained from the ReLU layer according to the stride size, and the maximum pixel value inside the window at each position is taken out and placed in a new matrix. The result is a new, smaller matrix made up of these maximum values. The layer does the same for all the matrices obtained from the different filters.
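The ReLU and max-pooling steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the thesis implementation: the function names, the 2x2 window, the stride of 2 and the sample matrix are all assumptions chosen for the example.

```python
import numpy as np

def relu(feature_map):
    """Element-wise ReLU: negative values become 0, others pass through."""
    return np.maximum(feature_map, 0)

def max_pool2d(feature_map, window=2, stride=2):
    """Slide a window over a 2D matrix and keep the maximum at each position."""
    rows, cols = feature_map.shape
    out_rows = (rows - window) // stride + 1
    out_cols = (cols - window) // stride + 1
    pooled = np.empty((out_rows, out_cols), dtype=feature_map.dtype)
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r + window, c:c + window].max()
    return pooled

# Hypothetical 4x4 feature map produced by one convolution filter.
feature_map = np.array([[ 1, -2,  3,  0],
                        [ 4,  5, -6,  2],
                        [-1,  0,  7,  8],
                        [ 2, -3,  1, -4]])
pooled = max_pool2d(relu(feature_map))   # 4x4 input shrinks to 2x2
```

With a 2x2 window and a stride of 2, each output value summarizes one non-overlapping block of the input, so the matrix shrinks by half in each dimension.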
Fully Connected Layer: This is the final layer, where the actual classification occurs. First, we flatten the matrices into a vector and feed it into a fully connected layer, just as in an ordinary neural network. To be more precise, we take the filtered and shrunken images and put them into a single list. This vector contains the maximum values pulled out of all the matrices obtained for all the filters. The fully connected layers therefore combine the features of an image to build the model. For images of different objects, certain values in the vector are high, and the CNN model learns which ones during training. Finally, the layer applies an activation function such as Softmax or Sigmoid for classification.
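As a rough sketch of the flatten-then-classify step, the snippet below concatenates the pooled matrices into one vector, applies a single dense unit and squashes the result with a sigmoid. The pooled values, weights and bias are made-up numbers for illustration; a trained model would learn the weights.

```python
import numpy as np

def fully_connected_sigmoid(pooled_maps, weights, bias):
    """Flatten all pooled matrices into one vector, then apply a dense unit and a sigmoid."""
    vector = np.concatenate([m.ravel() for m in pooled_maps])  # single flat list of features
    z = float(vector @ weights + bias)                         # weighted sum of all features
    return 1.0 / (1.0 + np.exp(-z))                            # score between 0 and 1

# Hypothetical pooled outputs of two filters (2x2 each) and illustrative weights.
pooled_maps = [np.array([[5.0, 3.0], [2.0, 8.0]]),
               np.array([[0.0, 1.0], [4.0, 0.0]])]
weights = np.array([0.1, -0.2, 0.05, 0.1, 0.3, -0.1, -0.15, 0.2])
bias = -0.5
score = fully_connected_sigmoid(pooled_maps, weights, bias)
```

A score above a chosen threshold (0.5 is a common choice) would be read as one class and a score below it as the other.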
To sum up, in deep learning, to train and evaluate CNN models, each input image passes through a sequence of convolution layers with filters, pooling layers and fully connected (FC) layers, after which a Softmax or Sigmoid activation function is applied to identify an object with probabilistic values between 0 and 1 [38].
2.2.2 Long short-term memory (LSTM)
Long short-term memory (LSTM) is one of the most interesting breakthroughs in the field of deep learning and data science. LSTM is an improved version of the general recurrent neural network (RNN). Unlike a basic RNN, an LSTM model is capable of learning long-range dependencies in sequence data. This ability is very helpful in various problem areas such as language translation, image captioning, and predicting the text that follows a given text. LSTMs are still used to achieve world-class results today. An LSTM unit contains a cell state and three different gates. The cell state keeps track of data over intervals of time, and the gates control the transfer of information to and from the cell. LSTMs differ from other deep learning methods, such as multilayer perceptrons (MLPs) and convolutional neural networks (CNNs), in that they are designed primarily for sequence prediction problems. Among recurrent neural networks, LSTM is a better way to deal with the vanishing gradient problem.
To learn more about LSTM, we need to be familiar with the basic concept of the feed-forward neural network model and understand why it is unsuitable for sequence processing. For example, take a feed-forward neural network that is being trained for image classification. If we feed it an input x1, the network returns an output f(x1), and if we feed it another input x2, the network returns f(x2). The previous output is not used and has no relation to the new output. This puts feed-forward networks at a significant disadvantage for problems such as text generation, text-to-text conversion and sequence forecasting.
To solve this problem, the Recurrent Neural Network (RNN) model was proposed. It has many nodes whose connections form a directed graph along a temporal sequence, which allows the network to exhibit dynamic behavior over time. Derived from feed-forward neural networks, the recurrent neural network uses its own internal memory to process sequences of inputs, which makes it applicable to the tasks discussed above, and it has one or more hidden layers [1]. In an RNN with two hidden layers, for example, the first hidden layer, the second hidden layer and the output layer each have feedback. The RNN described here combines a multilayer perceptron with a single hidden layer and a state-space model, resulting in a loop in which the previous hidden layer output is used to generate the later hidden layer outputs. This produces a sequence of data that can be fed to a feed-forward neural network for classification.
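The recurrence can be illustrated with a minimal single-layer RNN in NumPy. The sizes, the random weights and the tanh activation are assumptions made for this sketch; the point is only that each hidden state is computed from the current input together with the previous hidden state.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b, h0):
    """Run a single-layer tanh RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x_t in inputs:
        # The new state depends on the current input AND the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)            # fixed seed, illustrative weights only
input_size, hidden_size, steps = 3, 4, 5
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
sequence = rng.normal(size=(steps, input_size))

states = rnn_forward(sequence, W_x, W_h, b, h0=np.zeros(hidden_size))

# Change only the first time step; the last input stays identical.
altered = sequence.copy()
altered[0] += 1.0
altered_states = rnn_forward(altered, W_x, W_h, b, h0=np.zeros(hidden_size))
```

Although the final input is the same in both runs, the final hidden states differ, because the state carries information forward from earlier time steps. A feed-forward network applied to the last input alone would return the same output in both cases.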
However, the generic RNN structure has a disadvantage: because of the vanishing gradient problem, the model suffers from what is best known as “short-term memory”. The vanishing gradient problem also occurs in other neural network models. As an RNN processes more time steps, it has difficulty retaining information from earlier steps. As a consequence, data from the initial time steps has almost no influence at the final stage. This becomes apparent when the model performs back-propagation. Back-propagation is an algorithm for training and optimizing neural networks. First, the model makes a forward pass that gives a certain output. The output is then compared with the target output by the loss function, which yields an error value that measures how far off the network's latest prediction is. The error value is then used for back-propagation, which computes a gradient for every node in the network. Using this gradient, the model adjusts its internal weights and learns over time. If the gradient is large, the adjustments made to the network are also large, and vice versa. The problem arises because each node's adjustment is calculated with respect to the layer processed immediately before it during back-propagation. If the adjustments made to that layer are small, the adjustments to the current layer are even smaller. This causes the gradients to shrink exponentially as they propagate backwards. Since the weights of the earliest layers are scarcely changed due to the extremely small gradient, those layers do almost no learning; this is the vanishing gradient problem. For this reason, the Recurrent Neural Network model has difficulty learning long-range dependencies in sequences [37]. To combat these problems, the LSTM model was proposed.
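A small numerical experiment makes the exponential shrinkage concrete. The sketch below assumes a scalar RNN, h_t = tanh(w * h_{t-1} + x_t), with a hypothetical recurrent weight w = 0.5 and a constant input; neither value comes from the thesis. By the chain rule, the gradient of the latest state with respect to the initial state is a product of one factor per time step.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of the scalar RNN h_t = tanh(w*h_{t-1} + x_t)."""
    h, grad, history = h0, 1.0, []
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)      # chain rule: local derivative of this step
        history.append(abs(grad))
    return history

# Hypothetical setting: recurrent weight 0.5, constant input 0.1, 30 time steps.
history = gradient_through_time(w=0.5, inputs=[0.1] * 30)
print(f"after  1 step : {history[0]:.3e}")
print(f"after 30 steps: {history[-1]:.3e}")
```

Every factor here is smaller than 0.5, so after 30 steps the gradient is below 0.5 to the power of 30, which is far too small to produce a meaningful weight update for the earliest steps.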
LSTM has the same control flow as a recurrent neural network: it processes data sequentially, passing information along as it propagates forward. The main difference between LSTM and RNN lies in the operations within the cell, which make it easier for the LSTM to keep or forget data. The cell state and the gates are the core concepts of LSTM. The cell state functions as a communication bridge that carries relevant information all the way through the sequence chain, and is also referred to as the memory of the network. The cell state can carry data throughout the sequence processing, even from the initial time steps to the last, thus minimizing the effects of short-term memory. The activation functions used in the LSTM are “sigmoid” and “tanh”, which are among the most popular activation functions in neural network models. The tanh function squashes values into the range (-1, 1), while the sigmoid function squashes values into the range (0, 1), which makes it suitable for gating. Lastly, the distinct gates in the LSTM regulate the flow of data in the layer. These gates are:
• Forget gate
• Input gate
• Output gate
The forget gate determines which information is kept and which is discarded. The current input and the previous hidden state are passed to the sigmoid activation function, which maps all the values to between 0 and 1; the result is later used to update the cell state. The same inputs are also passed to the input gate, where they go through both a sigmoid function and a tanh function. The two outputs of these activation functions are then multiplied, and the product is the output of the input gate. The outputs of the forget gate and the input gate are then used to update the cell state. First, the cell state from the previous time step is multiplied by the forget gate output. Then the output of the input gate is added to it, which gives the new cell state. Lastly, the current input and the previous hidden state are also passed to the output gate, where they are squashed by the sigmoid activation function, while the new cell state is squashed by the tanh function. The two results are multiplied to create the new hidden state, which is passed on to the next time step.
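The gate computations described above can be written compactly as a single time step. The following is a minimal NumPy sketch under stated assumptions: the parameter dictionaries `W` and `b`, the key names `f`, `i`, `g`, `o`, the layer sizes and the random weights are all hypothetical and chosen only to make the example runnable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b hold the parameters of the four internal layers."""
    hx = np.concatenate([h_prev, x_t])     # previous hidden state + current input
    f = sigmoid(W["f"] @ hx + b["f"])      # forget gate: how much old cell state to keep
    i = sigmoid(W["i"] @ hx + b["i"])      # input gate, sigmoid part
    g = np.tanh(W["g"] @ hx + b["g"])      # input gate, tanh part (candidate values)
    o = sigmoid(W["o"] @ hx + b["o"])      # output gate
    c_t = f * c_prev + i * g               # updated cell state
    h_t = o * np.tanh(c_t)                 # new hidden state for the next time step
    return h_t, c_t

rng = np.random.default_rng(1)             # fixed seed, illustrative weights only
input_size, hidden_size = 3, 4
W = {k: rng.normal(scale=0.3, size=(hidden_size, hidden_size + input_size)) for k in "figo"}
b = {k: np.zeros(hidden_size) for k in "figo"}

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):   # a hypothetical 6-step input sequence
    h, c = lstm_step(x_t, h, c, W, b)
```

Note that the cell state is updated only by an element-wise multiplication and an addition, which is what lets information from early time steps survive over long sequences.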
2.2.3 Gated Recurrent Unit (GRU)
Gated Recurrent Unit (GRU) is a type of recurrent neural network that has similarities with LSTM [35]. The GRU model can represent very complicated systems due to its purpose-built structure.
GRU has two gates, a reset gate and an update gate, but it lacks an output gate. These are essentially two vectors that decide which information should be passed to the output. They can be trained to preserve information from earlier time steps. In general, having fewer parameters means that GRUs are easier or faster to train than their LSTM equivalents.
The GRU was proposed to allow each recurrent unit to adaptively capture dependencies of different time scales. It has gating units that modulate the flow of information within the unit, but without separate memory cells. The functions of the two gates of a GRU are:
• Update Gate: It decides how much information from the past is carried forward into the future. In LSTM, it is similar to the Output Gate.
• Reset Gate: It decides how much past information to forget. In LSTM, it is equivalent to the combination of the Input Gate and the Forget Gate [33].
Gated recurrent neural networks have shown success in a variety of applications involving sequential or temporal data. They have been commonly used, for example, in speech recognition and natural language processing. Their success is mainly due to the gating network signals that decide how the current input and previous memory are used to update the current activation and produce the current state. In addition, these gates have their own sets of weights that are adapted throughout the learning process (that is, the training and evaluation process) [8].
In GRU, the gates determine which information is passed on and which is dropped. The values of the gates lie between 0 and 1. If a value is very close to 0, the information is dropped, and if it is close to 1, the information is retained.
Here, the input and the previous hidden state are fed into the GRU together. The reset gate and the update gate then work together to generate the output and the new hidden state.
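A single GRU time step can be sketched in NumPy as follows. The parameter names, sizes and random weights are assumptions for illustration. Formulations in the literature differ on whether the update gate multiplies the old state or the candidate; this sketch assumes the convention in which an update gate value close to 1 retains the previous state, to match the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU time step with an update gate z and a reset gate r."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ hx + b["z"])      # update gate: how much of the past to keep
    r = sigmoid(W["r"] @ hx + b["r"])      # reset gate: how much of the past to forget
    # Candidate state, computed from the input and the reset-scaled previous state.
    candidate = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]) + b["h"])
    return z * h_prev + (1.0 - z) * candidate   # blend of old state and candidate

rng = np.random.default_rng(2)             # fixed seed, illustrative weights only
input_size, hidden_size = 3, 4
W = {k: rng.normal(scale=0.3, size=(hidden_size, hidden_size + input_size)) for k in "zrh"}
b = {k: np.zeros(hidden_size) for k in "zrh"}

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):   # a hypothetical 6-step input sequence
    h = gru_step(x_t, h, W, b)

# Force the update gate towards 1 with a large bias: the previous state is retained.
h_prev = np.array([0.5, -0.2, 0.1, 0.9])
b_keep = dict(b, z=np.full(hidden_size, 50.0))
kept = gru_step(np.ones(input_size), h_prev, W, b_keep)
```

Unlike the LSTM, there is no separate cell state: the hidden state itself is both the memory and the output of the unit.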
2.3 Activation Functions
Activation functions are among the most important components of any neural network. Neural networks are applied to very complicated tasks such as image recognition, language translation and object detection, and without activation functions these tasks would be extremely difficult to perform. Activation functions essentially decide whether a neuron is activated or not in order to produce the required outcome. Without activation functions, the weights and biases would only apply a linear transformation. With activation functions, the network applies non-linear transformations to the data, allowing complex problems such as language translation and image recognition to be solved. Besides, the activation functions are differentiable, which makes back-propagation possible and gives an efficient way to calculate the gradients of the loss function in a neural network. We use two types of activation functions in our three models: Sigmoid and ReLU.
1. Sigmoid: The sigmoid function maps any real input to a value between 0 and
1, so its output can be interpreted as a probability. We therefore use this
activation function when the model has to come to a decision or estimate an
output, as in binary classification. In a model, it introduces non-linearity
and determines how strongly a value is passed on as output. It is also known as
the logistic function.
The equation for the sigmoid activation is:
$$S(x) = \frac{1}{1 + e^{-x}}$$
where x is the input and S(x) is the output.
2. ReLU: The ReLU, also known as the rectified linear activation function, is a
piecewise linear function that outputs the input directly if it is positive and
outputs zero otherwise. For many types of neural networks it has become the
default activation function, because models that use it are easier to train and
often yield better results.
The equation of the ReLU function is:
$$\mathrm{ReLU}(x) = \max(0, x)$$
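The two formulas can be checked numerically with a few lines of plain Python. This is only an illustrative sketch; the function names and the gradient helpers are ours and are included to show why both functions work with backpropagation.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function S(x) = 1 / (1 + e^(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, outputs zero otherwise."""
    return max(0.0, x)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken as 0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 4), relu(x), round(sigmoid_grad(x), 4), relu_grad(x))
```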
Chapter 3
Proposed Model
Figure 3.2: Augmented Benign images
Since the collected dataset contained comparatively few benign samples, we
applied data augmentation to the benign images only. Data augmentation is a
technique that can be used to expand an image-based dataset. In this
augmentation, we generated new benign images using rotation, height shift, width
shift, zoom, and horizontal flip, and for each of these properties we used a
fixed range. Some of the augmented benign images are shown above in Figure 3.2.
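To make the idea concrete, the sketch below implements two of the listed transformations (horizontal flip and height/width shift) directly in NumPy. It is a simplified stand-in and not the augmentation pipeline used in this work: the shift range of 5 pixels, the zero fill, and the random test image are assumptions, and rotation and zoom are omitted.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def shift(img, dy, dx):
    """Shift by (dy, dx) pixels, filling the vacated border with zeros."""
    h, w = img.shape
    out = np.zeros_like(img)
    # Part of the source image that remains visible after the shift
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def augment(img, rng, max_shift=5):
    """Randomly flip and shift one image (max_shift is a hypothetical range)."""
    if rng.random() < 0.5:
        img = horizontal_flip(img)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return shift(img, int(dy), int(dx))

rng = np.random.default_rng(0)
benign = rng.integers(0, 256, size=(50, 50))           # stand-in benign image
augmented = [augment(benign, rng) for _ in range(4)]   # four new variants
print(len(augmented), augmented[0].shape)
```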
We created two directories, “malware” and “benign”, and placed the images into
the respective directories.
The images we had were not all of the same size. To normalize the input, all
images need to have the same shape, so we resized all the images in both
directories to 50x50. We then read the images one by one as 2D arrays, where
each value is a pixel value, and added each array to our training set along with
its correct label (0 for benign, 1 for malware). To ensure proper learning, we
shuffled the training dataset so that malware and goodware samples are well
mixed. To feed the data to our neural network we used two separate lists: the
feature set and the label set. For every image in the training set, we stored
the 2D array in the feature set and the label in the label set. Lastly, we
converted the feature set list into a NumPy array holding images of size 50x50,
because neural networks cannot work with Python lists. We also normalized the
values by dividing each pixel value by 255. For testing, we likewise created two
directories, “malware” and “benign”, and pre-processed the testing dataset in
the same way as the training dataset.
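The pre-processing steps above can be sketched as follows. Reading and resizing the image files is replaced here by random 50x50 arrays, so the image counts and variable names are assumptions for illustration; the trailing channel axis of size 1 is likewise an assumption, chosen to match the four-dimensional CNN input described later.

```python
import random
import numpy as np

IMG_SIZE = 50

# Stand-in data: random grayscale arrays in place of the resized image files
rng = np.random.default_rng(0)
benign_images = [rng.integers(0, 256, size=(IMG_SIZE, IMG_SIZE)) for _ in range(3)]
malware_images = [rng.integers(0, 256, size=(IMG_SIZE, IMG_SIZE)) for _ in range(5)]

# Pair every image with its label: 0 for benign, 1 for malware
training_set = [(img, 0) for img in benign_images]
training_set += [(img, 1) for img in malware_images]

# Shuffle so that both classes are mixed throughout the training data
random.seed(0)
random.shuffle(training_set)

# Split into a feature list and a label list
features = [img for img, _ in training_set]
labels = [label for _, label in training_set]

# Convert to NumPy arrays, add a channel axis, and scale pixels to [0, 1]
X = np.array(features, dtype="float32").reshape(-1, IMG_SIZE, IMG_SIZE, 1) / 255.0
y = np.array(labels)
print(X.shape, y.shape, float(X.min()), float(X.max()))
```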
3.2 Model Description
3.2.1 CNN model
First, we added a convolutional layer with 64 filters, each with a window size
of 3x3. The pixel matrix of each image is passed to the convolutional layer,
where the convolution takes place and the different filters extract various
features of the image. Second, a ReLU layer removes all the negative values from
the matrices produced by the convolution. Third, a pooling layer takes the
maximum value within each window of the matrices produced by the ReLU layer; in
our model, the pool size is 2x2. These three layers (a convolutional layer, a
ReLU layer, and a pooling layer) are then repeated with the same window sizes as
before. Next, we flatten the data before passing it to the fully connected
feed-forward network. Flattening yields a vector that contains the maximum
values selected by the pooling layer, and it is needed because a fully connected
network works only with 1D input. The first fully connected layer has 64 neurons
and uses the ReLU activation function. It is followed by another densely
connected layer that has 128 neurons and also uses the ReLU activation function.
Finally, the output layer has a single neuron and uses the Sigmoid activation
function. We used “adam” as our model optimizer and set “accuracy” as the
metric. We trained the model with a batch size of 64 for 20 epochs. To check the
accuracy on held-out samples, we declared a validation split of 30 percent.
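As a worked example, the sizes of the feature maps and the number of trainable parameters in this architecture can be computed by hand. The sketch below assumes settings that are not stated in the text: a single-channel (grayscale) input, no padding and stride 1 in the convolutions (the Keras defaults), non-overlapping 2x2 pooling, and 64 filters in the second convolutional layer as well.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, pool):
    """Output width of non-overlapping max pooling."""
    return size // pool

size = 50
size = conv_out(size, 3)   # 48 after the first 3x3 convolution
size = pool_out(size, 2)   # 24 after the first 2x2 pooling
size = conv_out(size, 3)   # 22 after the second 3x3 convolution
size = pool_out(size, 2)   # 11 after the second 2x2 pooling
flat = size * size * 64    # length of the flattened vector

params = {
    "conv1": 3 * 3 * 1 * 64 + 64,    # weights + biases, 1 input channel
    "conv2": 3 * 3 * 64 * 64 + 64,   # 64 input channels
    "dense1": flat * 64 + 64,
    "dense2": 64 * 128 + 128,
    "output": 128 * 1 + 1,
}
total = sum(params.values())
print(size, flat, params, total)
```

Under these assumptions almost all parameters sit in the first fully connected layer, because it is connected to every value of the flattened feature maps.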
3.2.2 LSTM model
We used the same data and pre-processing method for all our neural network
models. All the images in both directories were resized to 50x50. We then read
the images one by one as 2D arrays, where each value is a pixel value, and added
them to our training set along with the correct label (0 for benign, 1 for
malware). A convolutional neural network (CNN) takes four-dimensional input,
namely the number of samples, number of rows, number of columns, and number of
channels. LSTM layers, in contrast, take three-dimensional input in STF form,
where S is the number of samples, T is the number of time steps in each
sequence, and F is the number of features at each time step. We therefore
reshaped the input data accordingly and then added an LSTM layer with 64 units.
The 64 units determine the size of the hidden state of this layer and also its
output dimension, since the layer outputs its hidden state at the end of each
sequence. Since we resized our images to 50x50, the LSTM layer receives each
image as a sequence of 50 timesteps with 50 features each. We set the activation
to “ReLU”, which is a very popular activation function in neural networks and
eliminates the negative values obtained from the LSTM layer operation.
Afterward, the output of the LSTM layer is fed into a Dense layer containing 100
neurons with ReLU activation for further feature extraction. Lastly, the output
of the first Dense layer is fed into another Dense layer with a single neuron
and the Sigmoid activation function, which maps the output into the range (0,1).
A value close to 0 is therefore recognized as benign and a value close to 1 is
recognized as malware by the LSTM model. We used “adam” as our model optimizer
and set “accuracy” as the metric, and we trained the model with a batch size of
64 for 20 epochs, just like the CNN model. To check the accuracy on held-out
samples, we declared a validation split of 40 percent.
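The difference between the two input layouts can be shown with a short NumPy sketch. The number of samples is hypothetical, and treating each image row as one time step is our reading of "50 features over 50 timesteps". The parameter count uses the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias).

```python
import numpy as np

n_samples = 4  # hypothetical number of images
images = np.random.default_rng(0).random((n_samples, 50, 50)).astype("float32")

# CNN layout: (samples, rows, columns, channels)
cnn_input = images.reshape(n_samples, 50, 50, 1)

# LSTM layout (S, T, F): (samples, time steps, features per time step)
lstm_input = images.reshape(n_samples, 50, 50)

# At time step t the recurrent layer sees row t as a vector of 50 features
first_sample_steps = [lstm_input[0, t] for t in range(lstm_input.shape[1])]

def lstm_params(n_features, n_units):
    """Trainable parameters of a standard LSTM layer."""
    return 4 * (n_units * (n_features + n_units) + n_units)

print(cnn_input.shape, lstm_input.shape, lstm_params(50, 64))
```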
Figure 3.4: Flowchart for LSTM model
is useful for retaining or deleting information instead of squashing values
between -1 and 1, since any value multiplied by 0 becomes 0, causing it to
vanish and be “discarded.” Any value multiplied by 1 stays the same, so it is
“preserved.” The network learns which information is not necessary and can be
ignored, and which information is vital to retain.
We processed the input data accordingly and then added a GRU layer with 64
units. The 64 units determine the size of the hidden state of this layer and
also its output dimension, as the layer outputs its hidden state at the end of
each sequence. Because we resized our images to 50x50, the GRU layer receives
each image as a sequence of 50 timesteps with 50 features each. We set the
activation to “ReLU”, which is a very common activation function in neural
networks and excludes the negative values obtained from the GRU layer operation.
The output of the GRU layer is then fed into a dense layer containing 100
neurons with “ReLU” activation for further feature extraction, just as in the
LSTM model.
Finally, the output of the first dense layer is fed into another dense layer
with a single neuron and the Sigmoid activation function, which maps the output
into the range (0,1). A value close to 0 is therefore recognized as a benign
image and a value close to 1 is recognized as a malware image by the GRU model.
We used “adam” as our model optimizer and set “accuracy” as the metric, and we
trained the model with a batch size of 64 for 20 epochs, just like the LSTM
model. To check the accuracy on held-out samples, we declared a validation split
of 40 percent.
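Two ideas from this section can be illustrated in a few lines: multiplying by a gate value of 0 or 1 discards or preserves information, and the sigmoid output of the last layer is turned into a class label. The numbers below are made up, and the decision threshold of 0.5 is an assumption, since the text only says "close to 0" and "close to 1".

```python
def apply_gate(values, gate):
    """Element-wise gating: 0 discards a value, 1 preserves it."""
    return [g * v for g, v in zip(gate, values)]

def to_label(probability, threshold=0.5):
    """Map a sigmoid output to 0 (benign) or 1 (malware)."""
    return 1 if probability >= threshold else 0

memory = [4.0, -2.0, 6.0]
gate = [0.0, 1.0, 0.5]             # discard, preserve, keep half
gated = apply_gate(memory, gate)

outputs = [0.03, 0.48, 0.51, 0.97]  # hypothetical sigmoid outputs
labels = [to_label(p) for p in outputs]
names = ["malware" if label == 1 else "benign" for label in labels]
print(gated, labels, names)
```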
Chapter 4
Experimental Setup
2. NumPy: We used the NumPy library, which is used for working with arrays.
During data pre-processing we converted the feature set list into a NumPy
array, as already described in the data description part.
7. Accuracy Metric: The accuracy metric measures the fraction of all
predictions that are correct.
8. Epoch: An epoch is one full pass of the training set through the network.
The full dataset needs to be passed through the same neural network many
times, so we use more than one epoch to optimize the learning process, since
we use a limited dataset [40].
9. Batch size: The batch size is the number of samples that are passed through
the network at one time. Batch size and epoch are not the same thing.
Generally, the larger the batch size, the faster the model can complete each
epoch during training. This is because, depending on the available computing
resources, a computer may be able to process more than one sample at a time
[32].
10. model.compile(): Keras provides a compile() method that configures a model
for training; the loss function, optimizer, and metrics are passed to it.
11. model.fit(): This method trains the model on the training data and can
evaluate it on validation data; the batch size and the number of epochs are
also specified here.
12. model.predict(): The predict() function takes an array of one or more data
samples and returns the model's predictions for them. It can be used to assess
the performance of the models on the test dataset.
13. Scikit-learn (sklearn module): Scikit-learn is a Python library that offers
both unsupervised and supervised learning algorithms. It is built on
technologies that we may already know, such as NumPy, SciPy, and Matplotlib
[2].
14. Train-test split: Train-test split is a method that divides the data into a
training set and a test set, so that a model can be evaluated on data it was
not trained on.
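The relationship between dataset size, validation split, batch size, and epochs listed above can be worked through with plain Python. The dataset size of 1,000 samples is hypothetical (the size of the training set is not restated here); the batch size of 64, 20 epochs, and a 30 percent validation split are the values used for the CNN model.

```python
import math

def training_schedule(n_samples, validation_split, batch_size, epochs):
    """Return (train size, validation size, steps per epoch, total steps)."""
    n_val = int(n_samples * validation_split)
    n_train = n_samples - n_val
    steps_per_epoch = math.ceil(n_train / batch_size)  # last batch may be smaller
    return n_train, n_val, steps_per_epoch, steps_per_epoch * epochs

def batches(items, batch_size):
    """Yield consecutive mini-batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

schedule = training_schedule(n_samples=1000, validation_split=0.3,
                             batch_size=64, epochs=20)
example_batches = list(batches(list(range(10)), 4))
print(schedule, example_batches)
```

One epoch therefore consists of many weight updates (one per batch), which is why batch size and epoch are different quantities.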
Figure 4.1: Brief working process of how CNN model works with training, validation
and testing data
2. For LSTM model: Unlike for the CNN, in this model we imported the train-test
split method from sklearn and used it to divide the data into two segments, a
train set and a test set. The test set was then passed as validation data when
we fitted our model. From Keras, we imported Sequential, which allows us to
stack the layers sequentially, and also imported LSTM and Dense for building
the Long Short-Term Memory (LSTM) model. We created our LSTM model as described
earlier in the model description part. After that, we compiled our model using
model.compile(), defining the loss function as “binary_crossentropy”, the
optimizer as “adam”, and the metric as “accuracy”. Lastly, we used model.fit()
to pass the training and validation data along with a batch size of 64 and 20
epochs.
We pre-processed the testing dataset in the same way as the training dataset.
Then, we used the model.predict() function, to which we passed the test set
array, to evaluate the model's performance on the testing dataset. To analyze
our model results we used a confusion matrix, which we imported from sklearn.
The obtained results are shown and described in the Result Analysis chapter
that follows.
Figure 4.2: Brief working process of how LSTM model works with training, valida-
tion and testing data
3. For GRU model: As for the LSTM model, we imported the train-test split
method from sklearn and used it to divide the data into two parts, a train set
and a test set. Just as for the LSTM, the test set was passed as validation
data when we fitted our model. From Keras, we imported Sequential, which allows
us to stack the layers sequentially, and also imported GRU and Dense for
building the Gated Recurrent Unit (GRU) model. We created our GRU model as
described earlier in the model description part. After that, we compiled our
model using model.compile(), defining the loss function as
“binary_crossentropy”, the optimizer as “adam”, and the metric as “accuracy”.
Lastly, we used model.fit() to pass the training and validation data along with
a batch size of 64 and 20 epochs.
We pre-processed the testing dataset in the same way as the training dataset.
Then, we used the model.predict() function, to which we passed the test set
array, to evaluate the model's performance on the testing dataset. To analyze
our model results we again used a confusion matrix, which we imported from
sklearn. The obtained results are shown and described in the Result Analysis
chapter that follows.
Figure 4.3: Brief working process of how GRU model works with training, validation
and testing data
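All three models are compiled with binary cross-entropy as the loss function. The following plain-Python sketch shows what this loss computes for a batch of predictions; it is an illustration of the formula, not the Keras implementation, and the clipping constant and example values are assumptions.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0]                                         # malware, benign
confident_right = binary_crossentropy(y_true, [0.9, 0.1])
confident_wrong = binary_crossentropy(y_true, [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident but wrong predictions, which is what the optimizer minimizes during model.fit().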
Chapter 5
Result Analysis
true negatives, false positives, and false negatives [30].
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
3. Precision: Precision is the ratio of correctly predicted positive cases to
all predicted positive cases, that is, the sum of the correctly and the falsely
predicted positive cases. In information-retrieval terms, precision is the
fraction of the retrieved records that are relevant to the query [39].
$$\text{precision} = \frac{TP}{TP + FP}$$
4. Recall: Recall is the percentage of all relevant outcomes that are correctly
classified by the model. It is the ratio of correctly identified positive cases
to all actual positive cases, that is, the sum of the true positives and the
false negatives [39].
$$\text{recall} = \frac{TP}{TP + FN}$$
5. F1-score: The F1-score expresses the balance between precision and recall.
The nearer the score is to 1 (or 100 percent), the better the model is
performing [39].
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
6. Sample Loss: Sample loss is measured on the training dataset and shows how
well the model is learning the known data. Loss is not a percentage; it is an
aggregation of the errors made for each example during training. The smaller
the loss, the better the model is learning.
7. Sample Accuracy: Sample accuracy is calculated on the training dataset and
shows how well the model is learning the known data. Accuracy is measured as a
percentage, and the higher the percentage, the better the model is learning.
8. Validation Loss: Validation loss is calculated on the validation dataset,
which is held back from the training dataset, and shows how the model performs
on the unseen portion of the training data. Loss is not a percentage; it is an
aggregation of the errors made for each example during validation. The smaller
the loss, the better the model is prepared for classifying a real-world or
testing dataset.
9. Validation Accuracy: Validation accuracy is calculated on the validation
dataset, which is held back from the training dataset, and shows how the model
performs on the unseen portion of the training data. Accuracy is measured as a
percentage, and the higher the percentage, the better the model is prepared for
classifying a real-world or testing dataset.
10. Overfitting: Overfitting occurs when a model learns the training dataset too
well: it performs effectively on the training dataset but does not perform
effectively on a holdout dataset, which is also known as the validation dataset.
11. Underfitting: Underfitting occurs when a model fails to learn the problem
properly: it performs poorly on the training dataset and, as a result, also
performs poorly on the holdout dataset, which is known as the validation
dataset.
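The four formulas above can be combined into one small helper. The sketch below is a direct transcription of the definitions; the function name and the example counts are hypothetical and are not results from our experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```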
Figure 5.2: Graph showing CNN model result on training and validation datasets
In the above graph, as the epochs proceed, the validation loss decreases and at
the same time the validation accuracy increases. It can therefore be noted that
the CNN model is neither overfitting nor underfitting, since the model performs
as well on the validation dataset as on the training dataset while the learning
process improves in each epoch. Moreover, over the final epochs the model
achieves consistent results.
The average validation accuracy for the CNN model is 84 percent.
Figure 5.3: Confusion Matrix of CNN model
In the above diagram, the confusion matrix of the CNN model is shown. We
classify malware as positive events (1) and benign as negative events (0). The
rows represent actual values and the columns represent predicted values.
Here, 2,418 malware samples are correctly predicted by the model, which is the
true positive (TP) value, and 546 benign samples are correctly predicted, which
is the true negative (TN) value. Also, 537 samples are actually benign (negative
events) but the model predicted them as malware (positive events), which is the
false positive (FP) value, and 82 samples are malware (positive events) but the
model predicted them as benign (negative events), which is the false negative
(FN) value. Thus, the main diagonal (top-left to bottom-right) holds the
correctly classified counts and the anti-diagonal holds the incorrectly
classified counts of the CNN model.
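The layout of the confusion matrix (rows for actual classes, columns for predicted classes) can be reproduced with a few lines of plain Python. The labels below are made-up examples; the helper is a simplified stand-in for the sklearn function used in our experiments.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows = actual class, columns = predicted class (0 benign, 1 malware)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # hypothetical actual labels
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]   # hypothetical predictions

cm = confusion_matrix(y_true, y_pred)
tn, fp = cm[0]   # actual benign: predicted benign, predicted malware
fn, tp = cm[1]   # actual malware: predicted benign, predicted malware
print(cm, tp, tn, fp, fn)
```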
Table 5.1: A table showing performance measure for CNN model on the testing
dataset
We used a total of 3,583 images in the testing dataset (malware and benign
images that the model had not seen) to evaluate our model and obtained the
performance measures (precision, recall, and F1-score) shown in the table above.
For malware (positive events), we got a precision of 82 percent, which means
that 82 percent of the samples predicted as malware are actually malware. We
also got a recall of 97 percent, which means that 97 percent of the actual
malware samples were correctly identified, a strong result for malware
detection. The F1-score, which represents the balance between precision and
recall, is 90 percent. It is close to 100 percent, which means that the CNN
model performs well in detecting malware (positive events).
For benign (negative events), we got a precision of 87 percent, which means
that 87 percent of the samples predicted as benign are actually benign. We also
got a recall of 50 percent, which means that only 50 percent of the actual
benign samples were correctly identified. The F1-score, which represents the
balance between precision and recall, is 60 percent, which means that the CNN
model performs only moderately in identifying benign samples (negative events).
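The per-class precision and recall values can be recomputed from the CNN confusion-matrix counts given above. For the benign class the roles are swapped: the true negatives count as correct predictions, the false negatives are samples wrongly predicted as benign, and the false positives are benign samples that were missed. The helper name is ours.

```python
# CNN counts reported above: TP, TN, FP, FN
tp, tn, fp, fn = 2418, 546, 537, 82

def precision_recall(correct, wrongly_assigned, missed):
    """Precision and recall for one class.

    correct: samples of the class that were predicted as the class
    wrongly_assigned: samples of the other class predicted as this class
    missed: samples of this class predicted as the other class
    """
    precision = correct / (correct + wrongly_assigned)
    recall = correct / (correct + missed)
    return precision, recall

malware_precision, malware_recall = precision_recall(tp, fp, fn)
benign_precision, benign_recall = precision_recall(tn, fn, fp)
print(round(malware_precision, 2), round(malware_recall, 2))
print(round(benign_precision, 2), round(benign_recall, 2))
```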
5.2.2 Result analysis for GRU model
Figure 5.4: Graph showing GRU model result on training and validation dataset
In the above graph, from epoch 14 onwards, the validation loss decreases and at
the same time the validation accuracy increases. The model also does well in the
first half of the epochs. It can therefore be observed that the GRU model is
neither overfitting nor underfitting, since the model performs as well on the
validation dataset as on the training dataset while the learning process
improves in each epoch.
The average validation accuracy for the GRU model is 79 percent.
In the above diagram, the confusion matrix of the GRU model is shown. As already
mentioned, we classify malware as positive events (1) and benign as negative
events (0). The rows represent actual values and the columns represent predicted
values.
Here, 2,248 malware samples are correctly predicted by the model, which is the
true positive (TP) value, and 460 benign samples are correctly predicted, which
is the true negative (TN) value. Also, 623 samples are actually benign (negative
events) but the model predicted them as malware (positive events), which is the
false positive (FP) value, and 252 samples are malware (positive events) but the
model predicted them as benign (negative events), which is the false negative
(FN) value. Thus, the main diagonal (top-left to bottom-right) holds the
correctly classified counts and the anti-diagonal holds the incorrectly
classified counts of the GRU model.
Table 5.2: A table showing performance measure for GRU model on the testing
dataset
5.2.3 Result analysis for LSTM model
Figure 5.6: Graph showing LSTM model result on training and validation dataset
In the above graph, from epoch 12 onwards, the validation loss decreases and at
the same time the validation accuracy improves. It can therefore be observed
that the LSTM model, like the other models, is neither overfitting nor
underfitting, since the model performs as well on the validation dataset as on
the training dataset while the learning process improves in each epoch.
The average validation accuracy for the LSTM model is 74 percent.
In the above diagram, the confusion matrix of the LSTM model is shown. As
already mentioned, we classify malware as positive events (1) and benign as
negative events (0). The rows represent actual values and the columns represent
predicted values.
Here, 2,197 malware samples are correctly predicted by the model, which is the
true positive (TP) value, and 122 benign samples are correctly predicted, which
is the true negative (TN) value. Both the true positive and the true negative
values are the lowest among the three models. Also, 961 samples are actually
benign (negative events) but the model predicted them as malware (positive
events), which is the false positive (FP) value, and 303 samples are malware
(positive events) but the model predicted them as benign (negative events),
which is the false negative (FN) value. The false positive and false negative
values are higher than those of the other two models. To sum up, the main
diagonal (top-left to bottom-right) holds the correctly classified counts and
the anti-diagonal holds the incorrectly classified counts of the LSTM model.
Table 5.3: A table showing performance measure for LSTM model on the testing
dataset
The same testing dataset was used to evaluate the LSTM model, and we obtained
the performance measures (precision, recall, and F1-score) shown in the table
above.
For malware (positive events), we got a precision of 70 percent, which means
that 70 percent of the samples predicted as malware are actually malware. We
also got a recall of 88 percent, which means that 88 percent of the actual
malware samples were correctly identified, a reasonable result for the LSTM
model. The F1-score, which represents the balance between precision and recall,
is 78 percent, so the LSTM model performs moderately in detecting malware
(positive events). However, all the measures for positive events are lower for
the LSTM model than for the other two models.
For benign (negative events), we got a precision of 29 percent, which means
that only 29 percent of the samples predicted as benign are actually benign. We
also got a recall of 11 percent, which means that only 11 percent of the actual
benign samples were correctly identified, which is much lower than for the other
two models. The F1-score, which represents the balance between precision and
recall, is therefore only 16 percent, which means that the LSTM model performs
worst in identifying benign samples (negative events).
5.3 Accuracy analysis for all three models
First, the CNN model gives an accuracy of 83 percent on the testing dataset,
which means that the model correctly classifies 83 percent of the malware
(positive) and benign (negative) samples, the best result among the three
models. This is consistent with the strength of CNNs in image recognition: both
convolutional layers apply several filters that scan the full feature matrix,
and the pooling layers reduce its spatial size. This makes a CNN a very
convenient and appropriate network for classifying and processing images.
Second, the GRU model gives an accuracy of 76 percent on the testing dataset,
which means that the model correctly classifies 76 percent of the malware
(positive) and benign (negative) samples. It performs better than the LSTM
model, as shown in the histogram above; a GRU also trains faster than an LSTM
because it involves fewer operations. Lastly, the LSTM model gives an accuracy
of 65 percent on the testing dataset, which means that the model correctly
classifies 65 percent of the samples, the lowest result among the three models.
LSTMs are hard to train because their computation is memory-bandwidth-bound,
which also limits their use in practical implementations.
To sum up, the CNN performs best among the three models in detecting malware,
and it also shows the best performance measures (precision, recall, and
F1-score) for both positive and negative events, which again suggests that CNNs
are well suited to image-based classification problems. However, we do not reach
close to 100 percent accuracy in our research because of the limited dataset,
and because we use only one kind of feature, the static feature, as our dataset
contains static analysis data.
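The accuracy values quoted in this section follow directly from the confusion-matrix counts reported for each model in the previous section. The short sketch below recomputes them; only the variable names are ours.

```python
# (TP, TN, FP, FN) as reported in the result analysis of each model
confusion = {
    "CNN": (2418, 546, 537, 82),
    "GRU": (2248, 460, 623, 252),
    "LSTM": (2197, 122, 961, 303),
}

accuracy = {}
for name, (tp, tn, fp, fn) in confusion.items():
    total = tp + tn + fp + fn           # 3,583 test images for every model
    accuracy[name] = (tp + tn) / total  # share of correctly classified images

# Models ordered from highest to lowest test accuracy
ranking = sorted(accuracy, key=accuracy.get, reverse=True)
for name in ranking:
    print(name, round(accuracy[name], 2))
```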
Chapter 6
The amount of malware is growing each year, and new forms of threats are more
disruptive and complicated than ever. Attackers continue to accelerate malware
development by applying techniques such as mutation at a startling pace.
Automated detection using highly accurate models could be the only viable option
in the future to address this problem. We therefore proposed three kinds of
neural network models for malware detection and compared their performance
measures, as shown in the result analysis chapter. We achieved satisfactory
results with all the models, and none of the models behaved abnormally, as we
observed no underfitting or overfitting. The Convolutional Neural Network (CNN)
provided the best results among the three models, since CNNs perform well in
image classification tasks. We hope that our models can contribute to reducing
the cyber vulnerability of Bangladesh. In future work, we plan to combine the
neural network models, for example CNN with LSTM or GRU, and to apply various
cross-validation techniques in order to achieve higher accuracy than we have
now.
Bibliography
[1] S. Haykin, Neural Networks and Learning Machines, ser. Neural networks and
learning machines v. 10. Prentice Hall, 2009, p. 794, isbn: 9780131471399.
[Online]. Available: https://books.google.com.bd/books?id=K7P36lKzI%5C_QC.
[2] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.
Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D.
Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine
learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–
2830, 2011.
[3] L. E. Gonzalez and R. A. Vazquez, “Malware classification using euclidean
distance and artificial neural networks,” in 2013 12th Mexican International
Conference on Artificial Intelligence, IEEE, 2013, pp. 103–108.
[4] F. Chollet et al., Keras, https://keras.io, 2015.
[5] J. Saxe and K. Berlin, “Deep neural network based malware detection using
two dimensional binary program features,” in 2015 10th International Confer-
ence on Malicious and Unwanted Software (MALWARE), IEEE, 2015, pp. 11–
20.
[6] M. Oulehla, Z. K. Oplatková, and D. Malanik, “Detection of mobile botnets
using neural networks,” in 2016 Future Technologies Conference (FTC), IEEE,
2016, pp. 1324–1326.
[7] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, “Malware
detection with deep neural network using process behavior,” in 2016 IEEE
40th Annual Computer Software and Applications Conference (COMPSAC),
IEEE, vol. 2, 2016, pp. 577–582.
[8] R. Dey and F. M. Salem, “Gate-variants of gated recurrent unit (gru) neural
networks,” in 2017 IEEE 60th international midwest symposium on circuits
and systems (MWSCAS), IEEE, 2017, pp. 1597–1600.
[9] W. Wang, M. Zhu, X. Zeng, X. Ye, and Y. Sheng, “Malware traffic classifica-
tion using convolutional neural network for representation learning,” in 2017
International Conference on Information Networking (ICOIN), IEEE, 2017,
pp. 712–717.
[10] M. Abdelsalam, R. Krishnan, Y. Huang, and R. Sandhu, “Malware detection
in cloud infrastructures using convolutional neural networks,” in 2018 IEEE
11th International Conference on Cloud Computing (CLOUD), IEEE, 2018,
pp. 162–169.
[11] B. Alsulami and S. Mancoridis, “Behavioral malware classification using con-
volutional recurrent neural networks,” in 2018 13th International Conference
on Malicious and Unwanted Software (MALWARE), IEEE, 2018, pp. 103–111.
[12] C. Hasegawa and H. Iyatomi, “One-dimensional convolutional neural networks
for android malware detection,” in 2018 IEEE 14th International Colloquium
on Signal Processing & Its Applications (CSPA), IEEE, 2018, pp. 99–102.
[13] T. Hsien-De Huang and H.-Y. Kao, “R2-d2: Color-inspired convolutional neu-
ral network (cnn)-based android malware detections,” in 2018 IEEE Interna-
tional Conference on Big Data (Big Data), IEEE, 2018, pp. 2633–2642.
[14] Y. Jin, T. Liu, A. He, Y. Qu, and J. Chi, “Android malware detector exploiting
convolutional neural network and adaptive classifier selection,” in 2018 IEEE
42nd Annual Computer Software and Applications Conference (COMPSAC),
IEEE, vol. 1, 2018, pp. 833–834.
[15] C. H. Kim, E. K. Kabanga, and S.-J. Kang, “Classifying malware using con-
volutional gated neural network,” in 2018 20th International Conference on
Advanced Communication Technology (ICACT), IEEE, 2018, pp. 40–44.
[16] D. Li, Z. Wang, and Y. Xue, “Fine-grained android malware detection based
on deep learning,” in 2018 IEEE Conference on Communications and Network
Security (CNS), IEEE, 2018, pp. 1–2.
[17] M. Yeo, Y. Koo, Y. Yoon, T. Hwang, J. Ryu, J. Song, and C. Park, “Flow-
based malware detection using convolutional neural network,” in 2018 Interna-
tional Conference on Information Networking (ICOIN), IEEE, 2018, pp. 910–
913.
[18] L. Zhao, D. Li, G. Zheng, and W. Shi, “Deep neural network based on android
mobile malware detection system using opcode sequences,” in 2018 IEEE 18th
International Conference on Communication Technology (ICCT), IEEE, 2018,
pp. 1141–1147.
[19] D. Chamou, P. Toupas, E. Ketzaki, S. Papadopoulos, K. M. Giannoutakis, A.
Drosou, and D. Tzovaras, “Intrusion detection system based on network traffic
using deep neural networks,” in 2019 IEEE 24th International Workshop on
Computer Aided Modeling and Design of Communication Links and Networks
(CAMAD), IEEE, 2019, pp. 1–6.
[20] A. Oliveira, Malware analysis datasets: Raw pe as image, 2019.
doi: 10.21227/8brp-j220. [Online]. Available: https://dx.doi.org/10.21227/8brp-j220.
[21] N. P. Poonguzhali, T. Rajakamalam, S. Uma, and R. Manju, “Identification
of malware using cnn and bio-inspired technique,” in 2019 IEEE International
Conference on System, Computation, Automation and Networking (ICSCAN),
IEEE, 2019, pp. 1–5.
[22] S. Shukla, G. Kolhe, S. M. PD, and S. Rafatirad, “Rnn-based classifier to
detect stealthy malware using localized features and complex symbolic se-
quence,” in 2019 18th IEEE International Conference On Machine Learning
And Applications (ICMLA), IEEE, 2019, pp. 406–409.
[23] F. Chollet et al., Keras, https://keras.io/guides/sequential_model/, 2020.
[24] B. Zhang, W. Xiao, X. Xiao, A. K. Sangaiah, W. Zhang, and J. Zhang,
“Ransomware classification using patch-based cnn and self-attention network on
embedded n-grams of opcodes,” Future Generation Computer Systems, vol. 110,
pp. 708–720, 2020.
[25] K. Bolton, A quick introduction to artificial neural networks (part 2),
http://krisbolton.com/a-quick-introduction-to-artificial-neural-networks-part-2,
Accessed: June 5, 2018.
[26] J. Brownlee, Machine learning mastery: A gentle introduction to cross-entropy
for machine learning,
https://machinelearningmastery.com/cross-entropy-for-machine-learning/,
Accessed: October 21, 2019.
[27] J. Brownlee, Machine learning mastery: A gentle introduction to the rectified
linear unit (relu),
https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/,
Accessed: January 9, 2019.
[28] J. Brownlee, Machine learning mastery: Gentle introduction to the adam
optimization algorithm for deep learning,
https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/,
Accessed: July 3, 2017.
[29] Common vulnerabilities in cyber space of bangladesh,
https://www.cirt.gov.bd/common-vulnerabilities-in-cyber-space-of-bangladesh.
[30] Deepai: Accuracy (error rate),
https://deepai.org/machine-learning-glossary-and-terms/accuracy-error-rate.
[31] Deepai: Feed forward neural network,
https://deepai.org/machine-learning-glossary-and-terms/feed-forward-neural-network.
[32] Deeplizard: Machine learning deep learning fundamentals,
https://deeplizard.com/learn/video/U4WB9p6ODjM.
[33] Geeksforgeeks: Gated recurrent unit networks,
https://www.geeksforgeeks.org/gated-recurrent-unit-networks,
Accessed: July 14, 2019.
[34] I2tutorials: Long short-term memory: From zero to hero with pytorch,
https://www.i2tutorials.com/long-short-term-memory-from-zero-to-hero-with-pytorch/,
Accessed: June 20, 2019.
[35] S. Kostadinov, Towards data science: Understanding gru networks,
https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be,
Accessed: December 16, 2017.
[36] A. Mersch and E. Nealis, 6 common types of malware,
https://blog.totalprosource.com/5-common-malware-types,
Accessed: August 17, 2020.
[37] M. Phi, Towards data science: Illustrated guide to recurrent neural networks,
https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9,
Accessed: September 20, 2018.
[38] Prabhu, Understanding of convolutional neural network (cnn) deep learning,
https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148,
Accessed: March 4, 2018.
[39] Python machine learning tutorial, https://www.python-course.eu/metrics.php.
[40] S. Sharma, Towards data science: Epoch vs batch size vs iterations,
https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9,
Accessed: September 23, 2017.
[41] The daily swig: Cybersecurity and views, https://portswigger.net/daily-swig.
[42] S. Vasudevan, Gru explained (gated recurrent unit),
https://www.youtube.com/watch?v=xLKSMaYp2oQ, Accessed: May 3, 2020.