Malicious Code Detection based on Image Processing Using Deep Learning


Rajesh Kumar
University of Electronic Science and Technology of China
0086-15520777096
[email protected]

Zhang Xiaosong
University of Electronic Science and Technology of China
008618982067786
[email protected]

Riaz Ullah Khan
University of Electronic Science and Technology of China
0086-15520763595
[email protected]

Ijaz Ahad
University of Electronic Science and Technology of China
[email protected]

Jay Kumar
Quaid-e-Azam University Islamabad, Pakistan
00923332836704
[email protected]

ABSTRACT
In this study, we used an image similarity technique with a convolutional neural network (CNN) to detect unknown or new types of malware. The CNN was investigated and tested with three types of datasets, including one from the Vision Research Lab, which contains 9458 gray-scale images extracted from the same number of malware samples belonging to 25 different malware families, and a benign dataset, which contains 3000 different kinds of benign software. The benign dataset and the Vision Research Lab dataset were initially executable files, which were converted into binary code and then into image files. We obtained a testing accuracy of 98% on the Vision Research Lab dataset.

CCS Concepts
• Security and privacy ➝ Malware and its mitigation

Keywords
Malware Detection, Convolutional Neural Network, Malware Classification, Deep Learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICCAI 2018, March 12–14, 2018, Chengdu, China
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6419-5/18/03$15.00
https://doi.org/10.1145/3194452.3194459

1. INTRODUCTION
1.1 Background
One of the major challenges in the realm of security threats is malicious software, also referred to as malware. The main purpose of malware is to gather personal information without the user's knowledge and to disrupt computer operations, which causes problems for users. There are many kinds of malware, e.g. Virus, Worm, Trojan-horse, Rootkit, Backdoor, Spyware, Adware, etc. [2]-[4]. Annual reports from antivirus companies show that thousands of new malware samples are created every single day. This new malware has become so sophisticated that it can no longer be detected by traditional detection techniques such as signature-based detection, heuristic detection or behavior-based detection.

Signature-based detection searches an object for specific byte sequences that uniquely identify a particular type of malware. Its drawback is that it cannot detect zero-day or new malware, since the signatures of such malware are not yet listed in the signature database [5]. Heuristic-based detection was developed to overcome this limitation of signature-based detection: instead of searching for a malware signature, it scans the system's behavior in order to identify activities that do not seem normal. Heuristic-based detection can be applied to newly created malware whose signature is not yet known. The limitation of this technique is that it affects the system's performance and requires more space. Behavior-based detection is concerned with the behavior of a program while it is executing. If a program executes normally, it is marked as benign; otherwise it is marked as malware. From this definition, we can directly conclude that the drawback of behavior-based detection is that it produces many false positives and false negatives: a benign program can crash and be marked as a virus, while a virus can execute as if it were a normal program and simply be marked as benign.

1.2 Motivations
Malware grows in volume every day, so we used an image processing technique to improve detection accuracy and performance. The image processing technique analyzes malware binaries as gray-scale images. Previous research [10] proposed a visualization method that classifies malware using image processing. Mature image processing techniques are already widely used for object recognition; for example, Taobao, a popular shopping website in China, finds products using image recognition. This method achieves high accuracy in practice. In this study, we converted binary code to images for recognizing malware, which preserves the visual similarity between variants of the same family. We observed that the image recognition method helps achieve better performance and accuracy.

1.3 Our approach
Malware classified into different families has multiple characteristics or features. Many authors have used machine learning models such as regression, k-nearest neighbors, and random forests. The main disadvantage of these classical machine learning models is that feature extraction is manual. Gavrilut et al. [8] gave an overview of different machine learning techniques previously proposed for malware detection. Unlike classical machine learning, deep learning skips the manual feature extraction step. For instance, we can feed images and videos directly to a deep learning algorithm, which can then predict the object. In this sense, a deep learning model requires less manual engineering than a classical machine learning model. We used convolutional neural networks because they are reliable and can be applied to the entire image at once, which makes them well suited to feature extraction. Convolutional neural networks [6] have recently been applied to malware detection using image-based similarity. Automated image comparison helps analysts visually identify common code portions or specific instruction blocks within a sample. In this work, we used three different datasets and compared the accuracy. We also used different techniques to prepare the datasets for training and testing. We trained and tested the CNN model for a better understanding of malware behavior. Overall, we show that our proposed approach is a valuable asset in the fight against malware. Figure 1 gives a brief overview of the stages from data preparation to malware detection.

Figure 1. Workflow diagram; from data preparation to malware detection.

1.4 Contributions
The main contributions of the paper are summarized as follows:
• We used Convolutional Neural Networks for the detection of malware based on image similarity, which is further described in Section 3.
• We successfully analyzed and detected unknown or new types of malware.
• We achieved better results in terms of training / testing accuracy and speed of detection, which is further described in Section 3.
• We achieved 98% accuracy on the Vision Research Lab dataset.

1.5 Structure of paper
Section 1 discusses the background, motivation, approach used in this study and main contributions of this work. Section 2 gives a brief overview of the methodology, i.e. how to convert executable files into images and how to set up the Python libraries. Section 3 proposes a malware detection technique, discusses the optimized CNN model, and describes the implementation and the experimental results in terms of accuracy. Finally, Section 4 concludes the paper.

2. DATA PREPARATION AND ENVIRONMENT SETUP
This section is divided into two parts. The first part covers the collection of malware and benign datasets from different sources, and the second part describes the techniques used to prepare the dataset. The preparation technique we used is described in Section 2.2.

2.1 Collection of Dataset
We collected three datasets from different sources. Two of them are malicious datasets from two different sources, i.e. the Vision Research Lab and the Microsoft Malware Classification Challenge. We also collected 3000 benign files from different sources. All three datasets are discussed briefly in the following subsections.

2.1.1 Vision Research Lab Dataset
The first dataset was collected from the Vision Research Lab and is called the Malimg Dataset [10]. The dataset comprises 25 malware families, and the number of variants differs from family to family. The dataset is shown in Table 1 along with the class name, family name and number of samples.

Table 1. Malimg dataset from Vision Research Lab

No  Class               Family           No. of Samples
1   Worm                Allaple.L        1591
2   Worm                Allaple.A        2949
3   Worm                Yuner.A          800
4   PWS                 Lolyda.AA1       231
5   PWS                 Lolyda.AA2       184
6   PWS                 Lolyda.AA3       123
7   Trojan              C2Lop.P          146
8   Trojan              C2Lop.gen!G      200
9   Dialer              Instantaccess    431
10  Trojan Downloader   Swizzor.gen!I    132
11  Trojan Downloader   Swizzor.gen!E    128
12  Worm                VB.AT            408
13  Rogue               Fakerean         381
14  Trojan              Alueron.gen!J    198
15  Trojan              Malex.gen!J      136
16  PWS                 Lolyda.AT        159
17  Dialer              Adialer.C        125
18  Trojan Downloader   Wintrim.BX       97
19  Dialer              Dialplatform.B   177
20  Trojan Downloader   Dontovo.A        162
21  Trojan Downloader   Obfuscator.AD    142
22  Backdoor            Agent.FYI        116
23  Worm:AutoIT         Autorun.K        106
24  Backdoor            Rbot!gen         158
25  Trojan              Trojan           Trojan
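The gray-scale images used in this work are obtained by reading the bytes of an executable as pixel intensities. A minimal sketch of that idea follows. The function name `bytes_to_gray_rows`, the fixed row width, and the decision to drop an incomplete last row are assumptions made for illustration only; the source does not state how the image width was chosen.

```python
from typing import List


def bytes_to_gray_rows(data: bytes, width: int = 256) -> List[List[int]]:
    """Arrange raw bytes as rows of 8-bit gray levels (0 = black, 255 = white).

    `width` is a hypothetical fixed row length; an incomplete final row
    is dropped so that every row has the same number of pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = len(data) // width
    return [list(data[r * width:(r + 1) * width]) for r in range(height)]


# Stand-in for the contents of an executable; a real run would use
# open(path, "rb").read() instead.
sample = bytes(range(256)) * 4
image = bytes_to_gray_rows(sample, width=32)
print(len(image), len(image[0]))
```

Each inner list is one image row, so the result can be handed to an imaging library or written out as a PGM file if an actual picture is needed.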

Figure 2. Images extracted from malware.

The Malimg Dataset consists of 9,458 gray-scale images of 25 malware families. A 90-10 ratio was used for model performance evaluation: 90% of the total data was used for training and 10% was used for testing. The real malware binaries of this dataset were available in [1].
As Gavrilut et al. [7] explained, the binary code of a given malware can be read as a vector of 8-bit unsigned integers and organized into a 2-dimensional array, which can be visualized as a gray-scale image with values in the range [0,255], where 0 represents black and 255 represents white. The size of the image differs depending on the family. We observed in Figure 2 that images belonging to the same family look very similar to one another.

2.1.2 Benign files
We collected 3000 benign files from different sources.

2.2 Data Preparation Techniques
This paper proposes the following two techniques to process the data.

2.2.1 Direct Convert Assembly to Image

Figure 3. Overview architecture of Preparation Dataset.

Decompiling: we used the following algorithm to decompile the exe file to binary and assembly.
Convert Assembly Code to Image: The process of converting assembly to images is shown in Figure 3.

2.3 Environment Setup
A 64-bit CentOS system with 8 GB of RAM was used to perform the tests. We used the Python programming language to perform the experiments. Software such as TensorFlow, Docker Server, and Anaconda was used in the experiments. The TensorFlow library was used to train the convolutional neural network (CNN) model.

3. IMPLEMENTATION AND PERFORMANCE EVALUATION OF THE PROPOSED MODEL
3.1 Proposed Model
In this design we divided the model into two phases: i) the training phase and ii) the detection phase. For both the training and the detection of malware we used a CNN model, as shown in Figure 4. We prepared the dataset using the techniques described in the data preparation section. The output of the data preparation section is "image files". Images have binary labels, i.e. either benign or malware. We used a supervised learning model in which the features are extracted automatically. The detection phase is also shown in Figure 4: an exe file is converted into an image in the same way, and the trained classifier detects the malicious code.

3.2 Training Convolutional Neural Networks Structure
We used convolutional neural networks because they are reliable and can be applied to the entire image at once, which makes them well suited to feature extraction. A convolutional neural network is a feed-forward neural network in which the connectivity pattern between neurons is inspired by the structure of the animal visual cortex, and it has proven to be of great value in the analysis of visual imagery.

Figure 4. Architecture of proposed Method.

All the images are reshaped to a size of 128 X 128 pixels. Since all deep learning models accept data in the form of numbers, we used the image library from the PIL package of Python to generate vectors from the images, and further processing is done on these vectors.
We then designed a three-layer-deep Convolutional Neural Network for the detection task, with the following properties. Each layer first applies a two-dimensional convolution and then a nonlinear layer, also known as an activation layer, for which we use Rectified Linear Units (ReLU). The convolutional layer performs operations such as element-wise multiplications and summations, and the ReLU adds non-linearity to the system. We used ReLU instead of other non-linear functions because it is faster than tanh or sigmoid and helps with the vanishing gradient problem, which arises in the lower layers of the network.
We also used max pooling layers instead of other pooling layers. A max pooling layer takes a filter and a stride of the same length, applies it to the input volume, and outputs the maximum value in each sub-region that the filter covers. The intuition behind this choice is that our malware images are gray-scale, and layers such as average pooling may not help much because there is a lot of dark space in the images, which contributes little to the model.
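The preference for max pooling on mostly dark images can be illustrated with a small sketch. The helper names `max_pool_2d` and `avg_pool_2d`, the 2 X 2 window, and the toy patch are assumptions chosen for illustration; this is not the TensorFlow implementation used in the experiments.

```python
def max_pool_2d(image, size=2):
    """Non-overlapping max pooling: window length and stride are both `size`."""
    rows = len(image) // size
    cols = len(image[0]) // size
    return [
        [
            max(image[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def avg_pool_2d(image, size=2):
    """Non-overlapping average pooling over the same windows, for comparison."""
    rows = len(image) // size
    cols = len(image[0]) // size
    return [
        [
            sum(image[r * size + i][c * size + j]
                for i in range(size) for j in range(size)) / (size * size)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A mostly dark 4 x 4 patch with two bright pixels.
patch = [
    [0, 0, 0, 0],
    [0, 200, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 90],
]
pooled_max = max_pool_2d(patch)
pooled_avg = avg_pool_2d(patch)
print(pooled_max)
print(pooled_avg)
```

In this toy patch, max pooling keeps the bright values unchanged, while average pooling dilutes each of them with the surrounding zeros.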
The output we want is the single class to which the given malware belongs. After applying all the layers, we have a three-dimensional array of feature maps. To convert this array into class probabilities, we flatten it into a single one-dimensional layer, known as a fully connected layer. Down-sampling all the feature maps to a one-dimensional vector in one step may lead to loss of information. For that reason, we used two fully connected layers.
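To make the flattening step concrete, the following sketch traces the feature-map shapes from the 128 X 128 input through three convolutional layers with 32, 32, and 64 filters and then flattens a small volume. It assumes a single gray-scale input channel, convolutions that preserve height and width ("same" padding), and a 2 X 2 max pooling step after every convolutional layer. None of these details are stated in the source, so the resulting sizes are illustrative only.

```python
def trace_shapes(side=128, filters=(32, 32, 64), pool=2):
    """Return the (height, width, channels) shape after each conv + pool stage."""
    shapes = [(side, side, 1)]  # assumed single gray-scale channel
    for n_filters in filters:
        # Assumed 'same' padding: the convolution keeps height and width.
        # Assumed pooling with window == stride == pool: both are divided.
        side //= pool
        shapes.append((side, side, n_filters))
    return shapes


def flatten(volume):
    """Turn a height x width x channels nested list into one flat list."""
    return [value for row in volume for pixel in row for value in pixel]


shapes = trace_shapes()
height, width, channels = shapes[-1]
# Number of inputs the first fully connected layer would receive.
flat_length = height * width * channels

tiny_volume = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 x 2 x 2 toy example
print(shapes)
print(flat_length)
print(flatten(tiny_volume))
```

Under these assumptions the last feature volume is 16 X 16 X 64, so the first fully connected layer would receive 16384 values.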
The cross-entropy loss function, which is commonly used for multi-class classification, was used in this work, together with the Adam optimizer for the optimization task. The overall architecture of the model is shown in Figure 5.
Initially, all the images were of different sizes and had to be converted to 128 X 128 pixels before they were used as input to the model.

Figure 5. Overview architecture of CNN proposed Method.

3.3 Implementation
This section lists the tools and techniques required for the experimental setup. To prepare the dataset, we used the method described in Section 2, which also covers the tools and the algorithm used to arrange the dataset for better results. To detect malware, we used supervised learning to train the model. The CNN algorithm was used to train and test the model. Figure 5 shows 3 hidden layers, each with its own parameters (e.g., filter_size1 = 3, num_filters = 32, etc.). In this algorithm we used AdamOptimizer with a learning rate of 1e-4. The sizes of the hidden layers of the convolutional neural network are 3*3*32, 3*3*32, and 3*3*64, respectively. For validation, the system was trained for 20 epochs.

3.4 Experiment Results
We considered two datasets, which are discussed in Section 2: one consists of malicious code and the other of benign files. For the first test, we combined the benign dataset with the Malimg dataset and used the combined dataset to measure accuracy in terms of malware code detection. The experiment shows an accuracy of 98% for the Vision Research Lab dataset, as shown in Figures 6 and 7.

Figure 6. Accuracy of CNN model.

Figure 7. Loss of CNN model.

4. CONCLUSION
Being able to visualize malicious code as a gray-scale image has been a great achievement. Many researchers have been using this technique for malware classification and detection. However, other works have shown that this technique is vulnerable to adversarial attacks and can produce erroneous results. The works in [9], [11], and [12] show how a small change in an image can lead to misclassification. The biggest challenge is to find an efficient way to overcome this vulnerability of neural networks. This could be achieved by carefully analyzing malware binaries.

5. REFERENCES
[1]. Vision research lab malimg dataset. http://old.vision.ece.ucsb.edu/spam/malimg.shtml.
[2]. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M. and Ghemawat, S., 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
[3]. Adebayo, O.S. and Aziz, N.A., 2015. Static Code Analysis of Permission-based Features for Android Malware Classification Using Apriori Algorithm with Particle Swarm Optimization. Journal of Information Assurance & Security, 10(4).
[4]. Alme, C., Mcafee, Inc., 2012. Systems, apparatus, and methods for detecting malware. U.S. Patent 8,312,546.
[5]. Bennasar, H., Bendahmane, A. and Essaaidi, M., 2017, April. An Overview of the State-of-the-Art of Cloud Computing Cyber-Security. In International Conference on Codes, Cryptology, and Information Security (pp. 56-67). Springer, Cham.
[6]. Cao, C., Liu, X., Yang, Y., Yu, Y., Wang, J., Wang, Z., Huang, Y., Wang, L., Huang, C., Xu, W. and Ramanan, D., 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2956-2964).
[7]. Gavriluţ, D., Cimpoeşu, M., Anton, D. and Ciortuz, L., 2009, October. Malware detection using machine learning. In Computer Science and Information Technology, 2009. IMCSIT'09. International Multiconference on (pp. 735-741). IEEE.
[8]. Gavriluţ, D., Cimpoeşu, M., Anton, D. and Ciortuz, L., 2009, October. Malware detection using machine learning. In Computer Science and Information Technology, 2009. IMCSIT'09. International Multiconference on (pp. 735-741). IEEE.

[9]. Goodfellow, I.J., Shlens, J. and Szegedy, C., 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[10]. Nataraj, L., Yegneswaran, V., Porras, P. and Zhang, J., 2011, October. A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence (pp. 21-30). ACM.
[11]. Nguyen, A., Yosinski, J. and Clune, J., 2015. Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images. Nguyen_Deep_Neural_Networks_2015_CVPR.
[12]. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B. and Swami, A., 2017, April. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (pp. 506-519). ACM.
