
IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2021

Deep Learning and Visualization for Identifying Malware Families

Guosong Sun and Quan Qian

Abstract—The growing threat of malware is becoming more and more difficult to ignore. In this paper, a malware feature-image generation method is proposed that combines the static analysis of malicious code with recurrent neural networks (RNNs) and convolutional neural networks (CNNs). By using an RNN, our method considers not only the original information of the malware but also the ability to associate the original code with timing characteristics; furthermore, the process reduces the dependence on malware category labels. We then use minhash to generate feature images from the fusion of the original codes and the predictive codes produced by the RNN. Finally, we train a CNN to classify the feature images. When training with very few samples (a 1:30 ratio of training set size to validation set size), we obtained an accuracy of over 92 percent; when the ratio is adjusted to 3:1, the accuracy exceeds 99.5 percent. As shown by the confusion matrices, our method obtains good results: the worst false positive rate over all malware families is 0.0147, and the average false positive rate is 0.0058.

Index Terms—Malware family identification, malware feature image, recurrent neural network, convolutional neural network

1 INTRODUCTION

MALWARE is an abbreviated term for malicious software. This kind of software is specifically designed to gain access to or damage a computer without the permission of the owner, and it is growing very fast. An Internet security threat report from Symantec [1] shows that more than 430 million new unique pieces of malware were discovered in 2015, an increase of 36 percent over the year before. With the increasing influence of computer technology in daily life, malware (also known as malicious code) is becoming more and more threatening to modern life. For example, WannaCry, an encrypting ransomware worm, attacked on Friday, 12 May 2017 and impacted 200,000 individuals in over 150 countries [2], [3], [4]. Desktop PCs, smartphones, and even the networked smart gadgets around homes and offices are potentially vulnerable to thousands of pieces of malware. Attackers can reap huge profits and are difficult to track down.

At present, according to whether the program is executed, malicious code analysis can be divided into static analysis and dynamic analysis [5]. Traditional static analysis is very fast and powerful, but it cannot do a very good job when the malicious code uses compressed or encrypted binaries. Furthermore, some modern malware is authored using obfuscation techniques to defeat this type of analysis. The basic idea behind obfuscation is that either some instructions of the original code are replaced by program fragments that are semantically equivalent but more difficult to analyze, or additional instructions are added to the program that do not change its behavior [6]. When dynamic analysis is used, malware designers test for debuggers, require particular inputs, or use other evasive techniques to defeat the dynamic analysis.

Recently, deep learning has achieved disruptive results in many fields. Recurrent neural networks (RNNs) [7] are powerful sequence models and have been applied to language models [8], online handwriting recognition and generation [9], and speech recognition [10]. Convolutional neural networks (CNNs) [11] are likewise predominant in the field of image recognition. Here, we apply deep learning to computer security; some researchers have also tried to use RNNs for malware detection and classification [12], [13], [14], [15].

In this paper, we propose a static analysis method named RMVC, a combination of the initials of four words: RNN, minhash, visualization, and CNN. We use RMVC to analyze assembly-language operation codes (opcodes) and classify malicious code. RNNs are good at processing sequential information, and we use one to handle the assembly-language opcodes. Minhash [16] can generate features of the same dimension. CNNs are good at processing grid information, and the visualization of malicious code has also achieved good results. These technologies can be combined; in particular, because of the RNN, our method is able to obtain knowledge about malicious code without category labels.

G. Sun is with the School of Computer Engineering & Science, Shanghai University, Shanghai 200444, China. E-mail: [email protected].
Q. Qian is with the School of Computer Engineering & Science, Shanghai University, Shanghai 200444, China, with the Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200444, China, and also with the Materials Genome Institute, Shanghai University, Shanghai 200444, China. E-mail: [email protected].

Manuscript received 8 Nov. 2017; revised 8 Nov. 2018; accepted 15 Nov. 2018. Date of publication 6 Dec. 2018; date of current version 15 Jan. 2021. (Corresponding author: Quan Qian.)
Digital Object Identifier no. 10.1109/TDSC.2018.2884928

1545-5971 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

The advantages of our approach lie in:

1) The method by which we use the RNN: we do not use information from the hidden layers. Neural networks

Authorized licensed use limited to: Shanghai Maritime University. Downloaded on September 23,2021 at 11:57:51 UTC from IEEE Xplore. Restrictions apply.

are black boxes, and the information in the hidden layer is too abstract to interpret. We use the predictive opcodes generated by the RNN, which increases the interpretability of using an RNN. Moreover, when we analyze malicious code, we also combine the raw opcodes with the predictive codes.

2) Less demand for category labels: the main task of this paper is multi-class classification. There is not enough data with category labels, so it is crucial to make good use of unlabeled data. In our method, the RNN can obtain knowledge from malware without knowing the category labels.

3) A new method of malicious code visualization: the great success of CNNs in image processing motivates malicious code visualization. It is tricky to process the information into the same dimension. We apply minhash to extract same-dimension information from malicious code of different sizes and then map it into same-dimension images. Malicious code is closer to text information than to images; the later the visualization happens in the pipeline, the better the effect, since more information can be retained while the powerful CNN can still be used.

4) Excellent generalization: in the very tiny training dataset test, compared with traditional methods, our method improves accuracy by more than 10 percent.

5) Traditional n-gram, Markov chain, and other methods for sequence analysis have been developed for a long time [17], [18]. We hope to solve this problem in a novel way through visualization and to modularize the process of visual analysis of malware for different visualization methods.

The organization of the paper is as follows. Malware-related studies are described in Section 2. In Section 3, the RMVC algorithm is proposed and discussed in more detail. The experimental results are presented in Section 4, and Section 5 summarizes the whole paper.

2 RELATED WORK

2.1 Malware Analysis
Malware analysis methods can be divided into two types: static analysis and dynamic analysis. Static analysis examines the code by disassembling or decompiling the malware binary file without executing it. Since Schultz et al. [19] introduced the concept of data mining for detecting malware, researchers have done a lot of work in static analysis, such as Naive Bayes applied to n-gram input [17], function length frequency to classify Trojans [20], and a framework for automated malware classification based on structural information (the function call graph) of malware [21]. Although all execution of the program will be reflected in the code, in most cases static analysis is not a trivial task, since attackers use code obfuscations, such as binary packers, encryption, or self-modifying techniques, to evade static analysis. Dynamic analysis is not affected by these obfuscations, as it analyzes the behavior of the malware during execution in a sandbox such as TTAnalyzer [22] or CWSandbox [23]. A sandbox consists of a virtual machine and a simulated environment protecting the operating system from damage. A sandbox executes a malware sample in a controlled environment that can monitor and record information about system calls and behavior dynamically, which is used to determine whether the program is benign or malicious [23], [24], [25]. However, dynamic analysis only runs the program for a short timeframe, and sometimes it cannot trigger all execution paths of the program.

In 2010, Nataraj et al. [26] proposed a method of visualizing malware. The malware binary can be read as a vector of 8-bit unsigned integers and then organized into a 2D array, which can be visualized as a grayscale image. This visualization method shows the characteristics of different types of malicious software intuitively and presents a new way for malware analysis. For many malware families, the images belonging to the same family appear very similar in layout and texture. Motivated by this visual similarity, a classification method using standard image features was proposed, and several visualization techniques [27], [28], [29], [30], [31], [32], [33], [34] have been proposed for malware analysis.

Fig. 1. Structure of recurrent neural networks and bidirectional recurrent neural networks after unfolding in the time dimension.

2.2 RNN and CNN
Recurrent neural networks [7] are useful for processing sequential data. Different from traditional neural networks, RNNs, as shown in Fig. 1a, have a special loop structure that retains information from previous inputs. An RNN is much deeper than it looks: recurrent neural networks not only deliver the output of the current hidden layer to the next layer but also feed that output back as input to the current hidden layer at the next time step. The RNN can be unfolded in the time dimension, and its actual number of layers is far greater than that of a traditional feed-forward neural network. However, sometimes the prediction depends not only on the previous inputs but on the whole input sequence. To overcome the limitations of a regular RNN, Schuster and Paliwal [35] proposed the bidirectional recurrent neural network (BRNN), which can be trained using all available input information in the past and future of a specific time frame (Fig. 1b). As the name suggests, a bidirectional RNN combines an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. BRNNs can result in better performance.

Backpropagation through time (BPTT) is often used to train traditional neural networks to learn long-term dependencies in recurrent networks. To avoid the exploding [36] and vanishing [37] gradient problems, previous work has


proposed RNN architectures carefully designed to remove the long-range multiplicative characteristics of RNNs that lead to these problems, such as long short-term memory (LSTM) [38] and the gated recurrent unit (GRU) [39].

Convolutional neural networks [11], also known as CNNs, are specialized neural networks for processing data that has a known, grid-like topology. These are deep, feed-forward artificial neural networks that have successfully been applied to analyzing visual images. Basic CNNs consist of two special layers: the convolution layer and the pooling layer. When processing images with millions of pixels, convolution layers can be used to detect meaningful features, such as edges, that occupy only hundreds of pixels. Benefitting from CNNs' improved architecture and the powerful computing capability of NVIDIA hardware, the speed of training has been greatly improved. Compared to pure CPU computing, the use of NVIDIA Tesla graphics cards can achieve a severalfold to tenfold speed increase [40].

2.3 Malware Analysis Using Deep Learning
In 2014, Yuan et al. [12] used deep learning for Android malware detection. They proposed an ML-based method that utilizes more than 200 features extracted from both static analysis and dynamic analysis of an Android app for malware detection. To validate their deep learning model, they experimented on public application sets (500 samples in total, 300 samples for training). They compared deep learning with traditional machine learning models such as SVM. They also compared deep learning models with different numbers of hidden layers, demonstrating that the deep learning technique is especially suitable for Android malware detection and could achieve 96 percent accuracy on real-world Android application sets. However, they only used traditional neural networks, whose only role was to replace traditional classifiers.

In 2015, Razvan et al. [13] used recurrent neural networks to analyze malware for the first time. They found that using the recurrent model to directly classify the files was not efficient. They proposed a different approach which, similar to natural language modeling, learns the language of malware spoken through the executed instructions and extracts robust, time-domain features. They use the recurrent model to predict the next application programming interface (API) call and extract information from the hidden layer. Compared to the standard trigram-of-events model, this model improves the true positive rate by 98.3 percent. It was the first time an RNN was used to analyze malware, but they extracted information from the hidden layers, and the information in the hidden layer is too abstract to interpret.

In 2015, Eui et al. [41] applied RNNs to identify the functions of files obtained from disassembly. They trained an RNN to take bytes of the binary as input and predict, for each location, whether a function boundary is present at that location. They found that RNNs can learn much more efficiently than ByteWeight [42], which reported using 587 compute-hours; they could train on the same dataset in 80 compute-hours while achieving similar or better accuracy.

In 2016, Shun Tobiyama et al. [14] proposed a malware process detection method based on process behavior in possibly infected terminals. In their proposal, they investigated the stepwise application of deep neural networks to classify malware processes. An RNN is used for feature extraction: they train an RNN to extract features of process behavior from its hidden layers. A CNN is then trained to classify feature images generated from the features extracted by the trained RNN. They validated the classifier with 5-fold cross validation using 150 process behavior log files, evaluating several image sizes by comparing the area under the curve (AUC) of the obtained receiver operating characteristic (ROC) curves, with an AUC of 0.96 in the best case. However, this method is limited: when generating feature images, it stretches and transforms large images to force the information into the same dimension, so for large images only limited useful information is kept by this simple approach. Their method also only considers the information extracted from the hidden layers of the RNN.

3 RMVC METHOD
In this section, we describe how RMVC performs the task of malware classification, building on the prerequisite knowledge in Section 2.

3.1 Overview
Fig. 2. Overview of the RMVC method.

We use static analysis methods to analyze malicious software. RMVC consists of four parts, as shown in Fig. 2: extracting opcodes, training the RNN, generating feature images, and training the CNN. The RNN model has shown great success in natural language processing, and code language is also context-sensitive. We feed the disassembly codes into the RNN. These disassembly codes are taken from unlabeled data. The results of a malware disassembly are code blocks, which can be seen as local features of a malware sample. The interior of each code block is serialized with timing characteristics. The RNN can grasp representative timing features and study the features of each malware sample autonomously until most of its features are mastered.

In fact, Nataraj et al. [26], in 2010, were already able to see the differences between different families of malicious code by turning malicious code into grayscale images. The limitation of this method is that the lengths of malicious code samples differ, so the sizes of the generated feature images also differ, and they cannot be fed to a CNN directly. Meanwhile, using a uniform size can cause two problems:

1) If the image size is too small, some feature images of malicious code lose key information or cannot provide enough valuable information.
2) If the image size is very large, the training time of the CNN will be very long. Furthermore, it is not realistic to make the image large enough, since the length of malicious code is theoretically unbounded.

In order to improve the classification effect, more researchers have been applying neural networks to malicious code analysis.
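The fixed-dimension issue raised above is the motivation for minhash in RMVC: a constant-length sketch can be computed from opcode streams of any size, instead of stretching variable-size images. The toy example below is not the authors' code; the hash choice, the affine coefficients, and the sample inputs are arbitrary stand-ins used only to show that the signature length is independent of the input length.

```python
import hashlib

def signature(opcode_bytes, n=32, mod=4294967311):
    """Toy fixed-length sketch: hash every 3-gram of the opcode stream,
    then keep the minimum under each of n affine hash functions
    (coefficients here are arbitrary, not the paper's)."""
    grams = {opcode_bytes[i:i + 3] for i in range(len(opcode_bytes) - 2)}
    hashes = [int.from_bytes(hashlib.md5(g).digest()[:4], "big") for g in grams]
    return [min((a * h + 7) % mod for h in hashes) for a in range(1, n + 1)]

short_sample = bytes(range(50))        # a "short" opcode sequence
long_sample = bytes(200 * [7, 8, 9])   # a much longer one
sig_short = signature(short_sample)
sig_long = signature(long_sample)
```

Both signatures have exactly 32 entries regardless of input length, which is what allows every sample to be mapped onto a feature image of one fixed size later in the pipeline.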


Fig. 3. Illustration of (a) LSTM and (b) GRU. (a) i, f, and o are the input, forget, and output gates, respectively; c and c~ denote the memory cell and the new memory cell content. (b) r and z are the reset and update gates, and h and h~ are the activation and the candidate activation [44].

Fig. 4. Using sliding windows to train the RNN and generate predictive codes. Each time, we use the opcodes in the rectangle but not in the circle to predict the opcode in the circle. Finally, we obtain a predictive code sequence.

Yuan et al. [12] proposed an ML-based method and were the first to use neural networks; however, they only used traditional neural networks, and the only role of the neural network was to replace traditional classifiers. Razvan et al. [13] and Shun Tobiyama et al. [14] used the more complex RNN, but they extracted information from the hidden layers, which may be too abstract to interpret. Furthermore, the way they used the RNN may lose useful information: when generating feature images, their method [14] effectively stretches and transforms large images to force the information into the same dimension, and for large images only limited useful information is kept by this simple approach.

For these reasons, we use locality-sensitive hashing, which solves the problem of feature images having different sizes. It can also extract locality-sensitive information from opcodes and make the same local information appear as the same visual information in feature images. Compared with a method that uses a single source of information, adding an RNN to process the information can improve the classification effect. Finally, we send the feature images into a CNN for training and classification. The CNN can discover the local features of the code blocks: through the trained shallow convolution kernel features, it is possible to analyze the local features of different malware samples and infer which features reflect the maliciousness of the code. Using a CNN to extract local image features is, so far, a good choice for feature images generated by different visualization methods.

3.2 Extracting Opcodes
Using a disassembler, we can obtain disassembly codes. For simplicity, we only consider opcodes. There are many opcodes, so we only consider the 255 types that are used frequently, and the rest are classified as the 256th. The experimental data is obtained from the 2015 Kaggle Microsoft Malware Classification Challenge: classify malware into families based on file content and characteristics [43]. For instance, among the data we have collected, there are 735 types of opcode; the 255 frequently occurring types account for 99.98 percent of the total, and the remaining 480 types account for only 0.02 percent.

3.3 Training the RNN
After the malicious code is processed, it contains only 256 kinds of opcodes. We cannot directly feed the opcodes into a neural network. We use an integer from 0 to 255 to represent each opcode, which can thus take one of 256 (= 2^8) possible values. We convert opcodes into 1-hot vectors: for 256 types of opcode, the encoding length is 256. The 1-hot vectors increase the differences between pieces of opcode in the RNN. Each 1-hot encoding contains only one 1, and the other positions in the vector are 0; the integer number of the opcode determines the location of the 1. For example, the 1-hot code of operation code 56 is as follows:

$$[\,\underbrace{0, 0, \ldots, 0}_{55},\ 1,\ \underbrace{0, 0, \ldots, 0}_{200}\,] \quad (1)$$

We construct a bidirectional RNN with one input layer, three hidden layers (each with 386 GRUs), and one softmax output layer.

Considering the problems of vanishing and exploding gradients in RNN training, we do not use LSTM (Fig. 3a) but GRU (Fig. 3b). Compared with LSTM, GRU contains fewer training parameters, and each iteration requires less time and space. Using either LSTM or GRU is better than using the traditional tanh unit; however, it is difficult to definitively conclude which of the two is better [44]. We discuss the structure of the RNN in the experimental part. After weighing computing speed against accuracy, we adopted GRU.

When training the BRNN, as shown in Fig. 4, we set a sliding window with a length of K. The hyperparameter K determines the learning effect of the BRNN. K can be neither too small nor too big: if K is too small, the information contained in the feature may not be enough to make a correct prediction; on the other hand, if K is too large, the pressure of learning long-term dependencies in the BRNN increases, which makes training more difficult and less accurate. Unlike traditional RNNs, BRNNs are based on the idea that the prediction depends not only on the previous input but on the whole input sequence. For example, to predict a missing word in a sequence, you should look at both the left and the right contexts. The RNN predicts the Mth opcode in each sliding window. In other words, each time, we use the former (M − 1) opcodes and the latter (K − M) opcodes in the window to predict the Mth opcode. The parameter M determines how much information before and after the target influences the prediction. If the prediction depends on less previous information, we can set a smaller M; on the contrary, we can set a larger M if the prediction requires less future information. In summary, the input and output specification of the RNN is given in Eqs. (2) and (3).
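As a concrete complement to the sliding-window description and to Eqs. (2) and (3), the sample construction can be sketched as below. The values of K and M and the toy opcode list are illustrative, not the paper's settings.

```python
def one_hot(op, size=256):
    """1-hot vector with a single 1 at index `op` (cf. Eq. (1))."""
    v = [0] * size
    v[op] = 1
    return v

def window_samples(opcodes, k=5, m=3):
    """Slide a length-K window one opcode at a time. The input is the
    window with its M-th opcode removed (M is 1-indexed, as in the paper);
    the target is that removed opcode."""
    samples = []
    for i in range(len(opcodes) - k + 1):
        w = opcodes[i:i + k]
        context = w[:m - 1] + w[m:]            # (M-1) former and (K-M) latter opcodes
        x = [one_hot(op) for op in context]    # columns of the input matrix I, Eq. (2)
        y = one_hot(w[m - 1])                  # target vector O, Eq. (3)
        samples.append((x, y))
    return samples

samples = window_samples([1, 2, 3, 4, 5, 6, 7], k=5, m=3)
```

Adjacent windows differ by exactly one opcode, so a sequence of length L yields L − K + 1 training samples.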


$$I = [\,OP_1\ OP_2\ \cdots\ OP_{m-1}\ OP_{m+1}\ \cdots\ OP_K\,] =
\begin{bmatrix}
op_{1,1} & op_{2,1} & \cdots & op_{m-1,1} & op_{m+1,1} & \cdots & op_{K,1} \\
op_{1,2} & op_{2,2} & \cdots & op_{m-1,2} & op_{m+1,2} & \cdots & op_{K,2} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
op_{1,256} & op_{2,256} & \cdots & op_{m-1,256} & op_{m+1,256} & \cdots & op_{K,256}
\end{bmatrix} \quad (2)$$

$$O = OP_m = \begin{bmatrix} op_{m,1} \\ op_{m,2} \\ \vdots \\ op_{m,256} \end{bmatrix} \quad (3)$$

The advantage of the RNN lies in its ability to process sequence information. We use the BRNN to extract information from sliding windows of the same size during training and predict the Mth opcode in each sliding window, as shown in Fig. 4. After a sliding window is processed, the window moves down the malicious code by a distance of one opcode, and we continue to predict the Mth opcode in the next sliding window. There is only one opcode of difference between adjacent sliding windows, and the RNN predicts every window. For millions of sliding windows, more information is needed if accurate predictions are to be made. The RNN will learn common information faster and learn information unique to each piece of malicious code more slowly. Paying too much attention to unique information leads to overfitting and does not help find the common features of malicious code. Therefore, it is not necessary to pursue excessive accuracy at this stage.

Our idea is based first on the fact that malicious code in each malware family has similar characteristics that cannot be found in other families. The results of a malware disassembly are code blocks, which can be thought of as local features. The interior of each code block is serialized with timing characteristics. Using an RNN helps to provide more representative timing features.

Fig. 5. Using sliding windows to train the RNN and generate predictive codes. The prediction result for a sliding window will be the most common opcode of similar windows.

As shown in Fig. 5, similar characteristics within a malicious code family are learned by the RNN. The RNN predicts the opcode in the circle by using the opcodes in the sliding window (without the opcodes in the circles). Neural networks have strong generalization capabilities: even if the sequences of opcodes within sliding windows are slightly different, the RNN will still learn similar sequence features very quickly. In the backpropagation phase, the parameters are updated in the direction opposite to the gradient of the loss function; the gradient direction is the direction in which the loss function changes the most. In order to make the loss function fall as quickly as possible, the prediction result for a sliding window will be the most common opcode of similar windows. Therefore, even if a few parts of the original malicious code sequences differ, the RNN will make the malicious code samples in the same family more similar by generating predictive sequences.

Combining the original sequence with the information predicted by the RNN embodies the idea of information fusion. The original sequence reflects the unique characteristics of each piece of malicious code, and the RNN predictive sequence reflects the common features within the malware family. The combination produces a more accurate and comprehensive judgment than a single source of information. Because it uses an RNN, the method can extract knowledge from malware without being given category labels. Using the RNN improves the classification results in all our experiments, as shown in the fourth part of the paper.

3.4 Minhash
Minhash, proposed by Andrei Broder [16], is a locality-sensitive hash that can be used to quickly estimate the similarity of two sets. Initially, it was used to detect duplicate pages in search engines [45]. It can also be applied to large-scale clustering problems [16].

Minhash uses the idea of the Jaccard index to compute the similarity between two sets. For sets A and B, the Jaccard index is the ratio between the cardinality of their intersection and the cardinality of their union. The similarity between A and B is given by Eq. (4):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \quad (4)$$

We define h(x) as a hash function that maps x to an integer, and h_min(S) as the smallest hash value after all the elements in the set S are mapped to integers by h(x). If h(x) is a good hash function, it maps different elements to different integers. For two different sets S1 and S2, the condition for h_min(S1) = h_min(S2) is that the minimum element of S1 ∪ S2 is also in S1 ∩ S2.

Minhash can use multiple hash functions to repeat the above operation. For example, we can select k hash functions

$$h_1(x),\ h_2(x),\ \ldots,\ h_{k-1}(x),\ h_k(x) \quad (5)$$

and then use these hash functions to perform the h_min(S) operation on sets S1 and S2, respectively:

$$MIN_{S_1} = \{\, h_1^{\min}(S_1),\ h_2^{\min}(S_1),\ \ldots,\ h_{k-1}^{\min}(S_1),\ h_k^{\min}(S_1) \,\} \quad (6)$$

$$MIN_{S_2} = \{\, h_1^{\min}(S_2),\ h_2^{\min}(S_2),\ \ldots,\ h_{k-1}^{\min}(S_2),\ h_k^{\min}(S_2) \,\} \quad (7)$$

We thus obtain the sets MIN_{S1} and MIN_{S2}. The similarity between sets S1 and S2 is then

$$J(S_1, S_2) = J(MIN_{S_1}, MIN_{S_2}) = \frac{|MIN_{S_1} \cap MIN_{S_2}|}{|MIN_{S_1} \cup MIN_{S_2}|} \quad (8)$$

In our method, as shown in Algorithm 1, every malicious code sample sample_i forms a set S_i.
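The estimate of Eqs. (4)-(8) can be exercised on toy sets. The sketch below uses the standard form of the minhash estimator, the fraction of hash functions whose minima agree, which is closely related to the set comparison written in Eq. (8); the seed, coefficient ranges, and the two sets are illustrative choices, not values from the paper.

```python
import random

def jaccard(a, b):
    """Exact Jaccard index of two sets, Eq. (4)."""
    return len(a & b) / len(a | b)

def min_signature(s, coeffs, mod=4294967311):
    """One minimum per affine hash function h_i(x) = (c1*x + c2) % mod,
    cf. Eqs. (5)-(7) and Eq. (9)."""
    return [min((c1 * x + c2) % mod for x in s) for (c1, c2) in coeffs]

random.seed(42)  # fixed seed, so the same hash family is used every run
coeffs = [(random.randrange(1, 2**32), random.randrange(0, 2**32))
          for _ in range(200)]

s1, s2 = set(range(0, 80)), set(range(40, 120))
m1 = min_signature(s1, coeffs)
m2 = min_signature(s2, coeffs)
estimate = sum(a == b for a, b in zip(m1, m2)) / len(coeffs)
exact = jaccard(s1, s2)   # 40 shared elements out of 120 -> 1/3
```

With 200 hash functions, the estimate typically lands within a few percentage points of the exact Jaccard index, without ever intersecting the original sets directly.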


We use the 3 consecutive opcodes at each position of a malicious code sample to make up an element, and form a set containing such elements. If only a single opcode were used for minhash, the result would be too simple, and minhash could not reflect the characteristics of a particular piece of malware, because the differences among malware families would be too small; if a long opcode sequence were used, the features extracted from different malware samples would overfit each sample, and the universality of the malware family could not be seen. A 3-gram has only 256 × 256 × 256 different sequences at most, and this length is safe for malicious code.

Algorithm 1. Mapping opcodes to a set
Input: MCode, a list consisting of the sequence of opcodes in a malware sample
Output: set St
1: St is an empty set;
2: CRC32(x) is a function to get the 32-bit cyclic redundancy code of x;
3: i = 0;
4: while i ≤ (MCodeLen − 3) do
5:   element = MCode[i] + MCode[i + 1] + MCode[i + 2];
6:   crc = CRC32(element) & 0xFFFFFFFF;
7:   add crc to St;
8:   i = i + 1;
9: end while

We applied minhash, as shown in Algorithm 2, to every set S_i and obtained the result MIN_{S_i}. The hash functions have the format of Eq. (9):

$$h_i(x) = (coe_{i,1} \cdot x + coe_{i,2}) \bmod mod \quad (9)$$

where mod is a very large number, for example 4294967311. coe_{i,1} and coe_{i,2} are coefficients generated by random number generators. Since we use random number generators to produce the coefficients (coe_{i,1} and coe_{i,2}) of the hash functions, we must make sure that the random seeds are the same every time. Finally, we can use the results to evaluate the similarity among the samples and convert the results into feature images.

Fig. 8. Converting original code features and predictive code features into fusion features.

3.5 Generating Feature Images
The visualization method for malicious code has achieved good results in the field of static analysis. At this stage, the first problem is how to generate feature images of consistent size without losing the effective information. For malicious code from the same family, the effective information in the code is quite similar; therefore, we use minhash to handle each piece of malicious code. As mentioned above, minhash is a kind of locality-sensitive hash that hashes input items so that similar items map to the same result with high probability. Moreover, the length of the information obtained by the same minhash is consistent.

When we use the minhash, we get n hash values from each piece of malicious code. We set a very large prime number as an initial value for the minhash to best ensure that the n hash values are all greater than 256². Then, each hash value is reduced modulo 256 three times, and the three remainders are assigned to x, y, and z, respectively. We set the size of the feature image to 128 × 128. Although the range of x and y is [0, 255], we did not set the size of the picture to 256 × 256: if the size were too large, the effective information in the image would be too sparse, creating difficulties in the CNN training process. Hence, we reduced the length and width of the picture to half of the original size. The meaning of a hash value in the feature image is that the pixel gray value at the point (x/2, y/2) in the image is z. As shown in Figs. 6 and 7, we can map the n hash values into a feature image. Even if we do not use the BRNN predictive information, feature images from the same family are quite similar.

Fig. 6. Mapping hash values to a feature image.

Apart from the original code, the BRNN generates a new predictive sequence based on the malicious code. We then apply minhash to the predictive sequence; each predictive sequence is also able to generate n hash values. Finally, we map these hash values to the feature image using the proposed method.
As shown in Fig. 8, we fuse the information of the origi-
nal sequence and predictive sequence to make more accu-
rate predictions. The prediction information of RNN will
eliminate some noise in the original data and add informa-
tion that helps classification. At the same time, it will lose
some useful information. This problem also existed in previ-
ous research that used RNN. It is for this reason that we
Fig. 7. Feature images without BRNN predictive information, (a) and (b) map both the original code sequence and the predictive
belong to the same family. sequence into the same feature image. So even if some
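As a concrete illustration, the 3-gram hashing of Algorithm 1 can be sketched in Python. This is a minimal sketch, not the paper's implementation: the opcode list is a hypothetical toy input, and `zlib.crc32` stands in for the 32-bit CRC function.

```python
import zlib

def opcodes_to_set(opcodes):
    """Algorithm 1 sketch: hash every 3-gram of consecutive opcodes
    with CRC32 and collect the results in a set."""
    st = set()
    for i in range(len(opcodes) - 2):
        element = "".join(opcodes[i:i + 3]).encode()
        # 32-bit cyclic redundancy code of the 3-gram
        crc = zlib.crc32(element) & 0xFFFFFFFF
        st.add(crc)
    return st

# Two samples sharing most 3-grams yield largely overlapping sets.
a = opcodes_to_set(["push", "mov", "call", "pop", "ret"])
b = opcodes_to_set(["push", "mov", "call", "pop", "jmp"])
```

As the text argues, the set built from 3-grams changes only locally when a few opcodes are modified, which is what lets minhash recover family-level similarity later.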

Authorized licensed use limited to: Shanghai Maritime University. Downloaded on September 23,2021 at 11:57:51 UTC from IEEE Xplore. Restrictions apply.
SUN AND QIAN: DEEP LEARNING AND VISUALIZATION FOR IDENTIFYING MALWARE FAMILIES 289
useful information is lost in the RNN predictive sequence and pixels in some locations are missing, the original sequence can still generate useful feature information at those pixels. The probability that such faulty feature points happen to map onto most other malicious code families is also extremely low, so it won't cause misclassification.

Fig. 9 is an example using the BRNN. A feature image with BRNN prediction contains at most 2n pixels, including not only the visual information of the raw malicious code, but also the predictive sequence extracted by the BRNN.

Fig. 9. Feature images with BRNN predictive information; (a) and (b) belong to the same family.

Algorithm 2. Minhash: Mapping an Opcode Set to Signatures
Input: S and k. S is a set and k is the number of hash functions.
Output: signature
1: signature, coe_list1, coe_list2 are empty lists;
2: seed is a constant value used as a random seed;
3: initialize a random integer generator rand_int() with seed;
4: mod is a very large number;
5: i = 0;
6: while (i < k) do
7:   while Index = rand_int() and Index ∉ coe_list1 do
8:     add Index to coe_list1;
9:   end while
10:  while Index = rand_int() and Index ∉ coe_list2 do
11:    add Index to coe_list2;
12:  end while
13:  i = i + 1;
14: end while
15: i = 0;
16: while (i < k) do
17:   minhashcode = mod;
18:   j = 0;
19:   while (j < S_len) do
20:     hashcode = (coe_list1[i] × S[j] + coe_list2[i]) % mod;
21:     if (hashcode < minhashcode) then
22:       minhashcode = hashcode;
23:     end if
24:     j = j + 1;
25:   end while
26:   add minhashcode to signature;
27:   i = i + 1;
28: end while

After adding the information extracted from the BRNN to the feature image, the generalization ability of the prediction is improved. The BRNN is trained on malicious code without category labels and learns a lot of similar information. When generating a predictive sequence, the BRNN can associate the closest information learned during training with the current malicious code. Therefore, even if the attacker modifies some code, the RNN predictive sequence will still bring similar information and provide new features for the classification of the malicious code. Each feature image contains at most 2n information-carrying pixels, including n pixels generated by the raw opcodes and n pixels generated by the predictive sequence from the BRNN.

3.6 Training CNN
Each malware sample can generate a feature image. The feature images from the same malicious family are quite similar, and different malicious families have only a few features in common.

In the design of the CNN structure, we also tried more complex structures such as VGGNet. However, in the course of the experiments, we found that such a network rarely converged (it did not converge on our feature images). Since our feature images are scatter plots, we are able to extract enough information using only one convolutional layer between two pooling operations. In this way, we can reduce the difficulty of training and reduce the scale quickly. Two or more consecutive convolution layers would also make convergence on the feature maps more difficult.

We design a CNN with 5 convolutional layers, as shown in Fig. 10. The CNN structure contains one input layer, four convolutional pooling layers, one fully-connected layer, and one softmax output layer. Tanh is the activation function. Each pooling operation receives the last output of the previous convolutional layer, with pooling size = 2 × 2. In addition, when the number of malware features is too large, the CNN can significantly reduce the number of parameters, and thus the training difficulty, through sparse connectivity and weight sharing.

Owing to the mapping of feature points, the convergence is very fast. Once the features in different feature images are the same, the mapping locations must be the same. There is no displacement information like the features in image classification problems, and the number of feature points in an image is limited. Therefore, the classification is relatively easy and the convergence is very fast.

4 EXPERIMENTS AND RESULTS ANALYSIS
4.1 Experimental Dataset
The experimental data are from the 2015 Kaggle Microsoft Malware Classification Challenge, and the goal is to classify malware into families based on file content and characteristics [43].

We selected six kinds of malicious code, a total of 3557 samples, as shown in Table 1. We randomly divide the samples into 3 parts: dataset1 contains 1470 samples, and the bidirectional RNN (BRNN) is trained only on this dataset. Dataset2 has 2663 samples, inclusive of all the samples in dataset1; the BRNN has not seen the remaining samples in this set. Dataset3 contains 894 samples, which have neither been studied by the BRNN nor been included in dataset2. The proportion of each malicious family is not the same in each dataset. If the proportions were the same, families with more samples would have a greater probability of being correctly classified; that is to say, correct classification results might not be caused by the quality of the method.

In addition, we extracted 49 samples from dataset1 to make up dataset(49). These samples account for only a small
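Putting Algorithm 2 and the pixel mapping of Section 3.5 together, a minimal Python sketch follows. The modulus `MOD`, the number of hash functions `N_HASHES`, and the toy input set are illustrative assumptions, not the paper's exact values (the paper uses 1024 hash functions per sequence).

```python
import random

MOD = 4294967311   # a large prime just above 2**32 (assumed choice)
N_HASHES = 64      # number of hash functions (the paper uses 1024)

def make_hash_coeffs(k, seed=42):
    """Draw coefficient pairs for h_i(x) = (coe1 * x + coe2) % MOD, Eq. (9)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MOD), rng.randrange(0, MOD)) for _ in range(k)]

def minhash_signature(s, coeffs):
    """Algorithm 2 sketch: one minimum per hash function over the set."""
    return [min((c1 * x + c2) % MOD for x in s) for (c1, c2) in coeffs]

def signature_to_pixels(signature):
    """Section 3.5 sketch: take base-256 remainders x, y, z of each hash
    value (the remainder order is an assumption here) and set gray value z
    at point (x//2, y//2) of a 128x128 image, stored sparsely as a dict."""
    image = {}
    for h in signature:
        x, y, z = h % 256, (h // 256) % 256, (h // 65536) % 256
        image[(x // 2, y // 2)] = z
    return image

coeffs = make_hash_coeffs(N_HASHES)
sig = minhash_signature({101, 202, 303, 404}, coeffs)
img = signature_to_pixels(sig)
```

Because the coefficients are drawn from a seeded generator, the same opcode set always yields the same signature, which is what makes feature images comparable across samples.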
290 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 18, NO. 1, JANUARY/FEBRUARY 2021
fraction of the total samples, so as to evaluate the generalization ability of the proposed method with very few samples. These samples cover all the malicious families in the test; some families have more samples (10 samples), and some families have only 3. This split raises the difficulty of classification. The detailed distribution of each malware family is described in Table 1. Dataset(300) and dataset(600) are also made up of samples from dataset1.

Fig. 10. Structure of the designed CNN for feature image training.

In general, the purposes of dataset1 are as follows:

1) The samples in dataset1 are learned without labels by the BRNN.
2) In the very-few-training-samples test, it serves as the testing data to measure the classification performance on samples that have been learned by the BRNN.
3) It is used as training data, and the trained model is then used to test dataset3.

Dataset2 has only one function: it is used as the training set, and the trained model is then used to test dataset3.

Dataset3 has 2 functions:

1) In the very-few-samples training test, it serves as the test data to measure the classification performance on samples that have not been learned by the BRNN.
2) It is used as testing data when dataset1 and dataset2 are used for training.

4.2 Implementation
We implemented our models in Python using keras-2.0.8 with tensorflow-1.0.1 as the backend. The hardware configuration of the experiment platform is: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40 GHz × 2, Tesla P100 16 GB × 2, 128 GB RAM.

TABLE 1
Malware Datasets for the Experiment

Datasets       Family 1   Family 2   Family 3   Family 4   Family 5   Family 6
dataset1          299        377        104        196        270        224
dataset2          600        377        194        304        728        460
dataset3          360         69        108         80        183         94
dataset(49)         9         10          3         10          9          8
dataset(300)       50         50         50         50         50         50
dataset(600)      100        100        100        100        100        100

4.3 Training RNN
After trying different network structures, we found that LSTM and GRU behave differently at different depths. As shown in Table 2, it is best to use a two-layer LSTM structure or a three-layer GRU structure. Since the GRU structure is faster to compute, we designed a 3-layer GRU RNN. We also compared 1-hot encoding with word embedding [46]. Word embedding is indeed a very effective and interesting idea that seems to provide more information than 1-hot, so we ran a comparative test between the two. However, the experimental results show that, for malicious code in particular, word embedding is not superior to 1-hot: the convergence is slightly slower, and the accuracy is almost the same. Compared to natural language, the number of distinct opcodes in malicious code is obviously much smaller, and there are also clear differences between methods working on opcodes and those working on natural language. In our opinion, the parameters between the 1-hot input layer and the first hidden layer can themselves be seen as a form of word embedding: opcodes in the same window receive similar updates, these updates accumulate, and opcodes with similar patterns accumulate these similar updates to a considerable degree.

Using the RNN sliding window, we need to consider the values K and M. As shown in Fig. 11, when the size of the sliding window is fixed (K is fixed), the RNN learns the characteristics of malicious families better when M is near K/2. Therefore, the opcode to be predicted should be set near the center of the sliding window. As for the size of the sliding window (K), a smaller K brings faster training; however, an appropriate increase in the window size can fit the data better.

At last, when training the BRNN, we select dataset1 to learn the malicious code without category labels. The BRNN contains 3 hidden layers, and there are 384 GRUs in each hidden layer. We set the size of the sliding window to K = 14 and M = 9. After all the samples in dataset1 are processed into the BRNN, we obtain 4,276,033 sliding windows in total. Different from traditional RNNs, BRNNs are based on the idea that the prediction depends not only on the previous input but on the entire input sequence. Intuitively, previous information in the sequence is more important than future information. While keeping the balance between previous and future information as much as possible, previous information is weighted slightly more than future information. Therefore, each time, we use the first 8 opcodes and the last 5 opcodes to predict the 9th opcode in the
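The sliding-window construction just described (K = 14, M = 9: 8 opcodes of past context and 5 of future context predicting the 9th opcode) can be sketched as follows; `seq` is a hypothetical opcode sequence, not real malware data.

```python
def make_windows(seq, k=14, m=9):
    """For each window of K opcodes, use the first m-1 and last k-m
    opcodes as bidirectional context and the m-th opcode as the target."""
    samples = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        context = window[:m - 1] + window[m:]   # 8 past + 5 future opcodes
        target = window[m - 1]                  # the 9th opcode in the window
        samples.append((context, target))
    return samples

# Toy opcode sequence of 20 symbols.
seq = [f"op{i}" for i in range(20)]
samples = make_windows(seq)
```

Every position that admits a full 14-opcode window yields one training pair, which is how a dataset of 1470 samples expands into millions of windows.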
TABLE 2
RNN in Different Structures

encoding       LSTM/GRU  layers  speed      parameters    accuracy (epoch 5/10/15/20)     loss (epoch 5/10/15/20)
onehot         LSTM      2       63us/step   9,250,048    0.761 / 0.800 / 0.847 / 0.868   0.745 / 0.604 / 0.482 / 0.381
onehot         LSTM      3       68us/step  12,792,064    0.759 / 0.792 / 0.829 / 0.860   0.754 / 0.614 / 0.501 / 0.401
onehot         GRU       2       53us/step   6,986,752    0.760 / 0.798 / 0.831 / 0.860   0.756 / 0.613 / 0.503 / 0.411
onehot         GRU       3       57us/step   9,643,264    0.763 / 0.803 / 0.837 / 0.866   0.741 / 0.595 / 0.480 / 0.386
embedding-64   GRU       3       42us/step   9,200,896    0.736 / 0.775 / 0.800 / 0.822   0.849 / 0.696 / 0.603 / 0.525
embedding-128  GRU       3       44us/step   9,348,352    0.748 / 0.785 / 0.811 / 0.836   0.799 / 0.659 / 0.563 / 0.481
window. During training, an increasing batch size is used to make the BRNN converge to a better result. At the beginning of the training stage, we use a small batch (768) to jump out of local optima. When the accuracy converges to a good result at the current batch size, we increase the batch size gradually (1024, 2048, 3072, 4096) to accelerate convergence. The process is shown in Fig. 12. Finally, we save the trained model. Initially, we tried a 4-layer unidirectional RNN for training, but the final accuracy was only 0.6549. After switching to a bidirectional RNN, the final accuracy increased to 0.8697, as shown in Fig. 12.

Fig. 11. K and M of the RNN sliding window.

Fig. 12. BRNN training process. Gradually increasing the batch size (768, 1024, 2048, 3072, 4096) accelerates the convergence.

We also tried training the RNN with training sets of different sizes, as shown in Fig. 13. As the size of the RNN training dataset increases, the classification accuracy also increases. The experiment supports our view that using RNN predictive information can indeed increase the accuracy of prediction compared to not using it; this technique for unsupervised learning works. Malicious code in the same malware family has similar characteristics that cannot be found in other families, and a larger training dataset makes it easier to find the characteristics of the same malicious family. Once the RNN finds common characteristics of a family, it makes the malicious code in that family more similar by generating a predictive sequence. RNN predictive information can thus reduce the dependence on category labels to some degree.

Fig. 13. Training the RNN with training sets of different sizes.

4.4 Feature Images
Before training the CNN, we observe the feature images that incorporate the BRNN predictions, as shown in Fig. 14. Each feature image consists of discrete pixels. In order to compare different kinds of feature images, we use green pixels to mark identical pixel information and red pixels to mark differing information, as shown in Fig. 15. This means that the more green pixels in the image, the higher the similarity. After visualization, not only do the green pixels matter; the red points also help to eliminate contradictory information and finally give a clear result. The comparison results also validate our hypothesis that malwares of the same kind have a lot in common. These commonalities make us more confident that learning the features of
feature images by training a CNN is effective for distinguishing different malware families.

Fig. 14. Feature images generated by Method 3. (a) and (d) belong to family1, (b) and (e) belong to family2, (c) and (f) belong to family3.

Fig. 15. Similarity comparison among different malicious families; the more green pixels, the higher the similarity.

4.5 Verification Methods
During the experiments, we select feature images without RNN prediction as well as feature images with RNN prediction. The size of the feature image is 128 × 128. We design 3 methods to evaluate our approach:

1) Method 1 (MVC): Apply minhash to each piece of malware to generate 1024 hash values, without RNN predictive information. The hash values are then mapped to a feature image.
2) Method 2 (MVC): Apply minhash to each piece of malware to generate 2048 hash values, without RNN predictive information. The hash values are then mapped to a feature image.
3) Method 3 (RMVC): Apply minhash to each malware to generate 1024 hash values, and also apply minhash to the RNN predictive information to generate another 1024 hash values. The two types of information are mapped to one feature image.

In addition, we compare these results with those of a previous study using a different method, Method NJ:

1) Method NJ: The method proposed by Nataraj et al. [26], wherein malware binaries are visualized as gray-scale images, GIST is used to compute texture features, and k-nearest neighbors is used for classification.

4.6 Small Training Dataset Test
First of all, we choose very few samples (49 samples) from the dataset and predict the whole dataset1 (1470 samples). When training the CNN with Method 1, the convergence of both training and validation is very fast. As for Methods 2 and 3, since more information is included, the convergence is slower. However, as far as accuracy is concerned, Method 1 reaches only 85.92 percent, Method 2 reaches 86.02 percent, and Method 3 reaches 92.18 percent (Figs. 16a and 16b). As for the existing approach, Method NJ reaches only 77.95 percent.

The improvement in accuracy is not caused by the greater quantity of hash values in Method 3. The number of hash values in Method 2 is also 2048, but its results are no better than Method 1's. The results show that adding BRNN prediction information does lead to the desired results. Although the training dataset is small, the predictive sequence generated by the BRNN can associate the closest information learned during training with the current malicious code. Therefore, even if the attacker modifies some opcodes, we can still unearth similar information in the RNN prediction stage, which provides new features for malware classification and improves the generalization ability of the learning model. Obviously, Method NJ cannot match our method.

We continued using the tiny dataset (dataset(49)) to predict dataset3 (894 samples). Note that the BRNN had not been trained on any samples in dataset3. Here, we only compare Methods 1 and 3. The accuracy of Method 1 is only 88.70 percent, and that of Method 3 is 92.06 percent. The confusion matrices for training on dataset(49) to predict dataset3 with and without the BRNN are shown in Table 3. Even for samples that the BRNN has not studied, using the BRNN is still better. As for Method NJ, the accuracy is 81.88 percent and its confusion matrix is shown in Table 3.
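The green/red comparison of Fig. 15 essentially splits the pixels of two feature images into a shared part and a differing part. A minimal sketch over sparse pixel dictionaries of the form produced in Section 3.5 (the two toy images below are hypothetical, not real feature images):

```python
def compare_feature_images(img_a, img_b):
    """Split pixel positions into 'green' (same position and same gray value
    in both images) and 'red' (present in only one image, or valued
    differently); return the green fraction as a rough similarity score."""
    positions = set(img_a) | set(img_b)
    green = {p for p in positions if img_a.get(p) == img_b.get(p)}
    red = positions - green
    return len(green) / len(positions), green, red

img_a = {(1, 1): 10, (2, 2): 20, (3, 3): 30, (4, 4): 40}
img_b = {(1, 1): 10, (2, 2): 20, (3, 3): 99, (5, 5): 50}
score, green, red = compare_feature_images(img_a, img_b)
```

The more positions fall in the green set, the more similar the two samples are, mirroring the visual comparison in Fig. 15.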
4.7 Large Training Dataset Test
To increase the size of the CNN training set, dataset1 (1470 samples) was used to predict dataset3 (894 samples). The validation accuracy of Method 1 was 97.65 percent, while that of Method 3 increased to 98.65 percent; the detailed results are shown in Table 4. Method NJ reached 95.41 percent. Although with this split both Method 1 and Method 3 have seen all the information that the BRNN mastered without classification labels, Method 3 is still better than Method 1. This means that the preprocessing of the testing dataset3 by the BRNN is important.

Continuing to increase the size of the training set, we used dataset2 (2663 samples) to predict dataset3. Finally, the accuracy of Method 1 was 99.11 percent, and the accuracy of Method 3 increased to 99.55 percent (Table 5); the worst false positive rate over all the malware families is 0.0147 and the average false positive rate is 0.0058. In comparison, the accuracy of Method NJ is 97.20 percent and its average false positive rate is 0.0448. With or without an RNN, this visualization method is very effective for classifying different kinds of malicious code; by using an RNN, one can further improve the generalization ability of the classifier.

The difference between using and not using RNN prediction decreases as the size of the training set increases. We chose several different training sets to do more experiments, using dataset1 (1470 samples) to train the RNN. As shown in Fig. 17, when we used feature images for malicious family classification, if we chose the feature images of the samples in dataset(49), dataset(300) and dataset(600) as the training set (these training sets are subsets of the RNN training dataset), the RNN could provide extra knowledge beyond these training sets. On the other hand, when we selected large training sets like dataset1 (1470 samples) and dataset2 (2663 samples), the knowledge learned by the RNN was already present in these datasets, so the RNN could provide only limited extra information and the difference was much smaller. This set of comparative tests illustrates two points: 1) when there is a large amount of unlabeled data, the effect of the RNN is very obvious; 2) even if all the unlabeled data is labeled, the knowledge learned from the dataset alone is still less than the knowledge provided by the fusion of the original sequence and the RNN's predictive sequence.

Fig. 16. When using training dataset(49) to predict dataset1: the training accuracy and validation accuracy at different epochs.
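The per-family false positive rates quoted above follow from a confusion matrix as FP / (FP + TN), where for family j the false positives are the off-diagonal entries of column j and the true negatives are everything outside row j and column j. A minimal sketch with a hypothetical 3-family matrix (rows = true family, columns = predicted family):

```python
def false_positive_rates(cm):
    """Per-class false positive rate FP / (FP + TN) from a square
    confusion matrix given as a list of row lists."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    rates = []
    for j in range(n):
        col = sum(cm[i][j] for i in range(n))
        fp = col - cm[j][j]                       # wrongly predicted as class j
        tn = total - sum(cm[j]) - col + cm[j][j]  # outside row j and column j
        rates.append(fp / (fp + tn))
    return rates

# Hypothetical confusion matrix for three families, 100 samples each.
cm = [[90, 5, 5],
      [2, 96, 2],
      [0, 4, 96]]
rates = false_positive_rates(cm)
```

The worst and average rates reported in the paper are then simply `max(rates)` and `sum(rates) / len(rates)` over the six families.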
TABLE 3
Confusion Matrix of Training Dataset(49) to Predict Dataset3

TABLE 4
Confusion Matrix of Training Dataset1 to Predict Dataset3

TABLE 5
Confusion Matrix of Training Dataset2 to Predict Dataset3

Fig. 17. The performance of different methods under various scales of training sets.

5 CONCLUSION
In this paper, we propose the RMVC method to analyze malicious code statically using visual images, applying two deep learning techniques, RNN and CNN. The RNN is applied to associate the current data with similar information and to improve the anti-interference ability of the analysis; the CNN is applied to classify feature images. This method provides both high accuracy and good generalization. Because of the RNN, RMVC learns more about malware classification without being given category labels, and using the RNN improves the results in all experiments. Even if a small training dataset is used, the accuracy of this method still exceeds 92 percent, improving on the traditional method by more than 10 percent. If the training dataset size is increased, the accuracy can reach 99.5 percent. As shown in the confusion matrix, RMVC obtains an almost diagonal matrix: the worst false positive rate over all the malware families is 0.0147 and the average false positive rate is 0.0058.

In the future, we will continue to collect samples and apply the method to larger datasets, and we will also verify the effectiveness of our method in dynamic analysis. RNNs are among the most powerful artificial neural networks and a key direction for artificial intelligence. We will try to integrate the idea behind generative adversarial nets [47] into our method to promote the ability of unsupervised learning; Apple also used this model in its first artificial intelligence paper [48]. Like these efforts, we are trying to avoid the need for expensive annotations, and both the RNN and the CNN components offer room for further improvement.

ACKNOWLEDGMENTS
This work is partially sponsored by the National Key Research and Development Program of China (2018YFB0704400, 2016YFB0700504), the Shanghai Municipal Science and Technology Commission (15DZ2260301), and the Natural Science Foundation of Shanghai (16ZR1411200). The authors gratefully appreciate the anonymous reviewers for their valuable comments.

REFERENCES
[1] Symantec, "2016 internet security threat report highlights," 2016. [Online]. Available: https://www.symantec.com/security-center/threat-report
[2] B. Brenner, "WannaCry: The ransomware worm that didn't arrive on a phishing hook," May 18, 2017. [Online]. Available: https://nakedsecurity.sophos.com/2017/05/17/wannacry-the-ransomware-worm-that-didnt-arrive-on-a-phishing-hook/
[3] BBC News, "Cyber-attack: Europol says it was unprecedented in scale," May 13, 2017. [Online]. Available: http://www.bbc.com/news/world-europe-39907965
[4] A. Liptak, "The WannaCry ransomware attack has spread to 150 countries," May 14, 2017. [Online]. Available: https://www.theverge.com/2017/5/14/15637888/authorities-wannacry-ransomware-attack-spread-150-countries
[5] E. Gandotra, D. Bansal, and S. Sofat, "Malware analysis and classification: A survey," J. Inf. Secur., vol. 5, no. 3, p. 56, 2014.
[6] A. Moser, E. Kirda, and C. Kruegel, "Limits of static analysis for malware detection," in Proc. Annu. Comput. Secur. Appl. Conf., Dec. 2007, pp. 421–430.
[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Model., vol. 5, no. 3, 1988, Art. no. 1.
[8] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. 11th Annu. Conf. Int. Speech Commun. Assoc., Sep. 2010, pp. 1045–1048.
[9] A. Graves, "Generating sequences with recurrent neural networks," arXiv:1308.0850, 2013.
[10] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., May 2013, pp. 6645–6649.
[11] Y. LeCun, “Generalization and network design strategies,” [36] Y. Bengio, P. Frasconi, and P. Simard, “The problem of learning
Connectionism Perspective, R. Pfeifer, Z. Schreter, F. Fogelman, and long-term dependencies in recurrent networks,” in Proc. IEEE Int.
L. Steels, Eds., Elsevier, 1989. Conf. Neural Netw., Mar. 1993, pp. 1183–1188.
[12] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue, “Droid-sec: deep learning in [37] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen
android malware detection,” ACM SIGCOMM Comput. Commun. netzen,” Diploma, Technische Universit€at M€ unchen, M€
unchen,
Rev., vol. 44, no. 4, pp. 371–372, 2015. Germany, vol. 91, 1991.
[13] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and [38] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
A. Thomas, “Malware classification with recurrent networks,” in Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Apr 2015, [39] K. Cho, B. Van Merri€enboer, D. Bahdanau, and Y. Bengio, “On
pp. 1916–1920. the properties of neural machine translation: Encoder-decoder
[14] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, approaches,” http://arxiv.org/abs/1409.1259, 2014.
“Malware detection with deep neural network using process behav- [40] J. Huang, “Accelerating ai with gpus: A new computing model,”
ior,” in Proc. IEEE 40th Annu. Comput. Softw. Appl. Conf., Jun. 2016, Jan. 12, 2016. [Online]. Available: https://blogs.nvidia.com/
pp. 577–582. blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/
[15] W. Hu and Y. Tan, “Black-box attacks against rnn based malware [41] E. C. R. Shin, D. Song, and R. Moazzezi, “Recognizing functions in
detection algorithms,” in Proc. 32nd AAAI Conf. Artif. Intell., New binaries with neural networks,” in Proc. 24th USENIX Conf. Secur.
Orleans, USA, Feb. 2018, pp. 1–10. Symp., Aug. 2015, pp. 611–626.
[16] A. Z. Broder, “On the resemblance and containment of documents,” [42] T. Bao, J. Burket, M. Woo, R. Turner, and D. Brumley,
in Proc. Int. Conf. Compression Complexity Sequences, Jun. 1997, “Byteweight: Learning to recognize functions in binary code,” in
pp. 21–29. Proc. 23rd USENIX Conf. Secur. Symp., Aug. 2014, pp. 845–860.
[17] R. Tian, L. M. Batten, and S. Versteeg, “Function length as a tool [43] Microsoft, “Microsoft malware classification challenge
for malware classification,” in Proc. 3rd Int. Conf. Malicious (BIG 2015),” [Online]. Available: https://www.kaggle.com/c/
Unwanted Softw., Oct. 2008, pp. 69–76. malware-classification, Accessed on: 2015.
[18] M. Z. Shafiq, S. A. Khayam, and M. Farooq, “Embedded malware [44] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation
detection using markov n-grams,” in Proc. Int. Conf. Detection of gated recurrent neural networks on sequence modeling,” http://
Intrusions Malware Vulnerability Assessment, 2008, pp. 88–107. arxiv.org/abs/1412.3555, 2014.
[19] M. Schultz, E. Eskin, F. Zadok, and S. J. Stolfo, “Data mining [45] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher,
methods for detection of new malicious executables,” in Proc. “Min-wise independent permutations,” J. Comput. Syst. Sci., vol. 60,
IEEE Symp. Secur. Privacy, May 2001, pp. 38–49. no. 3, pp. 630–659, 2000.
[20] M. F. Zolkipli and A. Jantan, “An approach for malware behavior [46] Y. Gal and Z. Ghahramani, “A theoretically grounded application of
identification and classification,” in Proc. 3rd Int. Conf. Comput. dropout in recurrent neural networks,” in Proc. 30th Int. Conf. Neural
Res. Develop., Mar. 2011, pp. 191–194. Inf. Process. Syst., Barcelona, Spain, 2016, pp. 1027–1035.
[21] D. Kong and G. Yan, “Discriminant malware distance learning on [47] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
structural information for automated malware classification,” in S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, in Proc. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
Aug. 2013, pp. 1357–1365. [48] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and
[22] U. Bayer, TTAnalyze: A Tool for Analyzing Malware, Technical R. Webb, “Learning from simulated and unsupervised images
University of Vienna, Dec. 2005. through adversarial training,” http://arxiv.org/abs/1612.07828,
[23] C. Willems, T. Holz, and F. Freiling, “Toward automated dynamic 2016.
malware analysis using cwsandbox,” IEEE Secur. Privacy, vol. 5,
Guosong Sun received the bachelor's degree in computer science & technology from Huazhong University of Science and Technology, China. He is working toward the master's degree in the School of Computer Engineering & Science, Shanghai University, China. His research interests include cloud computing, big data analysis, and computer and network security, especially malware analysis.
Quan Qian received the PhD degree in computer science from the University of Science and Technology of China (USTC) in 2003 and conducted postdoctoral research at USTC from 2003 to 2005. He then joined Shanghai University, where he is the lab director of network and information security as well as the director of the center of materials data and informatics. He is a full professor with the School of Computer Engineering & Science, Shanghai University, China. His main research interests concern computer networks and network security, especially cloud computing, big data analysis, and wide-scale distributed network environments.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.
