A Malware Classification Method Based On Three-Channel Visualization and Deep Learning
Article info

Article history: Received 25 August 2022; Revised 17 November 2022; Accepted 27 December 2022

Keywords: Malware classification; Malware visualization; Markov transfer matrix; Deep learning; Convolutional neural network

Abstract

With the rapid increase in the number of malware, the detection and classification of malware have become more challenging. In recent years, many malware classification methods based on malware visualization and deep learning have been proposed. However, the malware images generated by these methods do not retain the semantic and statistical properties with a small and uniform size. This article gives definitions of extracted content and filling mode to characterize the critical factors for the malware visualization task and proposes a new malware visualization method based on assembly instructions and Markov transfer matrices to characterize malware. Thus, a malware classification method based on three-channel visualization and deep learning (MCTVD) is proposed. In MCTVD, its malware image has a small and uniform size, and its convolutional neural network has few convolutional and pooling layers. Experimental results show that MCTVD can achieve an accuracy of 99.44% on Microsoft’s public malware dataset under 10-fold cross-validation and thus could be a highly competitive candidate for malware classification.

© 2022 Elsevier Ltd. All rights reserved.
H. Deng, C. Guo, G. Shen et al. Computers & Security 126 (2023) 103084
https://doi.org/10.1016/j.cose.2022.103084
0167-4048/© 2022 Elsevier Ltd. All rights reserved.
∗ Corresponding author. E-mail address: [email protected] (C. Guo).

1. Introduction

Malware is any type of software that harms or exploits the normal operation of a system. With the rapid development of the internet and computer technologies, the amount of malware has increased year by year over the past decade. As reported by AV-TEST (AV-TEST), the total number of malware cases was 1218.68 million as of 2022. Malware classification is a necessary task for malware analysis. It distinguishes different malware families to better understand the capabilities of malware variants from the same family and thus can reduce the work of security analysts and facilitate their research on new malware or malware variants (Gibert et al., 2020). Unfortunately, classifying malware efficiently and accurately is a challenging task.

To improve the efficiency of classifying malware, traditional machine learning methods, such as decision tree (DT), naive Bayes (NB), support vector machine (SVM), and k-nearest neighbor (KNN), have been widely used in a variety of malware classification methods (Pachhala et al., 2021). However, such methods are limited by complex feature engineering and by difficulty in processing large amounts of data, which leaves them unsatisfactory. Deep learning methods can overcome these shortcomings of traditional machine learning. A convolutional neural network (CNN) is a deep learning approach that has proven very effective in tackling problems such as image recognition and classification (Basha et al., 2020). Correspondingly, deep learning was also introduced in malware detection and classification (Kargarnovin et al., 2022; Li et al., 2022; Wang et al., 2019b; Yadav and Tokekar, 2021), and some malware classification approaches based on image visualization and deep learning were proposed in recent years. This type of method turns malware classification into an image classification problem. Its feasibility lies in the fact that when different malware samples from the same family are converted into images, they appear to be similar in texture and layout (Verma et al., 2020).

The generated malware image is the core of this type of malware classification method. However, how to generate a high-quality image from a malware file is an issue that has not been deeply discussed. The most widely used malware visualization method uses malware binaries directly as input, converting every 8 bits to one pixel to generate a grayscale image. This requires compression or truncation to keep the image size uniform when training with CNNs, so effective information in the original binary file is inevitably lost during the conversion. A few malware visualization methods use opcode n-grams extracted from Windows Portable Executable (PE) files as pixels in generated malware images. Such methods generally use only the frequency of the unique
combination of n consecutive opcodes to visualize malware, ignoring the role of the information about the quality of opcodes and operands in assembly instructions. Therefore, this article aims to explore the quality of malware images in terms of “extracted content” and “filling mode” defined in Section 3 and proposes a malware visualization method to generate a high-quality malware image.

To achieve this goal, we analyze the critical factors in generating an image from malware and use the information extracted from the assembly instructions (containing opcodes and operands) of a PE file to generate a three-channel image. A malware classification method based on three-channel visualization and deep learning (MCTVD) is then designed. MCTVD extracts the sequence of assembly instructions from the code section (also known as the “.text” section) of malware and uses the transfer probabilities of the unique combination of every 2 consecutive letters or numbers from the sequence of assembly instructions to construct the three-channel image. This image contains richer information about assembly instructions than the grayscale image and the opcode n-gram image, and is beneficial to improving the accuracy of malware classification. In addition, it does not require compression or truncation of the generated images. The main contributions of this article are as follows:

1) A three-channel malware visualization method based on assembly instructions and Markov transfer matrices is proposed. The extracted content and filling mode are defined to characterize the critical factors for the malware visualization task. Subsequently, a new malware visualization method is proposed. The image generated by this method focuses on retaining the information about assembly instructions in the code section of malware with a reduced and equal size, which is helpful to improve the accuracy and efficiency of malware classification.

2) A CNN is designed to effectively classify the three-channel images generated by our malware visualization method. Compared with common CNNs, such as AlexNet (Krizhevsky et al., 2012), VGG16 (Simonyan and Zisserman, 2014), and VGG19 (Simonyan and Zisserman, 2014), our presented architecture has fewer convolutional and fully connected layers, which is conducive to less time consumption during training.

3) A malware classification method called MCTVD, which combines the three-channel images with our presented CNN, is proposed. Experiments on a public dataset from Microsoft Corporation show that MCTVD is superior to the traditional grayscale image-based, byte-level Markov-based, and RGB color image-based methods in terms of accuracy and macro F1-score.

The rest of this article is organized as follows. Section 2 gives a brief introduction to the current malware classification methods. Section 3 describes the motivations of our proposed method. The proposed MCTVD is detailed in Section 4. In Section 5, experimental results regarding our method are presented and compared with other works. Finally, Section 6 summarizes the work of the article.

2. Related work

Over the last 20 years, an increasing number of researchers have proposed many malware classification methods based on machine learning technologies. They can be roughly divided into two types: methods based on traditional machine learning and methods based on malware images and deep learning.

2.1. Malware classification methods based on traditional machine learning

These methods rely on handcrafted features based on expert knowledge, and their feature engineering processes are generally time-consuming (Gibert et al., 2020). These methods can be classified into two categories according to the different kinds of features used: static and dynamic.

2.1.1. Static feature-based methods

Static features are those that can be obtained without running the malware. These are features such as byte sequences (Yousefi-Azar et al., 2018), opcode sequences (Yeboah et al., 2021; Zhang et al., 2019), API calls (Soni et al., 2022), and function call graphs (FCGs) (Hassen and Chan, 2017). Shalaginov et al. (2018) conducted an in-depth survey of different machine-learning methods for the classification of static characteristics of Windows PE files. A framework to detect malicious applications and to categorize benign applications with an ensemble of multiple classifiers, namely, SVM, KNN, NB, classification and regression tree (CART), and random forest (RF) was proposed in (Wang et al., 2018). This framework extracts as many as 2,374,340 static features that fall into 11 types (Restricted API calls, Suspicious API calls, and so on) from each APK file and chooses the top-ranked 34,630 static features for detection and categorization. Zhang et al. (2019) proposed an opcode sequence-based ransomware classification method. This method first converts the opcode sequences from ransomware samples into n-gram sequences, and then a vector consisting of term frequency values of the n-gram feature is used as the feature vector. Finally, five machine learning methods are used to perform ransomware classification. Soni et al. (2022) proposed a malware classification method using the features extracted from API calls and opcode sequences. After extracting the features, four machine learning algorithms, NB, logistic regression, RF, and SVM, are used to classify malware. Hassen and Chan (2017) proposed an FCG vector representation based on function clustering that yields significant performance gains and is then used for malware classification.

2.1.2. Dynamic feature-based methods

These features are obtained by dynamic analysis methods. Dynamic analysis observes the interaction between malware and the system by executing the executable file of the malware in a controlled environment. Registry changes, memory writes, and API call traces are commonly used as dynamic features of malware. Amer and Zelinka (2020) used various API functions with similar contextual characteristics as a cluster by studying the contextual relationships that exist between API functions in malware. This article proved that there is indeed a clear difference between the API call sequence of malware and benign software. San et al. (2019) proposed a malware family classification system by extracting the prominent API features of 11 malware families from a Cuckoo sandbox. Xiao et al. (2020) proposed a graph repartition algorithm to extract fragment behaviors from original API call graphs and then obtained the crucial N-order subgraph for malware detection and classification. An association rule-based malware classification using common subsequences of API calls was proposed in (D’Angelo et al., 2021). This method exploits the probabilities of transitioning between two API invocations in the call sequence.

2.2. Malware classification methods based on malware images and deep learning

The image-based malware classification method was first introduced by Nataraj et al. (2011), who used the binary content of malware to generate a grayscale image and applied GIST and KNN to extract texture features to classify malware. A few years later, image-based malware classification methods using machine learning were also proposed (Ghouti and Imam, 2020), which have the limitation of needing a complex feature engineering process. With the rapid development of deep learning technology in recent years
and its excellent performance in the image classification field, malware classification methods based on malware images and deep learning have become a research hotspot in the malware classification field. This type of approach can eliminate much of the feature engineering work and obtain good classification accuracy (Yuan et al., 2020).

Within the existing malware classification methods based on malware images and deep learning, the grayscale image converted directly from the binary sequence of malware is used by a majority of methods (Cui et al., 2018; Pinhero et al., 2021; Yan et al., 2018; Zhao et al., 2020). For instance, Cui et al. (2018) converted the binary sequence of malware into grayscale images. Then, these images were classified by using a CNN. Since only images with uniform sizes can be directly applied to CNNs, this type of method requires compression or truncation of the grayscale images when they are used to train CNNs. Although Yuan et al. (2020) also used binary sequences of malware, they converted the binary sequence into a Markov image with a fixed size of 256 × 256 through the byte transfer probability matrix. Then, a deep CNN was used for training a model to classify malware. Their Markov image contains the binary information and global structure of malware while ignoring specific semantic information. The literature has shown that opcode sequences can represent program behaviors (Jian et al., 2021). Correspondingly, opcode sequences have also been used in some malware classification methods based on malware images and deep learning to generate images (Ni et al., 2018; Zhang et al., 2016). Zhang et al. (2016) collected opcode sequences from binary files in the dataset and used them to construct images. However, the generated images are generally very sparse because the opcodes contained in a single sample are limited. In addition to grayscale images, some other forms of malware images have also been proposed recently for malware classification. For instance, Wang et al. (2019a) converted the byte sequence into an RGB color image. This conversion maps every 8 bits to an integer value of RGB in sequence. An RGB image named “CoLab image” was proposed by Xiao et al. (2021). The CoLab image uses colored label boxes to mark the sections of malware. A malware classification method based on CoLab image, VGG16, and SVM was constructed. Yuan et al. (2022) converted a malware binary file into a multidimensional Markov image. This image is a combination of several byte transfer probability matrices and contains richer information about byte distribution of a malware binary file than the Markov image proposed by Yuan et al. (2020).

Different from existing work, in this article, the sequence of assembly instructions and the transfer probabilities of the unique combination of every 2 consecutive letters or numbers from the sequence of assembly instructions are used as extracted content and filling mode, respectively, to generate a three-channel image from malware. In this way, the byte distribution of assembly instructions, the dependencies of an opcode on its previous opcode, and the quality of each opcode in malware are used to characterize malware. The image thus can provide richer information about assembly instructions of malware than binary and opcode sequences.

3. Motivation

Since excellent performance can be obtained via malware classification methods based on malware images and deep learning, this article focuses on designing a method of this type. One key problem of a malware classification method based on malware images and deep learning is how to convert malware into an image. This is because a deep learning model requires a data representation from which it can effectively extract key features. As shown in Fig. 1, for generating an image from malware, there are two critical factors: what content is extracted from malware and how to fill the content.

Fig. 1. Extracted content and filling mode for generating a malware image.

Definition 1 (Extracted Content): Extracted content is the substance that is extracted from a malware file to prepare for filling an image.

Definition 2 (Filling Mode): The filling mode is the method of filling the extracted content into an image.

Extracted content determines the collection of available information that can be used to fill an image. A binary sequence, opcode sequence, and sequence of assembly instructions are instances of extracted content. The filling mode gives the specific
pixel values and the size of the generated image and thus determines what and how much information or property in the extracted content can be provided by the generated image. It can be divided into two classes: nonuniform-sized and uniform-sized. When the sizes of the generated images after filling are different and thus cannot be directly applied to a CNN, this filling mode belongs to the class of nonuniform size; otherwise, it belongs to the class of uniform size. The 8 binary numbers and the stable value used in (Fu et al., 2018) are instances of the filling mode of nonuniform size, and the Markov transition probability and the SimHash value are examples of the filling mode of uniform size. Therefore, the extracted content and filling mode undoubtedly influence whether a generated image is of high quality, i.e., whether the generated image can provide sufficient valuable information about its original malware file, which is conducive to distinguishing the generated images belonging to different malware families.

3.1. Extracted content

Binary and opcode sequences are the two most common types of extracted content used to generate the malware image. On the one hand, the binary sequence preserves binary information and the global structure of a malware file while ignoring the specific semantic information contained in the code section. On the other hand, the opcode sequence preserves partial information in the assembly instructions of malware. The literature has shown that the opcode sequence is better than the binary sequence in terms of analyzing malware files (Manavi and Hamzeh, 2017; Raff et al., 2018). However, the opcode sequence lacks operands, which are participants in the execution of assembly instructions, i.e., the objects of various operations; thus, it contains only partial information about the assembly instructions of malware.

The address space of a PE file is flat, and its code and data are stored in different sections in a certain format. Usually, the data in different sections are logically related. For example, the code section stores program codes of a PE file, while the “.data” section stores the data variables of the program. Program code is the core of the program, and different malware from the same family generally have similar program codes. To preserve more information about the assembly instructions of malware, the extracted content used in this article is the sequence of assembly instructions in the code section of a malware file. An assembly instruction includes an opcode and one or more operands. In a PE file, the code section is a block whose main content consists of assembly instructions. Therefore, we focus on the code section of the PE file rather than on the whole PE file and discard the other sections, such as the “.data” section. Fig. 2 gives the assembly instructions in code blocks of two malware samples from the same family, and it is obvious that the assembly instructions of the two malware samples are extremely similar in terms of opcode and execution sequence. There are three main types of operands: immediate operands, register operands, and memory operands. The similarity of the first two types of operands can be observed intuitively, while the similarity of the memory operands can be better observed by their relative addresses rather than the absolute addresses. The relative address of a memory operand is obtained by the absolute address minus the code address. Intuitively, using the sequence of assembly instructions as the extracted content is conducive to preserving richer information about assembly instructions of malware than binary and opcode sequences.

Fig. 2. Comparison of the assembly instructions in code blocks of two malware samples from the same family.

3.2. Filling mode

As mentioned above, there are two types of filling modes: nonuniform-sized and uniform-sized. The most commonly used grayscale image uses every 8 binary numbers as a pixel, and thus, its filling mode belongs to the class of nonuniform size. The size of the malware image generated by this filling mode varies with the malware size. Therefore, it requires using byte truncation or image scaling methods to unify the sizes of malware images when
training with a CNN. However, the original binary information of malware could be partially missing when the sizes of grayscale images are unified (Yuan et al., 2020). In addition, when faced with a malware variant with relocation sections, the similarities between the malware variant’s image and its original malware image are low. Similarly, the stable values of “section entropy” and “section size” used in (Fu et al., 2018) are another filling mode belonging to the nonuniform size class.

For the uniform-sized filling mode, the SimHash value and Markov transition probability are the most commonly used approaches. In (Ni et al., 2018), SimHash was used to convert opcode sequences of different malware samples into images of equal size, but it can only generate short-length values; thus, the image that is converted from SimHash values usually requires interpolating, which may introduce meaningless padding information. For Markov transition probability, the size of the generated image is fixed and thus the image can be used directly as input for a CNN. In (Manavi and Hamzeh, 2017), the frequency of the unique combination of every 2 consecutive opcodes that are extracted from the opcode sequence of malware is used to directly form a Markov image. Since there are many opcodes, the generated Markov images will be large and sparse. This may lead to the effective information in the image being too sparse and create some difficulties in the training process of a CNN (Sun and Qian, 2021). Although fixing some of the opcodes as detection objects can deal with the above problem, it leads to a partial loss of useful information.

To avoid the useful information being missed due to the discarding of the bytes or opcodes and to alleviate the sparseness of the generated image, we use the transfer probabilities of the unique combination of every 2 consecutive letters or numbers extracted from the sequence of assembly instructions to generate Markov images. Punctuation of the sequence of assembly instructions is omitted because it is only used to separate operands. There are only 62 different letters and numbers, which is helpful to generate matrices with a small and uniform size. Most new malware comes from known malware with some code differences (Sun and Qian, 2021). Therefore, it can be inferred that malware of the same family has great similarity in code structures. Specifically, when generating our three-channel image, the transfer probability of the unique combination of 2 consecutive letters or numbers, the transfer probability of the unique combination of the first letters of 2 consecutive opcodes, and the transfer probability of the unique combination of the last 2 consecutive letters in each opcode are used. They can (approximately) represent the assembly instructions’ byte distribution, the dependencies of an opcode on its previous opcode, and the quality of each opcode. Compared with the previous filling modes, this filling mode can provide richer information about assembly instructions in a malware file and has a uniform size, so no useful information about assembly instructions is lost, and it is suitable for classifying variants with relocation sections. After generating the three-channel image, a CNN is used to train the malware classification model. CNNs can discover the local features of images and are a good choice for classifying images according to current research. Since the size of the proposed three-channel image is only 62 × 62 × 3, a CNN that has a few fully connected layers is designed for the classification of the malware images. The details will be introduced in Section 4.

4. MCTVD method

This section describes how MCTVD classifies malware based on the content in Section 3. The overall framework of MCTVD is shown in Fig. 3. It consists of the following three steps.
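Before the three steps are detailed, the filling mode summarized in Section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the ordering of the 62 states (digits, uppercase, lowercase), the treatment of the first whitespace-separated token of a line as the opcode, and the choice to leave rows without any observed transition at zero are all assumptions made for illustration. Transition counts are row-normalized, which is how the transfer probability is defined in Section 4.2.

```python
import string

# 62 states: 10 digits, 26 uppercase and 26 lowercase letters.
# The ordering of the states is an assumption; the text does not fix one.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
N = len(ALPHABET)  # 62


def transfer_matrix(pairs):
    """Row-normalized N x N matrix built from (previous, next) character pairs."""
    counts = [[0] * N for _ in range(N)]
    for prev, nxt in pairs:
        if prev in INDEX and nxt in INDEX:
            counts[INDEX[prev]][INDEX[nxt]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transition stay all-zero (assumption).
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def three_channel_image(instructions):
    """Return a 62 x 62 x 3 nested list from a list of instruction strings."""
    # Channel 1: consecutive letters/numbers of the whole sequence,
    # with punctuation and whitespace dropped.
    stream = [ch for line in instructions for ch in line if ch in INDEX]
    m1 = transfer_matrix(zip(stream, stream[1:]))
    # Hypothetical parsing rule: the opcode is the first token of each line.
    opcodes = [line.split()[0] for line in instructions if line.split()]
    # Channel 2: first letter of an opcode -> first letter of the next opcode.
    firsts = [op[0] for op in opcodes]
    m2 = transfer_matrix(zip(firsts, firsts[1:]))
    # Channel 3: second-to-last letter -> last letter inside each opcode.
    m3 = transfer_matrix((op[-2], op[-1]) for op in opcodes if len(op) >= 2)
    return [[[m1[i][j], m2[i][j], m3[i][j]] for j in range(N)] for i in range(N)]


if __name__ == "__main__":
    image = three_channel_image(["mov eax, 1", "push ebx", "mov ecx, eax"])
    print(len(image), len(image[0]), len(image[0][0]))
```

Because the matrix size depends only on the 62-symbol alphabet, the output shape is the same for any input length, which is the property that removes the need for truncation or scaling.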
Step 1: Sequence extraction. The assembly instructions in the code section of malware are extracted from executable files with the help of static analysis tools such as IDA Pro. These assembly instructions are viewed as a sequence of assembly instructions in the later steps.

Step 2: Malware visualization. The sequence of assembly instructions obtained in step 1 is used to generate three Markov matrices. As a result, a 62 × 62 × 3 three-channel image is generated.

Step 3: Model training and classification of malware images. The three-channel images are used to build a malware classification model by using our presented CNN. Then the three-channel images converted from the samples to be classified can be classified into different families by this model.

4.1. Sequence extraction

The sequence extraction step extracts a sequence of assembly instructions as the extracted content for the three-channel image of MCTVD. In a PE file, the code section stores the code to be executed in the running process. To obtain more useful and nonredundant features, MCTVD focuses on the assembly instructions in the code section of malware. Since these instructions cannot be obtained directly from the malware itself, it is necessary to use a third-party analysis tool such as IDA Pro to convert the malware files into assembly files. After obtaining the assembly file, MCTVD extracts the assembly instructions in the code section of the assembly file. Then, these instructions are combined into a sequence of assembly instructions. When generating the sequence of assembly instructions, the opcode, immediate operands, and register operands in the extracted assembly instructions are retained directly. For the memory operands, their relative addresses are retained in the sequence of assembly instructions and can be obtained as their absolute addresses minus their code addresses. Compared with the binary and opcode sequences, the sequence of assembly instructions contains richer information about the assembly instructions of malware.

4.2. Malware visualization

In the malware visualization step, MCTVD uses the sequence of assembly instructions obtained in the previous step to generate a three-channel image. Specifically, we use the transfer probabilities of uppercase and lowercase letters or numbers of the sequence of assembly instructions as the pixel values. Punctuation is omitted because it is only used to separate operands or to help the computer understand human-written assembly code. The existence of such irrelevant information may lead to difficulties in the learning phase. The uppercase and lowercase letters and numbers of the sequence of assembly instructions can be represented as a byte stream $S$. Assuming that each letter or number is regarded as a state, then each element in stream $S$ has 62 possible states that include 26 uppercase letters, 26 lowercase letters, and 10 numbers, where $S = \{s_0, s_1, \ldots, s_{61}\}$. We assume that the next state is only related to its current state and hence the sequence of assembly instructions can be regarded as a Markov chain, i.e., $P(s_{i+1} \mid s_0, \ldots, s_i) = P(s_{i+1} \mid s_i)$. Assuming that the transfer from the previous state $m$ to the subsequent state $n$ occurs with a certain probability, the transfer probability $P_{m,n}$ is computed by using Formula (1).

$$P_{m,n} = P(n \mid m) = \frac{f(m,n)}{\sum_{n=0}^{61} f(m,n)} \tag{1}$$

where $f(m,n)$ is the frequency of moving from state $m$ to state $n$ ($m, n \in \{0, 1, \ldots, 61\}$).

The transfer probability matrix with $n$ states has $n^2$ transfer probabilities and is an $n \times n$ matrix. Therefore, the size of matrix $M$ used in MCTVD is small and uniform (62 × 62). As shown in Formula (2), the element $p_{i,j}$ in Row $i$ and Column $j$ of $M$ represents the transfer probability from state $i$ to state $j$.

$$M = \begin{bmatrix} p_{0,0} & p_{0,1} & \cdots & p_{0,61} \\ p_{1,0} & p_{1,1} & \cdots & p_{1,61} \\ \vdots & \vdots & \ddots & \vdots \\ p_{61,0} & p_{61,1} & \cdots & p_{61,61} \end{bmatrix} \tag{2}$$

To ensure that the generated malware image can provide rich information about the assembly instructions of malware, three transfer probability matrices generated from the sequence of assembly instructions are used to construct a three-channel image in MCTVD. That is, each transfer probability matrix is used as a channel of the three-channel image. A schematic diagram of the three-channel image used in MCTVD is shown in Fig. 4.

The transfer probability of the unique combination of every 2 consecutive letters or numbers is used to fill a matrix $M_1$, which serves as the first channel of the three-channel image. It can reflect the byte distribution of the assembly instructions of a malware file. Two-tuple opcodes can reflect the dependencies of an opcode on its previous opcode. To generate the second channel of the three-channel image, the transfer probability of the unique combination of the first letters of every 2 consecutive opcodes in the sequence of assembly instructions is used as the pixel values because it can approximately substitute the transfer probability of 2-tuple opcodes. This transfer probability is used to fill a 62 × 62 matrix (called $M_2$) to ensure that its size is the same as that of $M_1$. The quality of each opcode is a useful statistical characteristic for a PE file and we add this property to our malware image. Specifically, the transfer probability of the unique combination of the last two letters in each opcode of the sequence of assembly instructions is used to approximately substitute the quality of this opcode in a PE file. A 62 × 62 matrix $M_3$ filled by these transition probabilities is used to form the third channel of the three-channel image. Finally, a 62 × 62 × 3 matrix $M = [M_1 \mid M_2 \mid M_3]$ is constructed and is used to form a three-channel image. Fig. 5 shows the malware images
H. Deng, C. Guo, G. Shen et al. Computers & Security 126 (2023) 103084
belonging to two malware families that were generated using MCTVD. As shown in Fig. 5, images generated from malware of the same family have similar pixels and colors, while images generated from malware of different families differ significantly in some areas. These commonalities and differences suggest that a CNN that learns the features of the three-channel images could distinguish different malware families. The process of generating the three-channel images is given in Algorithm 1.

4.3. Model training and classification of malware images

This step aims to train a model to classify malware images into different families. In recent years, deep learning models have emerged, and their effectiveness has been proven in many fields. Some existing deep learning models, such as AlexNet, VGG16, and VGG19, show good performance in the field of image recognition. As mentioned above, the size of the three-channel image generated in step 2 is only 62 × 62 × 3. A complex model such as VGG16 is not well suited to our three-channel images, because such small images make the model difficult to train and prone to overfitting. To address this problem, we construct a CNN (shown in Fig. 6) with fewer layers than VGG16, VGG19, or AlexNet. Compared with these network structures, our presented architecture reduces the numbers of both the convolutional and the fully connected layers. In our presented architecture, the size of the convolution kernel in each convolutional layer is 3 × 3. Combined with the other parameters (stride=1, padding=‘same’), each convolutional layer keeps the same width and height as its input. The number of convolution kernels doubles from layer to layer; it begins with 64 and ends with 512. Each pooling layer uses maximum pooling with a 2 × 2 pool matrix, and the default step size is also 2 × 2. After each pooling layer, the length and width of the matrix are reduced by half; they begin with 62 and end with 3. To speed up the training, a ReLU function is selected as the activation function for both the convolutional layers and the fully connected layer. The ReLU function is given by Formula 3.

$$h(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \tag{3}$$

After convolution and pooling, the data are flattened into a one-dimensional vector by the flatten function. In addition, our proposed architecture includes a fully connected layer with an output size of 1024. Subsequently, a softmax layer is connected, whose number of neurons is set according to the number of malware families in the training dataset. The softmax function can be used to solve multiclass classification problems and is given by Formula 4.

$$\mathrm{softmax}(y_i) = \frac{\exp(a_i)}{\sum_{j=1}^{n} \exp(a_j)} \tag{4}$$

Table 1 gives a comparison of our presented architecture with AlexNet, VGG16, and VGG19. As shown, compared with these network structures, our presented architecture has fewer convolution layers and fully connected layers. Correspondingly, our presented architecture requires less time during training than AlexNet, VGG16, and VGG19 (see Section 5 for details). During the training process, the Adam optimizer is used to learn the parameters, and the network is trained using the cross-entropy loss.

5. Experimental evaluation

5.1. Dataset and experimental environment

The malware dataset used to evaluate MCTVD was derived from a malware classification contest held by Microsoft at Kaggle in 2015 (Ronen et al., 2018). It has been the benchmark dataset most widely used in the field of static malware analysis since 2016. The dataset consists of two separate parts: a training dataset and a test dataset. The training dataset contains nine malware families with a total of 10,868 samples. The test dataset contains 10,873 samples, but the labels of these samples are not publicly available. Therefore, similar to most of the literature, we used only the training dataset (hereafter referred to as the Microsoft dataset) to obtain experimental results. Table 2 lists the sample distribution of the Microsoft dataset. For each sample, the dataset provides two

Algorithm 1 Three-channel image generation.
Input: PE software E.
Output: Three-channel image M.
1: A ⇐ assembly language file extracted from sample E
2: C ⇐ code section obtained from A
3: for each line li in C do
4:   if li contains an assembly instruction then
5:     Rmo ⇐ calculate the relative address of the memory operand in the assembly instruction
6:     Aai ⇐ stores the opcode, immediate operands, register operands, and Rmo
7:   end if
8: end for
9: /* Fai is the set of the first letter of every 2 consecutive opcodes in Aai, Eai is the set of the last two letters of each opcode in Aai */
10: PAai ⇐ removes punctuation from Aai, leaving only letters and numbers
11: Fai ⇐ stores the first letter of every 2 consecutive opcodes from Aai
12: Eai ⇐ stores the last two letters of each opcode from Aai
13: M1 ⇐ Markov matrix M1 is generated by PAai
14: M2 ⇐ Markov matrix M2 is generated by Fai
15: M3 ⇐ Markov matrix M3 is generated by Eai
16: M ⇐ three-channel image M is constructed of M1, M2, M3
17: return M
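Steps 13 to 16 of Algorithm 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not fix: the 62 states are ordered as digits, uppercase letters, lowercase letters; each Markov matrix holds row-normalised transition frequencies between consecutive characters; and the three input strings are toy stand-ins for PAai, Fai, and Eai. The function names `markov_matrix` and `three_channel_image` are hypothetical.

```python
import string

# Assumed state order: digits, uppercase, lowercase (10 + 26 + 26 = 62 states).
STATES = string.digits + string.ascii_uppercase + string.ascii_lowercase
INDEX = {ch: i for i, ch in enumerate(STATES)}


def markov_matrix(stream):
    """Row-normalised transition frequencies between consecutive characters."""
    n = len(STATES)
    counts = [[0] * n for _ in range(n)]
    # Drop punctuation and any other character outside the 62 states.
    chars = [ch for ch in stream if ch in INDEX]
    for prev, cur in zip(chars, chars[1:]):
        counts[INDEX[prev]][INDEX[cur]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for states that never occur stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def three_channel_image(pa_stream, f_stream, e_stream):
    """Stack three 62 x 62 matrices into a 62 x 62 x 3 nested list."""
    m1, m2, m3 = (markov_matrix(s) for s in (pa_stream, f_stream, e_stream))
    n = len(STATES)
    return [[[m1[r][c], m2[r][c], m3[r][c]] for c in range(n)] for r in range(n)]


# Toy streams standing in for PAai, Fai and Eai (illustrative only).
image = three_channel_image("push eax, mov ebp esp", "pmmc", "shovll")
print(len(image), len(image[0]), len(image[0][0]))  # prints: 62 62 3
```

Because every image has the same 62 × 62 × 3 shape regardless of file size, no resizing or cropping step is needed before the CNN.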
Table 1
Comparison of our presented architecture with AlexNet, VGG16, and VGG19 in the network structure.
Network structure            Convolutional layers   Pooling layers   Fully connected layers   Total
AlexNet                      5                      3                3                        11
VGG16                        13                     5                3                        21
VGG19                        16                     5                3                        24
Our presented architecture   4                      4                1                        9
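The layer counts in Table 1 can be checked against the feature-map sizes described for the presented architecture. The sketch below assumes that each of the four convolutional layers is followed by one 2 × 2 max-pooling layer without padding, so odd sizes round down; the function name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(size=62, filters=(64, 128, 256, 512)):
    """Trace (height, width, channels) after each conv + max-pool block."""
    shapes = []
    for n_filters in filters:
        # A 3x3 convolution with stride 1 and 'same' padding keeps height and width.
        # A 2x2 max pool with stride 2 and no padding halves them, rounding down.
        size = size // 2
        shapes.append((size, size, n_filters))
    return shapes


shapes = feature_map_sizes()
height, width, channels = shapes[-1]
flattened = height * width * channels  # length of the vector fed to the dense layer

print(shapes)     # [(31, 31, 64), (15, 15, 128), (7, 7, 256), (3, 3, 512)]
print(flattened)  # 4608
```

Under these assumptions the spatial size goes 62, 31, 15, 7, 3, which matches the statement that it begins with 62 and ends with 3.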
file formats: malware binary files with the suffix “.bytes” (binary stream files without PE headers) and the corresponding assembly files with the suffix “.asm” disassembled by IDA Pro. MCTVD used only the assembly files. Note that 61 samples were removed in the experiment for MCTVD because their assembly files did not contain the code section.

To verify the effectiveness of MCTVD, we use stratified 10-fold cross-validation. That is, the dataset was divided into 10 subsets of equal size; the i-th subset was used as test data in turn, while the remaining subsets were used as training data.

MCTVD was implemented in Python 3 and trained on Ubuntu 18.04. Experiments were conducted with an Intel(R) Xeon(R) Gold 5220 CPU, one NVIDIA GeForce RTX 2080 Ti GPU, and 251 GB of RAM. Table 3 lists the parameters used in MCTVD and other network structures and methods in the experiment.

Table 2
Sample distribution of the Microsoft dataset.
Malware Family    Malware Type           Sample Number
Ramnit            Worm                   1541
Lollipop          Advertising            2478
Kelihos_ver3      Back Door              2942
Vundo             Trojan Horse           475
Simda             Back Door              42
Tracur            Download Software      751
Kelihos_ver1      Back Door              398
Obfuscator.ACY    Obfuscating Software   1228
Gatak             Back Door              1013
Total                                    10868

Four commonly used evaluation metrics were utilized to assess the classification performance of MCTVD, namely, accuracy, precision, recall, and F1-score. They can be calculated by Formulas 5∼8 based on true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

$$\mathrm{Accuracy} = \frac{\sum_{i \in \{1,\dots,k\}} TP_i}{N} \tag{5}$$

$$\mathrm{Precision} = \frac{\sum_{i \in \{1,\dots,k\}} TP_i}{\sum_{i \in \{1,\dots,k\}} (TP_i + FP_i)} \tag{6}$$

$$\mathrm{Recall} = \frac{\sum_{i \in \{1,\dots,k\}} TP_i}{\sum_{i \in \{1,\dots,k\}} (TP_i + FN_i)} \tag{7}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$

where N denotes the number of samples in the dataset with k malware families.

The meanings of TP, TN, FP, and FN for a specific malware family i ∈ {1, 2, . . . , k} are as follows: TP_i is the number of samples correctly predicted as family i, and FN_i is the number of samples that belong to family i but were not predicted as family i; TN_i is the number of samples correctly not predicted as family i, and FP_i is the number of samples predicted as family i that do not belong to family i. Furthermore, we used the receiver operating characteristic (ROC) curve and the area under the ROC (AUC)
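Formulas 5∼8 can be sketched in plain Python as follows. The function name `pooled_and_macro_metrics` and the toy labels are hypothetical. The macro-averaged precision and recall (per-family values averaged over families) are included only as an assumption about how the "Macro" columns of the result tables may be computed; the text itself gives only the pooled forms.

```python
def pooled_and_macro_metrics(y_true, y_pred, families):
    """Pooled scores following Formulas 5-8, plus macro precision and recall."""
    tp = dict.fromkeys(families, 0)
    fp = dict.fromkeys(families, 0)
    fn = dict.fromkeys(families, 0)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted as `pred` but belongs to another family
            fn[truth] += 1  # belongs to `truth` but predicted as another family

    def ratio(num, den):
        return num / den if den else 0.0

    total_tp = sum(tp.values())
    accuracy = ratio(total_tp, len(y_true))                                # Formula 5
    precision = ratio(total_tp, sum(tp[f] + fp[f] for f in families))      # Formula 6
    recall = ratio(total_tp, sum(tp[f] + fn[f] for f in families))         # Formula 7
    f1 = ratio(2 * precision * recall, precision + recall)                 # Formula 8
    # Assumed macro variants: average the per-family ratios.
    macro_p = sum(ratio(tp[f], tp[f] + fp[f]) for f in families) / len(families)
    macro_r = sum(ratio(tp[f], tp[f] + fn[f]) for f in families) / len(families)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "macro_precision": macro_p, "macro_recall": macro_r}


families = ["Ramnit", "Gatak", "Vundo"]
y_true = ["Ramnit", "Ramnit", "Gatak", "Vundo"]
y_pred = ["Ramnit", "Gatak", "Gatak", "Vundo"]
scores = pooled_and_macro_metrics(y_true, y_pred, families)
print(scores["accuracy"], scores["precision"], scores["recall"])  # 0.75 0.75 0.75
```

In single-label classification every prediction counts as exactly one TP or FP and every sample as exactly one TP or FN, so the pooled denominators in Formulas 6 and 7 both equal N and the pooled precision and recall coincide with accuracy. The macro variants can differ, as they do in this toy example.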
Table 3
Parameters of different network structures and methods in the experiment.
Network Structure or Method Optimizer Learning Rate Decay Rate Batch Size Epoch
Table 4
Results of our presented architecture, AlexNet, VGG16, and VGG19 under 10-fold cross-validation.
Fig. 8. ROC curves and AUC values of different network structures under 10-fold cross-validation.
Fig. 9. Training times of different methods under 10-fold cross-validation.
Fig. 10. ROC curves and AUC values of different methods under 10-fold cross-validation.

Table 5
Results obtained by different methods under 10-fold cross-validation.
Method      Accuracy   Macro Precision   Macro Recall   Macro F1-score
GDMC        0.9237     0.8948            0.8452         0.8649
MDMC        0.9826     0.9563            0.9422         0.9487
RGBDMC      0.9410     0.9198            0.8650         0.8839
MalCVS      0.9891     0.9821            0.9748         0.9763
MulMarkov   0.9915     0.9902            0.9628         0.9747
MCTVD       0.9944     0.9944            0.9913         0.9929

Markov were compressed into 256 × 256, 256 × 256, 256 × 256 × 3, 224 × 224 × 3, and 256 × 256 × 3, respectively.

As shown in Table 5, among the six methods, MCTVD obtained the best accuracy, macro precision, macro recall, and macro F1-score. It is worth mentioning that the accuracy of MCTVD is 5.34 percentage points higher than that of RGBDMC with the same CNN structure, which indicates that the malware image generated by MCTVD is markedly superior to the RGB image generated by RGBDMC for malware classification. In the comparison of training times shown in Fig. 9, MalCVS requires far less training time than the other five methods because it relies on a pretrained model and a traditional machine learning algorithm. Among the remaining five methods, the training time of MCTVD is less than that of GDMC, MDMC, RGBDMC, and MulMarkov. One main reason for this is that the size of the image generated in MCTVD is only 62 × 62 × 3, which is smaller than the sizes of the images generated by the other four methods. As the ROC curves and AUC values in Fig. 10 show, MCTVD obtains the highest AUC value among all the methods.

The confusion matrices obtained by the six methods are shown in Fig. 11. They help to show the detailed accuracy of the six methods for each of the nine malware families. From Fig. 11, we can observe that MCTVD outperforms GDMC, MDMC, RGBDMC, and MalCVS on all nine malware families, and MCTVD is inferior to MulMarkov in three out of the nine malware families.

To further assess the accuracy of MCTVD, the performances of some other state-of-the-art malware classification methods are given. For fairness, only methods that used the same dataset and the same division (the whole Kaggle Microsoft training dataset under 10-fold cross-validation) and that are based on one modality of data (.bytes or .asm) were selected. More specifically, the methods of (Drew et al., 2016), (Narayanan et al., 2016), (Drew et al., 2017), and (Hassen and Chan, 2017) are based on static features and traditional machine learning; the methods of (Lin and Yeh, 2022), (Gibert, Mateu, Planes, Vicens, 2018), (Ding et al., 2020), and (Gibert et al., 2018a) are based on static features and deep learning; and the methods of (Kim et al., 2017), (Kim and Cho, 2022), (Gibert et al., 2019), and (Ren et al., 2020) are based on malware images and deep learning. Table 6 compares the average accuracies of the different methods under 10-fold cross-validation on the Microsoft dataset. As shown in Table 6, MCTVD obtains higher average accuracy than the other methods. Hence it could be a highly competitive candidate for malware classification.

5.3.2. Small training dataset test

To evaluate the performance of MCTVD in the scenario where only limited training samples are available, this section presents the results of an experiment that tests MCTVD trained on a small training dataset. To avoid random errors, a specific 5-fold cross-validation was used. In each fold, 20% of the samples from the whole dataset were used as the training data, and the remaining 80% were used as the test data. We note that the value of each evaluation metric is the average over the 5 folds.
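The inverted split described above, where one fold trains and the other four test, can be sketched as follows. The stratified round-robin assignment and the absence of shuffling are assumptions made to keep the example deterministic; the function name `inverted_kfold` and the toy labels are hypothetical.

```python
from collections import defaultdict


def inverted_kfold(labels, k=5):
    """Yield (train_idx, test_idx): ONE fold trains, the other k-1 folds test."""
    by_family = defaultdict(list)
    for idx, label in enumerate(labels):
        by_family[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal each family's samples round-robin so folds keep similar family ratios.
    for indices in by_family.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        train = sorted(folds[i])
        test = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


labels = ["Ramnit"] * 10 + ["Gatak"] * 5
splits = list(inverted_kfold(labels))
for train, test in splits:
    print(len(train), len(test))  # each line prints: 3 12
```

Each sample is used for training exactly once across the five folds, and each metric would then be averaged over the five runs.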
Fig. 11. Confusion matrices obtained by different methods under 10-fold cross-validation.
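As a reading aid for the confusion matrices, the per-family accuracy they convey can be derived as in the sketch below. It assumes rows index the true family and columns the predicted family, which the caption does not state; the function names and toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, families):
    """Count samples by (true family, predicted family)."""
    index = {family: i for i, family in enumerate(families)}
    matrix = [[0] * len(families) for _ in families]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_family_accuracy(matrix):
    """Fraction of each true family's samples on the diagonal."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


families = ["Ramnit", "Gatak"]
cm = confusion_matrix(["Ramnit", "Ramnit", "Gatak"],
                      ["Ramnit", "Gatak", "Gatak"], families)
print(cm)                       # [[1, 1], [0, 1]]
print(per_family_accuracy(cm))  # [0.5, 1.0]
```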
Table 7 shows the comparison of the six methods under 5-fold cross-validation on the Microsoft dataset. From Table 7, we can see that even when the training dataset accounts for only 20% and the testing dataset accounts for 80%, the accuracy of MCTVD still reaches 98.72%, which is higher than that of the other five methods. Table 7 also shows that MCTVD achieves the best performance in terms of macro recall, macro precision, and macro F1-score under 5-fold cross-validation. As for the ROCs and AUCs of different methods given in Fig. 12, MCTVD obtains the highest AUC value under the 5-fold cross-validation. Fig. 13 gives the training times of different methods. Similar to the results obtained by different methods under 10-fold cross-validation, Fig. 13 shows that MCTVD requires less training time than GDMC, MDMC, RGBDMC, and MulMarkov, while it requires more training time than MalCVS because MalCVS relies on a pretrained model and a traditional machine learning algorithm. The resulting confusion matrices of the
Table 6
The average accuracies comparison of different methods under 10-fold cross-validation on the Microsoft dataset.
Table 7
Results obtained by different methods under 5-fold cross-validation.
Method      Accuracy   Macro Precision   Macro Recall   Macro F1-score
GDMC        0.8116     0.7258            0.6648         0.6820
MDMC        0.9620     0.9167            0.9014         0.9086
RGBDMC      0.8673     0.8248            0.7326         0.7520
MalCVS      0.9741     0.9545            0.9363         0.9436
MulMarkov   0.9756     0.9641            0.9318         0.9449
MCTVD       0.9872     0.9789            0.9584         0.9674

Table 8
Models in the ablation experiment.
Method     First Channel   Second Channel   Third Channel
MCTVD-F    •               ◦                ◦
MCTVD-S    ◦               •                ◦
MCTVD-T    ◦               ◦                •
MCTVD-FS   •               •                ◦
MCTVD-FT   •               ◦                •
MCTVD-ST   ◦               •                •
MCTVD      •               •                •
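The ablation variants in Table 8 differ only in which channels of the three-channel image are used. A minimal sketch of deriving each variant's input is given below. It assumes that • marks a channel that is used, that unused channels are dropped rather than zero-filled, and that images are stored as height × width × 3 nested lists; `ABLATION_MODELS` and `select_channels` are hypothetical names.

```python
# Channel usage per ablation model: (first, second, third), following Table 8.
ABLATION_MODELS = {
    "MCTVD-F": (True, False, False),
    "MCTVD-S": (False, True, False),
    "MCTVD-T": (False, False, True),
    "MCTVD-FS": (True, True, False),
    "MCTVD-FT": (True, False, True),
    "MCTVD-ST": (False, True, True),
    "MCTVD": (True, True, True),
}


def select_channels(image, mask):
    """Keep only the channels flagged in mask for every pixel of the image."""
    keep = [c for c, used in enumerate(mask) if used]
    return [[[pixel[c] for c in keep] for pixel in row] for row in image]


# A toy 1 x 2 image with three channel values per pixel.
toy_image = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
variant = select_channels(toy_image, ABLATION_MODELS["MCTVD-FT"])
print(variant)  # [[[0.1, 0.3], [0.4, 0.6]]]
```

With dropped channels, the first convolutional layer of each variant would take one, two, or three input channels accordingly.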
Fig. 14. Confusion matrices obtained by different methods under 5-fold cross-validation.
Fig. 15. The accuracies of the different models used in the ablation experiment under 10-fold cross-validation.

ing large and small training datasets. In the future, we will further explore the effect of the different numbers of channels for generating malware images on malware classification and use API calls or FCG features as extracted content to generate high-quality malware images.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Huaxin Deng: Methodology, Software, Data curation, Writing – original draft. Chun Guo: Conceptualization, Methodology, Formal analysis, Funding acquisition, Writing – review & editing. Guowei Shen: Formal analysis, Investigation, Funding acquisition, Resources. Yunhe Cui: Investigation, Validation, Writing – review & editing. Yuan Ping: Methodology, Writing – review & editing.

Data Availability

Data will be made available on request.

Acknowledgments

The authors thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the article. This work is supported by the National Natural Science Foundation of China under Grant No. 62162009, the Science and Technology Foundation of Guizhou Province under Grant No. [2020]1Y268, the Guizhou Major Special Science and Technology Project under Grant No. 20183001, the Key Technologies R&D Program of He’nan Province under Grant Nos. 212102210084 and 222102210048.

References

Amer, E., Zelinka, I., 2020. A dynamic windows malware detection and prediction method based on contextual understanding of API call sequence. Comput. Secur. 92, 101760.
AV-TEST, Av-test, 2022. https://www.av-test.org/en/statistics/malware/. Online. Accessed: 24 August 2022.
Basha, S.S., Dubey, S.R., Pulabaigari, V., Mukherjee, S., 2020. Impact of fully connected layers on performance of convolutional neural networks for image classification. Neurocomputing 378, 112–119.
Cui, Z., Xue, F., Cai, X., Cao, Y., Wang, G.-g., Chen, J., 2018. Detection of malicious code variants based on deep learning. IEEE Trans. Ind. Inf. 14 (7), 3187–3196.
Ding, Y., Wang, S., Xing, J., Zhang, X., Oi, Z., Fu, G., Qiang, Q., Sun, H., Zhang, J., 2020. Malware classification on imbalanced data through self-attention. In: 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). IEEE, pp. 154–161.
Drew, J., Hahsler, M., Moore, T., 2017. Polymorphic malware detection using sequence classification methods and ensembles. EURASIP J. Inform. Secur. 2017 (1), 1–12.
Drew, J., Moore, T., Hahsler, M., 2016. Polymorphic malware detection using sequence classification methods. In: 2016 IEEE Security and Privacy Workshops (SPW). IEEE, pp. 81–87.
D’Angelo, G., Ficco, M., Palmieri, F., 2021. Association rule-based malware classification using common subsequences of API calls. Appl. Soft Comput. 105, 107234.
Fu, J., Xue, J., Wang, Y., Liu, Z., Shan, C., 2018. Malware visualization for fine-grained classification. IEEE Access 6, 14510–14523.
Ghouti, L., Imam, M., 2020. Malware classification using compact image features and multiclass support vector machines. IET Inf. Secur. 14 (4), 419–429.
Gibert, D., Mateu, C., Planes, J., 2018. An end-to-end deep learning architecture for classification of malware’s binary content. In: International Conference on Artificial Neural Networks. Springer, pp. 383–391.
Gibert, D., Mateu, C., Planes, J., 2020. The rise of machine learning for detection and classification of malware: research developments, trends and challenges. J. Netw. Comput. Appl. 153, 102526.
Gibert, D., Mateu, C., Planes, J., Vicens, R., 2018. Classification of malware by using structural entropy on convolutional neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, pp. 7759–7764.
Gibert, D., Mateu, C., Planes, J., Vicens, R., 2019. Using convolutional neural networks for classification of malware represented as images. J. Comput. Virol. Hacking Tech. 15 (1), 15–28.
Hassen, M., Chan, P.K., 2017. Scalable function call graph-based malware classification. In: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pp. 239–248.
Jian, Y., Kuang, H., Ren, C., Ma, Z., Wang, H., 2021. A novel framework for image-based malware detection with a deep neural network. Comput. Secur. 109, 102400.
Kargarnovin, O., Sadeghzadeh, A. M., Jalili, R., 2022. Mal2GCN: a robust malware detection approach using deep graph convolutional networks with non-negative weights. arXiv preprint arXiv:2108.12473.
Kim, J.-Y., Bu, S.-J., Cho, S.-B., 2017. Malware detection using deep transferred generative adversarial networks. In: International Conference on Neural Information Processing. Springer, pp. 556–564.
Kim, J.-Y., Cho, S.-B., 2022. Obfuscated malware detection using deep generative model based on global/local features. Comput. Secur. 112, 102501.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114.
Li, C., Cheng, Z., Zhu, H., Wang, L., Lv, Q., Wang, Y., Li, N., Sun, D., 2022. DMalNet: dynamic malware analysis based on API feature engineering and graph learning. Comput. Secur. 122, 102872.
Lin, W.-C., Yeh, Y.-R., 2022. Efficient malware classification by binary sequences with one-dimensional convolutional neural networks. Mathematics 10 (4), 608.
Manavi, F., Hamzeh, A., 2017. A new method for malware detection using opcode visualization. In: 2017 Artificial Intelligence and Signal Processing Conference (AISP). IEEE, pp. 96–102.
Narayanan, B.N., Djaneye-Boundjou, O., Kebede, T.M., 2016. Performance analysis of machine learning and pattern recognition algorithms for malware classification. In: 2016 IEEE National Aerospace and Electronics Conference (NAECON) and Ohio Innovation Summit (OIS). IEEE, pp. 338–342.
Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B.S., 2011. Malware images: visualization and automatic classification. In: Proceedings of the 8th International Symposium on Visualization for Cyber Security, pp. 1–7.
Ni, S., Qian, Q., Zhang, R., 2018. Malware identification using visualization images and deep learning. Comput. Secur. 77, 871–885.
Pachhala, N., Jothilakshmi, S., Battula, B.P., 2021. A comprehensive survey on identification of malware types and malware classification using machine learning techniques. In: 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC). IEEE, pp. 1207–1214.
Pinhero, A., Anupama, M., Vinod, P., Visaggio, C.A., Aneesh, N., Abhijith, S., AnanthaKrishnan, S., 2021. Malware detection employed by visualization and deep neural network. Comput. Secur. 105, 102247.
Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., Nicholas, C.K., 2018. Malware detection by eating a whole EXE. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, pp. 268–276.
Ren, Z., Chen, G., Lu, W., 2020. Malware visualization methods based on deep convolution neural networks. Multimed. Tools Appl. 79 (15), 10975–10993.
Ronen, R., Radu, M., Feuerstein, C., Yom-Tov, E., Ahmadi, M., 2018. Microsoft malware classification challenge. arXiv preprint arXiv:1802.10135.
San, C.C., Thwin, M.M.S., Htun, N.L., 2019. Malicious software family classification using machine learning multi-class classifiers. In: Computational Science and Technology. Springer, pp. 423–433.
Shalaginov, A., Banin, S., Dehghantanha, A., Franke, K., 2018. Machine learning aided static malware analysis: a survey and tutorial. In: Cyber Threat Intelligence. Springer, pp. 7–45.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.