Android Malware Family Classification using Images from Dex Files
Munyeong Kang, Dept. of Software Science, Dankook University, Republic of Korea, [email protected]
Jihyeon Park, Dept. of Software Science, Dankook University, Republic of Korea, [email protected]
Seonghyun Park, Dept. of Computer Engineering, Dankook University, Republic of Korea, [email protected]
Seong-je Cho, Dept. of Computer Science and Engineering, Dankook University, Republic of Korea, [email protected]
Minkyu Park, Dept. of Software Technology, Konkuk University, Republic of Korea, [email protected]

ABSTRACT
With the popularization of the Android platform, Android malware occupies the largest portion of mobile malware. Malware family classification is important for fast and accurate detection. We propose a new detection method using images generated from DEX files of Android apps. We generate two kinds of images: one from an entire DEX file and one from the data section of a DEX file. We apply a CNN to the classification of both kinds of images. The experiments show that the proposed method classifies malware families with 91% accuracy in both cases. When only the data section is used, the precision for the ExploitLinuxLotoor and Gappusin families improves. Also, the deviation among the precision, recall, and F1-score values is greatly reduced. The area under the Precision-Recall curve is almost the same in both experiments, which means that detection time can be shortened without deteriorating detection performance.

CCS CONCEPTS
• Security and privacy → Artificial immune systems; Software reverse engineering.

KEYWORDS
Android, malware, classification, machine learning, CNN

ACM Reference Format:
Munyeong Kang, Jihyeon Park, Seonghyun Park, Seong-je Cho, and Minkyu Park. 2020. Android Malware Family Classification using Images from Dex Files. In The 9th ACM/SIGAPP Conference on Smart Media and Applications, September 17–19, 2020, Jeju, Republic of Korea. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3426020.3426069

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SMA 2020, September 17-19, 2020, Jeju, Republic of Korea
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8925-9/20/09...$15.00
https://doi.org/10.1145/3426020.3426069

1 INTRODUCTION
As the number of Android smartphone users increases, malware targeting Android continues to increase. According to the McAfee Mobile Threat Report for the first quarter of 2020, the total number of mobile malware samples reached 35 million in the fourth quarter of 2019. This is an increase of about 40% compared to 25 million cases in the same period of the previous year [10]. Therefore, many researchers have studied Android malware detection and classification techniques. In particular, malware family classification groups malware into related families and plays an important role in detecting malware.
Existing malware family classification techniques extract feature information from malware using static or dynamic analysis and use it for detection. We propose a new Android malware family classification method based on image processing and a convolutional neural network (CNN). The method first produces a grayscale image from an Android app and then classifies malware families using these images. This technique is time-efficient and uses fewer computing resources because it does not need to extract API calls, control-flow graphs, permissions, opcodes, etc. by analyzing executable files or source code. It also does not need to consider whether detection bypass techniques such as packing or obfuscation have been applied. We create two types of images: one from the whole classes.dex file and the other from the data section inside the classes.dex file. The method using images from data sections only shows almost the same accuracy as the method using images from whole classes.dex files.

2 RELATED WORK
Yang and Chen [14] proposed a method that generates images from DEX files and classifies the images using a random forest algorithm. They used the DREBIN data set, and their set contains 14 families. In their experiment, the average accuracy was about 90%.
Seok and Kim [11] generated an image by converting each byte of a malicious binary file into an 8-bit grayscale pixel and classified the malware family by applying a CNN to the generated image.
Tang et al. [12] proposed a new neural network structure called ConvProtoNet to address the classification inaccuracy caused
by overfitting that occurs when the number of samples is small. Their experiments reported an average accuracy of more than 70% on data sets with a small number of samples.
Arp et al. [1] obtained feature information such as permission and API information using static analysis techniques, detected malicious apps using machine learning, and classified families.
Jung et al. [6] proposed an Android malware detection technique that converts the entire DEX file of an app and the data section of the DEX file into grayscale images, respectively, and applies a CNN deep learning algorithm to those images to classify the app as benign or malicious. They focused on malware detection, not malware family classification.
Huang and Kao [4] proposed a technique using image processing to detect Android malware. The technique treats the bytecode of classes.dex in the Android APK file like RGB color codes and converts classes.dex into a fixed-size color image. This image is then fed into a CNN for automatic feature extraction and training. Using a CNN not only reduces the number of parameters but also reflects the complexity of Android malware, saving a lot of time compared to other methods.

3 CNN-BASED MALWARE DETECTION METHOD
The procedure of the proposed method is shown in Fig. 1. Existing malware family classification techniques require several steps to extract feature information and convert it into the input format of a machine learning algorithm. In contrast, the proposed method can classify malicious code by simply creating a grayscale image from an APK file. Classification through image conversion minimizes the need for special expertise [9] and has the advantage of maintaining the overall image structure even if a small change occurs in the image [3].
If a CNN is used for image classification, local features of malicious app images can be extracted, and since models share learned parameters, new malicious apps can be classified through existing networks. Also, CNNs are tolerant of input noise and can classify efficiently because they learn local patterns in images.

Figure 1: Steps of the proposed method

3.1 Malware Images
The proposed technique detects and classifies malware by generating images from malicious code and applying a CNN to them. We use two types of images for classification, and the images are expressed in grayscale.

3.1.1 The Structure of a DEX file. Android apps can be developed using Java, and Java code is usually compiled to .class files. However, in the Android environment, .class files are optimized and converted to a DEX file [8]. A DEX file is the executable file format of an Android app and exists under the name classes.dex. As shown in Fig. 2, a DEX file consists of Header, String_IDs, Type_IDs, Proto_IDs, Field_IDs, Method_IDs, Class_Defs, Data (data section), and Link_Data. The data section includes the class data and executable code of each app. We create one image from the entire DEX file and another from the data section only.

Figure 2: The DEX file Structure

3.1.2 Image Creation. Images are grayscale, and each pixel is associated with a number representing its brightness: 0 represents black and 255 represents white. The larger the number, the brighter the pixel. The image is represented by a two-dimensional array. The number of channels is set to one so that the image can be used as the input of the CNN, and the image is given a fixed shape. Table 1 shows the relationship between the file size and the width of the generated image.

Table 1: Image Width by File Size

File Size Range (KB)    Image Width (pixels)
<10                     32
10–30                   64
30–60                   128
60–100                  256
100–200                 384
200–500                 512
500–1000                768
>1000                   1024
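The image creation described in Section 3.1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, 1 KB is taken as 1024 bytes, boundary sizes are assigned to the larger bucket, the last row is zero-padded, and the data section is located through the `data_size` and `data_off` header fields at offsets 0x68 and 0x6C as given in the public DEX format description. The output is a binary PGM (P5) image, chosen only because it needs no third-party library.

```python
import struct

# Table 1 as (exclusive upper bound in KB, image width in pixels).
WIDTH_BY_SIZE_KB = [(10, 32), (30, 64), (60, 128), (100, 256),
                    (200, 384), (500, 512), (1000, 768)]
MAX_WIDTH = 1024  # Table 1: files larger than 1000 KB


def image_width(num_bytes):
    """Look up the image width for a file of num_bytes bytes (Table 1)."""
    size_kb = num_bytes / 1024  # assumption: 1 KB = 1024 bytes
    for upper_kb, width in WIDTH_BY_SIZE_KB:
        if size_kb < upper_kb:  # assumption: a boundary size falls in the larger bucket
            return width
    return MAX_WIDTH


def bytes_to_rows(data, width):
    """Treat every byte as one 8-bit pixel (0 = black, 255 = white)."""
    rows = []
    for start in range(0, len(data), width):
        row = data[start:start + width]
        rows.append(row + bytes(width - len(row)))  # assumption: zero-pad the last row
    return rows


def data_section(dex):
    """Slice the data section out of a DEX file using its header fields."""
    # Assumption: little-endian data_size at 0x68 and data_off at 0x6C.
    data_size, data_off = struct.unpack_from("<II", dex, 0x68)
    return dex[data_off:data_off + data_size]


def to_pgm(rows):
    """Serialize pixel rows as a binary PGM (P5) grayscale image."""
    width = len(rows[0]) if rows else 0
    header = f"P5\n{width} {len(rows)}\n255\n".encode("ascii")
    return header + b"".join(rows)
```

Given the bytes of a classes.dex file in a variable `dex`, `to_pgm(bytes_to_rows(dex, image_width(len(dex))))` would produce the whole-file image, and passing `data_section(dex)` in place of `dex` would produce the data-section image. Resizing to the CNN input shape is not covered by this sketch.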

3.1.3 Dex file Image. To create an image of the DEX file, the DEX file is extracted from an APK. The DEX file is read as binary data in 8-bit units, and each unit is interpreted as an unsigned decimal number (Fig. 3).

Figure 3: Creating an image file from a DEX file

3.1.4 Data Section file Image. A DEX file contains the size and offset of its data section. The extracted data section is converted by the same process as in Fig. 3 to create the grayscale image.

3.2 CNN Model
Fig. 4 shows our CNN model for Android malware family classification. We use LeNet (or LeNet-5), which is one of the best-known CNN architectures. The LeNet structure is a stack of several building blocks: convolution layers, pooling layers, and a fully connected layer [5, 7, 13]. The first several hidden layers of LeNet can learn the strong local patterns of an image. For example, if the CNN has learned a pattern in the lower right corner of an image, it can recognize the pattern elsewhere, which lets the CNN process the image efficiently.
The convolution layer is followed by a pooling (sub-sampling) layer, which can reduce the number of learnable parameters and introduce translation invariance. The most popular form of pooling operation is max pooling, which extracts patches from the input feature maps, outputs the maximum value in each patch, and discards all the other values. We perform max pooling with a filter size of 2 x 2. As a result, the output is reduced to half the size in each spatial dimension, which reduces the size of the feature map and hence the number of weights in subsequent layers [7]. The output of the first phase (convolution and pooling applied repeatedly) is fed into the fully connected layer. Finally, both kinds of malicious app images, which are the CNN's input data, are reduced to a size of 64 x 64 x 1.

Figure 4: CNN model

4 EXPERIMENTAL RESULTS
In Section 4, we present the experimental results of evaluating the performance of the malicious app classification method proposed in this paper. The following two experiments were conducted.
(1) Performance comparison with other methods: We compared the proposed method with other techniques using the same data set. This shows that the proposed technique is effective for malware classification.
(2) Performance comparison when using the image of the entire DEX file and the image of the data section only: We compare the performance when using the image of the entire DEX file and the image of the data section only, and explore whether the overhead of classification can be reduced.

4.1 Dataset
We used the dataset collected by the Drebin Project. The dataset consists of 5,560 malicious apps from 179 different families [1]. Of these, 5,528 apps remain after excluding 28 files without DEX files and 4 files with APK file structure errors. Among them, we experimented with apps belonging to the 20 families with the largest number of members (Table 2). The ratio between the training set and the test set is 8:2.
Table 3 shows the average size of the image files generated from the DEX file and the data section, and the difference in size between the two files. The average size of image files was measured for each family. In Table 3, the 'ID' column indicates the malware family defined in Table 2, and the second and third columns indicate the average size of the two image files of each family, respectively. The last column shows how much smaller data section image files are compared to DEX image files. For example, for family A, the data section image averages 170 bytes compared to 208 bytes for the DEX image, a reduction of about 18%.

Table 2: Drebin Dataset

ID  Family               Samples    ID  Family       Samples
A   Adrd                 91         K   GinMaster    339
B   BaseBridge           326        L   Glodream     69
C   DroidDream           81         M   Iconosys     152
D   DroidKungFu          666        N   Imlog        43
E   ExploitLinuxLotoor   66         O   Kmin         147
F   FakeDoc              132        P   MobileTx     69
G   FakeInstaller        922        Q   Opfake       601
H   FakeRun              61         R   Plankton     624
I   Gappusin             58         S   SendPay      59
J   Geinimi              92         T   SMSreg       41

Table 3: Average Size of Image Files for each Family

ID  DEX image file size (bytes)  Data section image file size (bytes)  Reduction ratio (%)
A   208                          170                                   18
B   106                          86                                    18
C   124                          100                                   19
D   206                          168                                   18
E   106                          87                                    17
F   125                          83                                    25
G   20                           16                                    21
H   637                          533                                   16
I   125                          100                                   20
J   113                          91                                    19
K   230                          190                                   17
L   298                          246                                   17
M   32                           25                                    22
N   31                           24                                    23
O   123                          98                                    20
P   23                           18                                    23
Q   24                           21                                    12
R   569                          473                                   16
S   156                          137                                   12
T   114                          92                                    19

4.2 Performance of Family Classification
The basic terms are defined as follows. TP (True Positive) is the number of samples belonging to a family that are correctly predicted as that family. FN (False Negative) is the number of samples belonging to a family that are predicted as another family. FP (False Positive) is the number of samples not belonging to a family that are predicted as that family. Precision, Recall, and F1-Score were used as performance evaluation criteria. Precision is the fraction of correctly predicted samples among all samples predicted as the family, while recall is the fraction of correctly predicted samples among all samples belonging to the family.

\[ \text{Precision} = \frac{TP}{TP + FP} = \frac{\text{samples correctly predicted}}{\text{all samples predicted as the family}} \]

\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{\text{samples correctly predicted}}{\text{all samples belonging to the family}} \]

\[ \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

The Precision-Recall curve is a useful measure for evaluating a classifier on imbalanced samples. Precision is the ratio of correct answers among the results predicted as positive, and recall is the ratio of correct answers among the actual positive samples. The Precision-Recall curve shows the trade-off between precision and recall as the decision threshold for a positive prediction varies. The larger the area under this curve, the higher both precision and recall are. High precision is associated with low FP values and high recall is associated with low FN values. If both indicators are high, the classifier returns accurate results [2].

4.2.1 Performance comparison with other methods. We compare the proposed method with DREBIN, which classified families through static analysis and machine learning [1]. The proposed method showed about 91% accuracy with both the DEX file images and the data section images. As Table 4 shows, DREBIN reaches 93% accuracy. DREBIN analyzes the permissions requested by the app through static analysis, extracts API usage information, and classifies the family of malicious apps through machine learning. In contrast, the proposed method is efficient in terms of detection time and resource usage because it only converts DEX files or data sections into images.

Table 4: Accuracy of DREBIN and the proposed method

Research       Number of families   Accuracy
DREBIN         20                   93%
DEX            20                   91%
Data Section   20                   91%

4.2.2 Performance comparison when using the DEX file and the data section image. The experiments using the DEX file image and the data section image both showed an average accuracy of 91%. Table 5 and Table 6 show the detection performance of each. For the ExploitLinuxLotoor family (E), the precision increases from 45% with the DEX file images to 89% with the data section images. For the Gappusin family (I), the precision also increases, from 82% with the DEX file images to 100% with the data section images. The overall precision ranged from 54% to 100% in the experiment using the data section images, and from 45% to 100% in the experiment using the DEX file images.
Similarly, the recall ranged from 38% to 100% in the experiment using the DEX file images, but from 50% to 100% in the experiment using the data section images. In other words, the deviation between the values was reduced when using the data section images. As a result, the range of the F1-score also improves, from between 42% and 100% to between 65% and 100%, when using the data section images.

Table 5: Precision, recall, and F1-Score when using the Dex image

ID  Precision  Recall  F1-Score
A   100%       50%     67%
B   92%        88%     90%
C   92%        75%     83%
D   91%        92%     91%
E   45%        38%     42%
F   100%       96%     98%
G   98%        95%     97%
H   100%       92%     96%
I   82%        75%     78%
J   71%        56%     63%
K   70%        90%     79%
L   73%        57%     64%
M   91%        100%    95%
N   100%       67%     80%
O   93%        97%     95%
P   100%       100%    100%
Q   94%        98%     96%
R   94%        98%     96%
S   92%        100%    96%
T   88%        88%     88%

Table 6: Precision, recall, and F1-Score when using the Data Section image

ID  Precision  Recall  F1-Score
A   54%        72%     65%
B   94%        78%     86%
C   87%        81%     84%
D   87%        89%     88%
E   89%        62%     73%
F   93%        96%     94%
G   99%        97%     98%
H   86%        100%    92%
I   100%       50%     67%
J   71%        83%     77%
K   67%        81%     73%
L   75%        64%     69%
M   91%        97%     94%
N   86%        67%     75%
O   94%        100%    97%
P   100%       100%    100%
Q   97%        97%     97%
R   98%        96%     97%
S   100%       100%    100%
T   78%        88%     82%

Fig. 5 and Fig. 6 show the precision-recall curves for each family obtained with the DEX file image and the data section image, respectively. The overall curves are almost the same.

Figure 5: Precision-Recall curve for Dex Image

Figure 6: Precision-Recall curve for Data Section Image

In addition, Table 7 compares the training time when each of the two images is used. We can reduce the training time by about 7% by using the data section image.

Table 7: Time difference

Experiments    Training time (sec)
Dex            106.852
Data Section   100.126

On average, the image size is reduced by 18%, the classification performance is almost the same, and the training time is also shortened, so using the data section is effective.

5 CONCLUSIONS
In this paper, we proposed a method of classifying the family of malicious apps by creating images from the DEX file and the data section of an Android app. In both experiments, the method classifies malware families with an average accuracy of 91%. Since this technique does not require extracting feature information through static or dynamic analysis, it is effective in saving time and resources.
In addition, when the data section image is used, the size is reduced by about 18% compared to the DEX file image, while the accuracy is almost the same and the training time is also shortened. However, some families showed poorer performance when using the data section image. Further research is needed to improve performance by combining API information and permission information with the images.

ACKNOWLEDGMENTS
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded
by the Ministry of Science and ICT (no. 2018R1A2B2004830) and the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2015-0-00363) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

REFERENCES
[1] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket. In NDSS, Vol. 14. 23–26.
[2] David Cournapeau. 2007. scikit-learn. Retrieved Sep 1, 2020 from https://scikit-learn.org/
[3] Daniel Gibert Llauradó. 2016. Convolutional neural networks for malware classification. Master's thesis. Universitat Politècnica de Catalunya.
[4] TonTon Hsien-De Huang and Hung-Yu Kao. 2018. R2-d2: Color-inspired convolutional neural network (cnn)-based android malware detections. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2633–2642.
[5] Sakshi Indolia, Anil Kumar Goswami, SP Mishra, and Pooja Asopa. 2018. Conceptual understanding of convolutional neural network-a deep learning approach. Procedia computer science 132 (2018), 679–688.
[6] Jaemin Jung, Jongmoo Choi, Seong-je Cho, Sangchul Han, Minkyu Park, and Youngsup Hwang. 2018. Android malware detection using convolutional neural networks and data section images. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems. 149–153.
[7] Paul Sayak Nain, Aakash and Maynard-Reid Margaret. 2015. Keras. Retrieved Sep 1, 2020 from https://keras.io/
[8] Dong-Hyeok Park, Eui-Jung Myeong, and Joobeom Yun. 2016. Efficient Detection of Android Mutant Malwares Using the DEX file. Journal of the Korea Institute of Information Security & Cryptology 26, 4 (2016), 895–902.
[9] Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles Nicholas. 2017. Malware detection by eating a whole exe. arXiv preprint arXiv:1710.09435 (2017).
[10] R Samani. 2020. McAfee Mobile Threat Report Q1.
[11] Seonhee Seok and Howon Kim. 2016. Visualized malware classification based-on convolutional neural network. Journal of The Korea Institute of Information Security & Cryptology 26, 1 (2016), 197–208.
[12] Zhijie Tang, Peng Wang, and Junfeng Wang. 2020. ConvProtoNet: Deep prototype induction towards better class representation for few-shot malware classification. Applied Sciences 10, 8 (2020), 2847.
[13] Rikiya Yamashita, Mizuho Nishio, Richard Kinh Gian Do, and Kaori Togashi. 2018. Convolutional neural networks: an overview and application in radiology. Insights into imaging 9, 4 (2018), 611–629.
[14] Yi-min YANG and Tie-ming CHEN. 2016. Android malware family classification method based on the image of bytecode. Chinese Journal of Network and Information Security 2, 6 (2016), 38.