Android Malware Family Classification Using Images From Dex Files
Munyeong Kang
Dept. of Software Science, Dankook University
Republic of Korea
[email protected]

Jihyeon Park
Dept. of Software Science, Dankook University
Republic of Korea
[email protected]

Seonghyun Park
Dept. of Computer Engineering, Dankook University
Republic of Korea
[email protected]

ABSTRACT
With the popularization of the Android platform, Android malware occupies the largest portion of mobile malware. Malware family classification is important for fast and accurate detection. We propose a new detection method using images generated from DEX files of Android apps. We generate two kinds of images: one from an entire DEX file and one from the data section of a DEX file. We apply a CNN to the classification of both kinds of images. The experiments show that the proposed method classifies malware families with 91% accuracy in both cases. When only the data section is used, performance on the ExploitLinuxLotoor and Gappisin families improves, and the deviation between Precision, Recall, and F1-Score is greatly reduced. The area under the Precision-Recall curve is almost the same in both experiments, which means that detection time can be shortened without deteriorating detection performance.

CCS CONCEPTS
• Security and privacy → Artificial immune systems; Software reverse engineering.

KEYWORDS
Android, malware, classification, machine learning, CNN

ACM Reference Format:
Munyeong Kang, Jihyeon Park, Seonghyun Park, Seong-je Cho, and Minkyu Park. 2018. Android Malware Family Classification using Images from Dex Files. In The 9th ACM/SIGAPP Conference on Smart Media and Applications, September 17–19, 2020, Jeju, Republic of Korea. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3426020.3426069

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SMA 2020, September 17-19, 2020, Jeju, Republic of Korea
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-8925-9/20/09...$15.00
https://doi.org/10.1145/3426020.3426069

1 INTRODUCTION
As the number of Android smartphone users grows, malware targeting Android continues to increase. According to the McAfee Mobile Threat Report for the first quarter of 2020, the total number of mobile malware cases in the fourth quarter of 2019 reached 35 million, an increase of about 40% over the 25 million cases in the same period of the previous year [10]. Therefore, many researchers have studied Android malware detection and classification techniques. In particular, malware family classification groups malware into related families and plays an important role in detecting malware.
The existing malware family classification techniques extract feature information from malware and detect it using static or dynamic analysis. We propose a new Android malware family classification method based on image processing and a convolutional neural network (CNN). The method first produces a gray-scale image from an Android app and classifies malware families using these images. This technique is time-efficient and uses fewer computing resources because it does not require extracting API calls, control-flow graphs, permissions, opcodes, etc. through an analysis of executable files or source code. Nor does it need to consider whether detection bypass techniques such as packing or obfuscation have been applied. We create two types of images: one from the whole classes.dex file and the other from the data section inside the classes.dex file. The method using images from data sections only shows almost the same accuracy as the method using images from whole classes.dex files.

2 RELATED WORK
Yi-min and Tie-ming [14] proposed a method that generates images from Dex files and classifies them using a random forest algorithm. The experiments used the DREBIN data set, and there are 14 families in this set. The average accuracy in the experiment was about 90%.
Seok and Kim [11] generated an image by converting each byte of the malicious code binary file into an 8-bit grayscale pixel and classified the malicious code family by applying a CNN to the generated image.
Tang and Wang [12] proposed a new neural network structure called ConvProtoNet to solve the inaccuracy of classification caused
by overfitting that occurs when the number of samples is small. Their experiments reported an average accuracy of more than 70% for a small number of data sets.
Arp et al. [1] obtained feature information such as permissions and API information using static analysis, detected malicious apps using machine learning, and classified them into families.
Jung et al. [6] proposed an Android malware detection technique that converts the entire DEX file of an app and the data section of the DEX file into gray-scale images, respectively, and applies a CNN to those images to classify the app as benign or malicious. They focused on malware detection, not malware family classification.
Huang and Kao [4] proposed a technique using image processing to detect Android malware. The technique treats the bytecode of classes.dex of the Android APK file like an RGB color code and converts classes.dex into a fixed-size color image. This image is then fed into a CNN for automatic feature extraction and training. Using a CNN not only reduces the number of parameters but also reflects the complexity of Android malware, saving a lot of time compared to other methods.

3.1 Malware Images
The proposed technique detects and classifies malware by generating images from malicious code and applying a CNN to them. We use two types of images for classification, and the images are expressed in grayscale.

3.1.1 The Structure of a DEX file. Android apps can be developed using Java, and Java code is usually compiled to .class files. However, in the Android environment, a .class file is optimized and converted to a DEX file [8]. A DEX file is the executable file format of an Android app and exists under the name classes.dex. As shown in Fig. 2, a DEX file consists of Header, String_IDs, Type_IDs, Proto_IDs, Field_IDs, Method_IDs, Class_Defs, Data (data section), and Link_Data. The data section includes the class data and executable code of each app. We create images from both the entire DEX file and the data section alone.
3.1.3 Dex file Image. To create an image of the DEX file, the DEX file is extracted from an APK. The DEX file is read in 8-bit units, and each unit is interpreted as an unsigned decimal number (Fig. 3).
3.1.4 Data Section file Image. The header of a DEX file contains the size and offset of the data section. The extracted data section is converted by the same process as in Fig. 3 to create the grayscale image.

and the image of the data section only and explore whether to reduce the overhead of classification.
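
The conversion described in Sections 3.1.3 and 3.1.4 can be sketched in Python as follows. This is a rough illustration under stated assumptions rather than the authors' code: the image width of 256 pixels, the zero padding of the last row, and the helper names `bytes_to_gray_rows`, `extract_data_section`, and `to_pgm` are all hypothetical. The header offsets 104 and 108 for the data section's size and offset follow the public DEX format.

```python
import struct


def bytes_to_gray_rows(data, width=256):
    """Treat every byte as one unsigned 8-bit pixel (0-255), folded into rows.

    The width and the zero padding of the final row are assumptions.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width
    padded = bytes(data) + b"\x00" * padding
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def extract_data_section(dex_bytes):
    """Slice out the data section using data_size and data_off in the header."""
    if len(dex_bytes) < 112:
        raise ValueError("input is shorter than a DEX header")
    data_size, data_off = struct.unpack_from("<II", dex_bytes, 104)
    if data_off + data_size > len(dex_bytes):
        raise ValueError("data section extends past the end of the file")
    return dex_bytes[data_off:data_off + data_size]


def to_pgm(rows):
    """Serialize pixel rows as a binary PGM (P5) grayscale image."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + b"".join(bytes(row) for row in rows)
```

For a whole-file image one would pass the raw classes.dex bytes to `bytes_to_gray_rows`; for a data-section image, the output of `extract_data_section`. Writing the result of `to_pgm` to a `.pgm` file gives a grayscale image that common image viewers can open.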
Fig. 5 and Fig. 6 show the precision-recall curves for each family obtained with the Dex file image and the Data Section image, respectively. The overall curves are almost the same.

ACKNOWLEDGMENTS
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded
by the Ministry of Science and ICT (no. 2018R1A2B2004830) and the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2015-0-00363) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

REFERENCES
[1] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket. In Ndss, Vol. 14. 23–26.
[2] David Cournapeau. 2007. scikit-learn. Retrieved Sep 1, 2020 from https://scikit-learn.org/
[3] Daniel Gibert Llauradó. 2016. Convolutional neural networks for malware classification. Master’s thesis. Universitat Politècnica de Catalunya.
[4] TonTon Hsien-De Huang and Hung-Yu Kao. 2018. R2-d2: Color-inspired convolutional neural network (cnn)-based android malware detections. In 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2633–2642.
[5] Sakshi Indolia, Anil Kumar Goswami, SP Mishra, and Pooja Asopa. 2018. Conceptual understanding of convolutional neural network-a deep learning approach. Procedia computer science 132 (2018), 679–688.
[6] Jaemin Jung, Jongmoo Choi, Seong-je Cho, Sangchul Han, Minkyu Park, and Youngsup Hwang. 2018. Android malware detection using convolutional neural networks and data section images. In Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems. 149–153.
[7] Paul Sayak Nain, Aakash and Maynard-Reid Margaret. 2015. Keras. Retrieved Sep 1, 2020 from https://keras.io/
[8] Dong-Hyeok Park, Eui-Jung Myeong, and Joobeom Yun. 2016. Efficient Detection of Android Mutant Malwares Using the DEX file. Journal of the Korea Institute of Information Security & Cryptology 26, 4 (2016), 895–902.
[9] Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles Nicholas. 2017. Malware detection by eating a whole exe. arXiv preprint arXiv:1710.09435 (2017).
[10] R Samani. 2020. McAfee Mobile Threat Report Q1.
[11] Seonhee Seok and Howon Kim. 2016. Visualized malware classification based-on convolutional neural network. Journal of The Korea Institute of Information Security & Cryptology 26, 1 (2016), 197–208.
[12] Zhijie Tang, Peng Wang, and Junfeng Wang. 2020. ConvProtoNet: Deep prototype induction towards better class representation for few-shot malware classification. Applied Sciences 10, 8 (2020), 2847.
[13] Rikiya Yamashita, Mizuho Nishio, Richard Kinh Gian Do, and Kaori Togashi. 2018. Convolutional neural networks: an overview and application in radiology. Insights into imaging 9, 4 (2018), 611–629.
[14] Yi-min YANG and Tie-ming CHEN. 2016. Android malware family classification method based on the image of bytecodeConstruction of MDS matrices. Chinese Journal of Network and Information Security 2, 6 (2016), 38.