Shi Et Al. - 2023 - SFCGDroid Android Malware Detection Based On Sens
https://doi.org/10.1007/s10207-023-00679-x
REGULAR CONTRIBUTION
Abstract
Android is now one of the most popular operating systems in the world, in part because of its open-source character. That openness has also lowered the threshold for attackers to build malware, and more and more malware threatens people's lives.
Graphs can represent a program's syntactic and semantic structure and can naturally represent malicious behavior,
so we propose a malware detection method named SFCGDroid, which is based on the sensitive function call graph. We first decompile the Android
application to generate a function call graph (FCG) and extract the sensitive function call graph (SFCG) from the FCG. Second,
we extract two classes of features: (1) we use the Skip-gram model to obtain function embeddings, and (2) we treat the SFCG as a social
network and extract the triad attributes of the sensitive APIs. The two types of features are combined as the feature representation
of the SFCG and fed into a graph convolutional network (GCN) for malware detection. In experiments on a dataset of 26,939 Android
applications, SFCGDroid achieves 98.22% accuracy and a 98.20% F1 score.
Keywords Malware detection · Sensitive function call graph · Graph convolutional network · Skip-gram · Triads
1116 S. Shi et al.
the smali code, get the function call graph (FCG) from the smali code, and then obtain the SFCG based on the sensitive APIs. Each node in the graph has two types of attributes: semantic attributes and graph structure attributes. The node semantic attributes are obtained from the trained function embedding model. In addition, we further explore the hidden information in the SFCG structure: we treat the SFCG as a social network and obtain the triad attributes of each sensitive API in the graph. We feed these attributes into a GCN classifier for malware detection.

The main contributions of this paper are as follows:

– We propose SFCGDroid, an Android malware detection system based on sensitive function call graphs. We focus on sensitive API calls in the FCG, reduce the size of the FCG to generate the SFCG, and detect the existence of malicious behaviors.
– SFCGDroid combines function semantic information with graph structure information. It looks for suspected malicious behaviour from function node semantics and sensitive API call structures.
– We verified the effectiveness of SFCGDroid on a large dataset. The experimental results show that SFCGDroid outperforms some of the more recent detection methods.

The rest of the paper is organised as follows. Section 2 discusses related work on Android malware detection. Section 3 presents the framework and details of SFCGDroid, followed by the experimental results in Sect. 4. Section 5 briefly summarises the full paper.

2 Related work

The malware detection methods [11] that exist today fall into three main categories: static analysis, dynamic analysis and hybrid analysis.

2.1 Static analysis

Static analysis is the most frequently used method for Android malware detection; permissions [6], APIs [5], and intents [12] are often used as static features. Permissions are an important part of an Android application: the permissions required to run the application are shown when it is installed, and the application verifies that the permissions are granted when it performs some sensitive operations. In [6], 22 out of 135 permissions were selected as important permissions, and the detection accuracy was above 90%. Chavan et al. [13] found that permissions were an important feature and that, with carefully selected permission features, an accuracy of over 96% could be achieved by training with artificial neural networks, even with highly skewed data. However, [10] mentions that malware does not need to request permissions to perform sensitive operations, which is one of the reasons for the low accuracy of static analysis. Androdialysis [12] discovers the importance of intents in the detection of malware in Android applications; intents are an important means of cross-process communication in Android applications and are naturally a means of attack for malware. By combining intents and permissions, Androdialysis can achieve 95% accuracy, but intents need to be combined with other features to give good results; intent features alone do not work well. Droidapiminer [5] used APIs as features to detect malware, looking for key APIs and the parameters called by the APIs. However, this approach does not identify the behavioural information implied by the APIs, nor does it represent the interactions between the APIs.

Incorrect declarations of permissions and API calls make it difficult to represent implicit behavioral information, so work in recent years has tried to address these issues. MAMADROID [14] first extracts the API call sequences from the call graph and models the API call sequences as Markov chains to capture the behavioral information in the APIs. ProDroid [15] uses the API classes and method sequences to train a Profile Hidden Markov Model and achieves a detection accuracy of 94.5%. Image-based detection techniques have been an important approach for Android malware detection. DeepVisDroid [16] converted some files from the APK into grey-scale images, constructed four grey-scale image datasets, extracted two types of image-based features (local features and global features), fed them into a neural network model with a 1D-convolutional layer, and obtained a classification accuracy of 98.96% in effective time. Ünver et al. [17] converted Manifest.xml, DEX, and Resource.ARSC from the APK to grey-scale images, extracted four types of local features and three types of global features, and fed them into a machine learning classifier, AdaBoost, with an accuracy of 98.75%.

Similarly, graphs, as a representation of unstructured data, can model the behavioural activity of Android applications. Dapasa [18] extracted sensitive subgraphs, derived five numerical features based on the scale of the sensitive subgraphs, and fed them into a random forest classifier for classification. Gascon et al. [19] extracted function call graphs and used a linear-time graph kernel-inspired explicit mapping of the call graph to the feature space, but this approach selects fine-grained bytecodes as features and has no behavioural properties of the functions, and the detection performance is only 89%. CDGDroid [20] used control flow graphs (CFG) and data flow graphs (DFG), encoded them and fed them into a convolutional neural network (CNN) with a detection accuracy of over 99.5%. GDroid [21] proposed an API usage model to build a heterogeneous graph through APK-API and API-API edges, converted the application classification task into a node classification task, and used graph convolutional networks for
Android malware detection for the first time. AMalNet [22] used graph convolutional networks (GCNs) and independent recurrent networks (IndRNNs) for Android malware detection by constructing graphs based on the relative position relationships of nodes and combining NLP techniques. Ou et al. [23] proposed a feature called S3Feature, which extracts sensitive subgraphs and neighbour-sensitive subgraphs from function call graphs to mine local malicious behaviour. Our proposed SFCGDroid deals with sensitive function call graphs directly and does not lose coarse-grained features.

2.2 Dynamic analysis

Dynamic analysis is an important malware detection technique. Damodaran et al. [11] train Hidden Markov Models based on API call sequences and opcode sequences, compare the detection rates of models based on static analysis, dynamic analysis and hybrid analysis, and find that the fully dynamic approach is very effective in malware detection. Dynamic analysis is therefore another analysis method for Android malware detection. Unlike static analysis, dynamic analysis requires the Android application to run in a virtual environment or on a real Android device, and extracts features while the application is running. It requires more resources, but reduces the impact of application obfuscation. System calls are an important illustration of dynamic analysis, as they reveal the dynamic behaviour of an application: malware accesses sensitive resources more frequently, for example by sending SMS messages or opening the camera, and these accesses need to be implemented by system calls. Xiao et al. [24] extracted sequences of system calls and classified them with an LSTM model, achieving an accuracy of 93.7%. EnDroid [25] extracts the dynamic behaviour logs and system calls generated by the application during sandbox operation and classifies them using the Stacking algorithm, achieving 96.56% detection. However, dynamic analysis may fail to obtain valid features because the malicious behaviour of the application is not triggered, or because the malware detects the emulator environment and does not perform the malicious behaviour.

2.3 Hybrid analysis

Hybrid analysis combines static features and dynamic features. Surendran et al. [26] combined system calls, API calls, and permissions and detected malware with tree augmented naive Bayes (TAN), reaching 97% accuracy. NTPDroid [27] extracted network traffic as dynamic features and permissions as static features: network traffic analysis can detect malware that is remotely controlled, permissions can detect malware that does not generate network traffic, and the FP-Growth algorithm generates frequent patterns for the combination of the two types of features, with 94.25% detection. DL-Droid [28] improves code coverage through state-based input, which evaluates the current state of the Android application to ensure that malicious behaviour can be triggered. Dynamic features (application attributes and action events) and static features (permissions) are extracted while the application runs, and malicious behaviour is detected through these three types of features.

Overall, static analysis-based malware detection accounts for the majority of research, as static analysis does not have to run the application, saving resources while still yielding good detection results. Static analysis is thus one of the hotspots of Android malware detection. We generate sensitive function call graphs around sensitive APIs, focusing on local malicious behaviour, to automatically capture semantic feature information and obtain better detection performance.

3 Method

The overall structure of SFCGDroid is shown in Fig. 1. The APK is decompiled to extract the FCG, the SFCG is further extracted from the FCG, and the function semantic features are combined with the sensitive API triad representation features and sent to the GCN for detection.

[Fig. 1 Overall structure of SFCGDroid. Recoverable labels: Sensitive APIs, Extract FCG, Extract SFCG, Triads, Benign, Malware]

3.1 SFCG

Extracting the SFCG first requires extracting the FCG from the APK file. We use the apktool [29] and androguard [30] tools for decompiling to obtain the FCG. We then extract the SFCG from the FCG; the extraction process is shown in Algorithm 1.

The inputs are the APK file, the list of sensitive APIs, and the order number. The list of sensitive APIs is from [7]. Sensitive APIs are APIs that can perform sensitive operations, such as reading SMS or reading and writing files, and they are highly associated with malware. Sensitive APIs account for a very small percentage of the APIs provided by the Android SDK, which currently provides over 50,000 APIs, of which 426 are sensitive APIs that perform sensitive operations. The order parameter represents how far from a sensitive API node other nodes may be reached, and requires the FCG to be converted to an undirected graph.

In Algorithm 1, the function getFCG() in line 2 gets the FCG of the APK file, the function toUndirected() in line 3 gets the undirected graph UnFCG of the FCG, and the function getSAPIsOfFCG() in line 4 gets the sensitive APIs of the FCG. Si in line 5 represents each sensitive API in the sensitive API list. The function getNeiOrder() in line 6 gets the nodes in the undirected graph UnFCG whose distance from the sensitive API node is less than or equal to the order parameter, so the set N contains both sensitive and non-sensitive API nodes. Lines 8-15 find whether there are edge connections between the nodes in the set N; the edges between these nodes are obtained from the FCG to construct the SFCG. Since the nodes and edges in the SFCG come from the FCG, the SFCG is an induced subgraph of the FCG. Finally, the SFCG is constructed from the nodes and edges found.

Algorithm 1: Extract SFCG
Input : apk, senAPIs, order
Output: SFCG
 1  N = ∅, E = ∅
 2  FCG = getFCG(apk)
 3  UnFCG = toUndirected(FCG)
 4  apkSAPIs = getSAPIsOfFCG(FCG, senAPIs)
 5  for each Si ∈ apkSAPIs do
 6      N = N ∪ getNeiOrder(UnFCG, Si, order)
 7  end
 8  for each src ∈ N do
 9      Sdst = getDstNode(FCG, src)
10      for each dst ∈ Sdst do
11          if dst ∈ N then
12              E = E ∪ <src, dst>
13          end
14      end
15  end
16  SFCG ← (N, E)

3.2 Function semantic features

After obtaining the SFCG, we also need to obtain the function semantic features. One-hot encoding is a simple scheme, but it does not capture the association between function semantics and produces very sparse, high-dimensional vectors. In this paper, we use the Skip-gram model proposed by Mikolov et al. [31,32], which is based on the assumption that semantically similar words also have similar contextual semantics. After training is completed, each word is given a low-dimensional vector, which is a mapping of the word into the semantic space and is able to express the association between the semantics of the words.

Because the SFCG is constructed based on the sensitive API nodes, the SFCG contains both sensitive API nodes and the non-sensitive API nodes associated with them. Inspired by the Skip-gram model, we treat the sensitive API nodes as central words and the other nodes obtained through the sensitive API nodes as contextual words. Ultimately we treat all nodes in the SFCG as words and obtain the semantic properties $E_f$ of each node after training with the Skip-gram model.

Figure 2 shows the architecture for extracting the semantic features of functions by training the Skip-gram model. First, all functions in the SFCG are collected to construct a corpus, and each function in the corpus is treated as a word and transformed into an N-dimensional vector using the Skip-gram model. In the training process, each function is one-hot encoded as a vector of V dimensions, with V representing the size of the corpus. All inputs share the weights $W_{V \times N}$, where N is the dimension of the hidden layer and also the dimension of each function vector obtained after training is completed. The training objective of the Skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of functions $f_1, f_2, \ldots, f_T$, the objective can be written as the average log probability

\[ J(f) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-d \le j \le d} \log P(f_{t+j} \mid f_t) \tag{1} \]

Skip-gram models use log functions to prevent gradient explosion and to improve computational efficiency; $2d + 1$ refers to the sliding window size. $P(f_{t+j} \mid f_t)$ is defined by

\[ P(f_O \mid f_I) = \frac{\exp\left(e_{f_O}^{T} e_{f_I}\right)}{\sum_{i=1}^{V} \exp\left(e_i^{T} e_{f_I}\right)} \tag{2} \]

$e_{f_O}$ and $e_{f_I}$ are the vector representations of $f_O$ and $f_I$, and exp is the exponential function. $P(f_{t+j} \mid f_t)$ is defined by the softmax function, which is computationally very expensive because $\sum_{i=1}^{V} \exp\left(e_i^{T} e_{f_I}\right)$ requires summing over the entire corpus, so we use negative sampling techniques [31] to simplify and speed up training.
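To make Eqs. (1) and (2) concrete, the following minimal Python sketch evaluates the softmax probability and the average log probability on a toy vocabulary. It is an illustration only, not the training code used in the paper: the function names, the toy embeddings, the use of a single embedding table for both input and output vectors, and the choice to skip j = 0 and out-of-range positions are all assumptions made here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax_prob(embeddings, f_out, f_in):
    """P(f_out | f_in) as in Eq. (2): a softmax over the whole vocabulary."""
    e_in = embeddings[f_in]
    scores = {f: dot(e, e_in) for f, e in embeddings.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    denom = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[f_out] - top) / denom


def average_log_prob(embeddings, sequence, d):
    """J(f) as in Eq. (1); j = 0 and out-of-range positions are skipped (assumption)."""
    total = 0.0
    for t, center in enumerate(sequence):
        for j in range(-d, d + 1):
            if j == 0 or not 0 <= t + j < len(sequence):
                continue
            total += math.log(softmax_prob(embeddings, sequence[t + j], center))
    return total / len(sequence)


# Toy 2-dimensional embeddings; names and values are made up for illustration.
embeddings = {
    "getDeviceId": [0.9, 0.1],
    "sendTextMessage": [0.8, 0.2],
    "onCreate": [-0.5, 0.7],
}
sequence = ["onCreate", "getDeviceId", "sendTextMessage"]
print(softmax_prob(embeddings, "sendTextMessage", "getDeviceId"))
print(average_log_prob(embeddings, sequence, d=1))
```

The denominator in `softmax_prob` iterates over every entry of the vocabulary for each prediction, which is the cost that negative sampling [31] avoids by contrasting the observed pair with only a few sampled words.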
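The SFCG extraction of Algorithm 1 (Sect. 3.1) can also be sketched in a few lines of plain Python. This is a hedged illustration rather than the authors' implementation: the call graph is assumed to be given as a dictionary from caller to callees (in the paper it is obtained with apktool and androguard), and `extract_sfcg`, `neighbors_within` and the toy function names are hypothetical.

```python
from collections import deque


def neighbors_within(undirected, start, order):
    """Nodes whose hop distance from `start` is at most `order` (breadth-first search)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == order:
            continue  # do not expand past the requested radius
        for nxt in undirected.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return set(seen)


def extract_sfcg(fcg, sensitive_apis, order):
    """Return (nodes, edges) of the SFCG induced from a directed FCG.

    `fcg` maps each caller to the set of functions it calls (assumed input format).
    """
    # Line 3 of Algorithm 1: undirected view of the call graph.
    undirected = {}
    for src, dsts in fcg.items():
        undirected.setdefault(src, set())
        for dst in dsts:
            undirected[src].add(dst)
            undirected.setdefault(dst, set()).add(src)

    # Line 4: sensitive APIs that actually occur in this FCG.
    apk_sapis = [api for api in sensitive_apis if api in undirected]

    # Lines 5-7: collect every node within `order` hops of a sensitive API.
    nodes = set()
    for api in apk_sapis:
        nodes |= neighbors_within(undirected, api, order)

    # Lines 8-15: keep the directed FCG edges whose endpoints are both kept.
    edges = {(src, dst) for src in nodes for dst in fcg.get(src, ()) if dst in nodes}
    return nodes, edges


# Toy call graph; the function names are made up for illustration.
fcg = {
    "onCreate": {"loadConfig", "sendReport"},
    "sendReport": {"SmsManager.sendTextMessage"},
    "loadConfig": {"parseJson"},
    "parseJson": set(),
    "SmsManager.sendTextMessage": set(),
}
nodes, edges = extract_sfcg(fcg, ["SmsManager.sendTextMessage"], order=2)
print(sorted(nodes))
print(sorted(edges))
```

With order = 2, the toy example keeps the sensitive API, its caller and that caller's caller, and drops the unrelated `loadConfig` branch, which shows how the SFCG shrinks the FCG while remaining an induced subgraph of it.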
3.4 GCN
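As stated in the abstract, the combined node features of the SFCG are fed into a graph convolutional network. The sketch below shows one generic propagation step in the sense of Kipf and Welling [37], namely ReLU applied to the normalized adjacency (with self-loops) times the feature matrix times a weight matrix. It is an assumption-laden illustration: the helper names, the toy graph and the weights are hypothetical, and the layer sizes and pooling used by SFCGDroid are not reproduced here.

```python
import math


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(normalized_adjacency(adj), matmul(features, weights))
    return [[max(0.0, v) for v in row] for row in propagated]


# Two connected nodes with 2-dimensional features and a 2x1 weight matrix (toy values).
adj = [[0, 1], [1, 0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0], [2.0]]
out = gcn_layer(adj, features, weights)
print(out)
```

Each output row mixes a node's own features with those of its neighbours, which is how call-structure information around a sensitive API can influence the representation used for classification.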
Table 5 Performance evaluation of different feature sets

Feature    Dimension  Acc    P      R      F1
Ef         60         98.04  98.04  98.04  97.99
Tf         4          90.74  91.58  91.13  91.05
Ef + Tf    64         98.22  98.27  98.28  98.20

The float numbers in bold correspond to the best performance indicators in the table

RF       permission  96.05  98.93  93.20  95.98
RF       intent      93.82  99.73  87.88  93.43
RF       API         95.77  93.70  98.13  95.87
SVM      permission  94.57  97.30  91.81  94.48
SVM      intent      92.34  98.63  85.89  92.82
SVM      API         96.19  94.80  97.74  96.25
SAGPool  Ours        98.22  98.27  98.28  98.20

The float numbers in bold correspond to the best performance indicators in the table
Table 7 Description of the different methods
Method Year Feature Classifier
Table 8 Results of the comparison between SFCGDroid and other methods

Method         ACC(%)  P(%)   R(%)   F1(%)
Drebin         97.61   99.0   96.18  97.57
DeepMalDet     95.22   96.67  93.63  95.13
MalDetGCN      97.21   97.80  97.86  97.83
MalScan        98.17   98.13  98.21  98.17
NATICUSdroid   96.49   98.59  94.39  96.44
SFCGDroid      98.22   98.27  98.28  98.20

The float numbers in bold correspond to the best performance indicators in the table

Acknowledgements We would like to thank anonymous reviewers for their comments. This work was supported by Autonomous Region Key R&D Project (2021B01002), the Key Program of National Natural Science Foundation of China (U2003208), Major science and technology projects in the autonomous region (2020A03004-4).

Data Availability The datasets analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.
15. Sasidharan, S.K., Thomas, C.: ProDroid-an android malware detection framework based on profile hidden Markov model. Pervasive Mobile Comput. 72, 101336 (2021)
16. Bakour, K., Ünver, H.M.: DeepVisDroid: android malware detection by hybridizing image-based features with deep learning techniques. Neural Comput. Appl. 33(18), 11499–11516 (2021)
17. Ünver, H.M., Bakour, K.: Android malware detection based on image-based features and machine learning techniques. SN Appl. Sci. 2(7), 1–15 (2020)
18. Fan, M., Liu, J., Wang, W., Li, H., Tian, Z., Liu, T.: Dapasa: detecting android piggybacked apps through sensitive subgraph analysis. IEEE Trans. Inf. Forensics Secur. 12(8), 1772–1785 (2017)
19. Gascon, H., Yamaguchi, F., Arp, D., Rieck, K.: Structural detection of android malware using embedded call graphs. In: Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, pp. 45–54 (2013)
20. Xu, Z., Ren, K., Qin, S., Craciun, F.: CDGDroid: android malware detection based on deep learning using CFG and DFG. In: International Conference on Formal Engineering Methods, pp. 177–193. Springer, Cham (2018)
21. Gao, H., Cheng, S., Zhang, W.: GDroid: android malware detection and classification with graph convolutional network. Comput. Secur. 106, 102264 (2021)
22. Pei, X., Yu, L., Tian, S.: AMalNet: a deep learning framework based on graph convolutional networks for malware detection. Comput. Secur. 93, 101792 (2020)
23. Ou, F., Xu, J.: S3Feature: a static sensitive subgraph-based feature for android malware detection. Comput. Secur. 112, 102513 (2022)
24. Xiao, X., Zhang, S., Mercaldo, F., Hu, G., Sangaiah, A.K.: Android malware detection based on system call sequences and LSTM. Multimed. Tools Appl. 78(4), 3979–3999 (2019)
25. Feng, P., Ma, J., Sun, C., Xu, X., Ma, Y.: A novel dynamic Android malware detection system with ensemble learning. IEEE Access 6, 30996–31011 (2018)
26. Surendran, R., Thomas, T., Emmanuel, S.: A TAN based hybrid model for android malware detection. J. Inf. Secur. Appl. 54, 102483 (2020)
27. Arora, A., Peddoju, S.K.: NTPDroid: a hybrid android malware detector using network traffic and system permissions. In: 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pp. 808–813. IEEE (2018)
28. Alzaylaee, M.K., Yerima, S.Y., Sezer, S.: DL-Droid: deep learning based android malware detection using real devices. Comput. Secur. 89, 101663 (2020)
29. Apktool. https://ibotpeaches.github.io/Apktool. Accessed 26 Feb 2022
30. Androguard. https://github.com/androguard/androguard. Accessed 18 Feb 2019
31. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26 (2013)
32. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv preprint arXiv:1301.3781
33. Batagelj, V., Mrvar, A.: A subquadratic triad census algorithm for large sparse networks with small maximum degree. Soc. Netw 23(3), 237–243 (2001)
34. Allix, K., Bissyandé, T.F., Klein, J., Le Traon, Y.: Androzoo: collecting millions of android apps for the research community. In: 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pp. 468–471. IEEE (2016)
35. Lee, J., Lee, I., Kang, J.: Self-attention graph pooling. In: International Conference on Machine Learning, pp. 3734–3743. PMLR (2019)
36. Cangea, C., Veličković, P., Jovanović, N., Kipf, T., Liò, P.: Towards sparse hierarchical graph classifiers (2018). arXiv preprint arXiv:1811.01287
37. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2016). arXiv preprint arXiv:1609.02907
38. Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (2010)
39. Hagberg, A., Schult, D., Swart, P.: Exploring network structure, dynamics, and function using networkX. In: Varoquaux, G., Vaught, T., Millman, J. (eds.) Proceedings of the 7th Python in Science Conference (SciPy 2008), pp. 11–15 (2008)
40. Wang, M., Zheng, D., Ye, Z., Gan, Q., Li, M., Song, X., Zhang, Z.: Deep graph library: a graph-centric, highly-performant package for graph neural networks (2019). arXiv preprint arXiv:1909.01315
41. Ood, G.: Virustotal: R Client for the virustotal API. R package version 0.2.1 (2017)
42. Mahdavifar, S., Kadir, A.F.A., Fatemi, R., Alhadidi, D., Ghorbani, A.A.: Dynamic android malware category classification using semi-supervised deep learning. In: 2020 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pp. 515–522. IEEE (2020)
43. VirusShare. https://virusshare.com. Accessed November 2019
44. Wang, W., Wang, X., Feng, D., Liu, J., Han, Z., Zhang, X.: Exploring permission-induced risk in android applications for malicious application detection. IEEE Trans. Inf. Forensics Secur. 9(11), 1869–1882 (2014)
45. Au, K.W.Y., Zhou, Y.F., Huang, Z., Lie, D.: Pscout: analyzing the android permission specification. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security, pp. 217–228 (2012)
46. Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K., Siemens, C.E.R.T.: Drebin: effective and explainable detection of android malware in your pocket. Ndss 14, 23–26 (2014)
47. McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer, S., Joon Ahn, G.: Deep android malware detection. In: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pp. 301–308 (2017)
48. Vinayaka, K.V., Jaidhar, C.D.: Android malware detection using function call graph with graph convolutional networks. In: 2021 2nd International Conference on Secure Cyber Computing and Communications (ICSCCC), pp. 279–287. IEEE (2021)
49. Wu, Y., Li, X., Zou, D., Yang, W., Zhang, X., Jin, H.: Malscan: fast market-wide mobile malware scanning by social-network centrality analysis. In: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 139–150. IEEE (2019)
50. Mathur, A., Podila, L.M., Kulkarni, K., Niyaz, Q., Javaid, A.Y.: NATICUSdroid: a malware detection framework for android using native and custom permissions. J. Inf. Secur. Appl. 58, 102696 (2021)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.