Android Malware Detection: An Eigenspace Analysis Approach
Suleiman Y. Yerima, Sakir Sezer, Igor Muttik
Abstract: The battle to mitigate Android malware has become more critical with the emergence of new strains incorporating increasingly sophisticated evasion techniques, in turn necessitating more advanced detection capabilities. Hence, in this paper we propose and evaluate a machine learning approach based on eigenspace analysis for Android malware detection, using features derived from static analysis characterization of Android applications. Empirical evaluation with a dataset of real malware and benign samples shows that a detection rate of over 96% with a very low false positive rate is achievable using the proposed method.

Keywords: malware detection; statistical machine learning; Android; eigenvectors; eigenspace; mobile security

Hence, in this paper, we propose and evaluate an Android malware detection scheme based on the face recognition technique known as eigenfaces [2]. The eigenfaces method measures the extent of similarities and differences amongst a set of faces in a database. It is based on the premise that every face is a linear combination of a basic set of faces called eigenfaces. The eigenfaces are projections of real faces into an eigenspace, which is defined by the vectors that span the significant variations amongst them. In this paper, these basic principles are applied to develop an eigenspace model based on static analysis characterization of applications, which is then used to determine whether or not a given new application is potentially malicious.
1236 | P a g e
www.conference.thesai.org
Science and Information Conference 2015
July 28-30, 2015 | London, UK
static analysis based on 58 features derived from API calls, intents, and system commands to classify Android applications as benign or suspicious. This model was further analysed in [9] to study the impact of using features derived from permissions alongside API calls, intents and system commands for Bayesian-based detection of Android malware. Peng et al. [10] also proposed and evaluated probabilistic generative models based on Android permissions features, while Sanz et al. [11] trained and compared several machine learning algorithms employing Android permissions features in order to detect malware. X. Liu and J. Liu [12] proposed a two-layer system based on permission features that classifies applications in three stages, with each stage utilizing the J48 Decision Tree algorithm to determine whether an application is suspicious or benign. Sahs and Khan [13] utilized call flow graphs to train a one-class SVM machine learning algorithm for detection of Android malware, while Arp et al. [14] proposed DREBIN, which also employs SVM and performs a broad static analysis to gather as many features of an application as possible for device-based detection of suspicious apps.

Other recent efforts include Sharma and Dash [15], where models based on API calls and requested permissions were built and investigated using Naïve Bayes and K-Nearest Neighbour classifiers. Yerima et al. [16] investigated parallel classifiers that utilized static features based on permissions, system commands and API calls. The paper compared various parallel classification combination schemes aimed at improving detection accuracy over single classification algorithms. Unlike [7]-[16], however, Kang et al. [17] did not use features from decompiled or disassembled code but rather utilized Dalvik bytecode frequency analysis extracted from 1300 malware samples and classified the malware into their 26 respective families using the Random Forest machine learning algorithm.

Different from all the aforementioned existing works, this paper proposes and evaluates an eigenspace approach for Android malware detection. Earlier work employing related approaches for malware detection was carried out for Windows-based metamorphic worms by Saleh [18] and Deshpande [19]. A different approach is developed in this paper for Android malware detection using the same basic principles.

III. EIGENSPACE APPROACH FOR ANDROID MALWARE DETECTION

A set of input features is required to characterize the applications when using the eigenspace approach. The features chosen to characterize the applications are motivated by the manner in which malware typically misuses certain API calls, permissions, and intents for malicious activities, and how it attempts to illegitimately run system commands for privilege escalation. The features are extracted from static analysis of the APKs in order to characterize each application according to its usage of these features.

A. Applications characterization

The features extracted from the applications for characterization and input into the eigenspace analysis based system consist of: API related features, permissions related features, intents, and commands related features. Previous works such as [7]-[12] and [14]-[16] have found these types of static features to be effective for training machine learning models for classification. Hence, we are similarly motivated to utilize features derived from these categories to develop the eigenspace based detection models proposed in this paper.

The API related features are obtained by mining the disassembled Dalvik executable (dex) file. These include API calls for accessing subscriber identity, device identity, executing external commands, intercepting broadcast notifications, encryption, etc. Indeed, most mobile malware attempts to steal sensitive information or send premium SMS messages by taking advantage of standard platform APIs [20]. It is therefore reasonable to include features related to such API calls in our feature set.

The permissions related features are keywords that map onto standard Android permissions which an app requests by declaring them in the manifest file. As observed previously (e.g. [9], [11], and [12]), certain permissions are requested more frequently by malware than others. For example, risky permissions that request the ability to read contacts, read SMS messages, send SMS messages, install packages, delete packages, access location information, etc. are commonly found to be declared in manifest files of packages containing malware.

Intents are used for intra-process and inter-process communication on Android. They are passive data structures exchanged as asynchronous messages, allowing information about events to be shared between different applications and different components of applications. For example, malware commonly listens for the BOOT_COMPLETED intent in order to trigger malicious activity immediately on booting a device.

Commands related features are keywords that detect the presence of system commands like chown, chmod, mount etc. or certain parameters which may be used with these commands. In many malicious APKs, these commands can be found embedded in hidden files within an APK and invoked to enable privilege escalation, launch concealed scripts, remove traces of malicious activities, install additional malicious components, etc. The features used for characterizing applications for the eigenspace scheme developed in this paper can be found in the appendix.

B. Model development

The goal is to apply the eigenspace approach to recognize whether a given unknown Android application is potentially malicious or not, based on how close it is determined to be to a known application in the dataset, and then assign it to the same class as the known application. A set of malware and benign applications is used to create an eigenspace for detecting new/unknown malware.

The training sets were constructed as follows:

1) Create a training set S consisting of M malware and B benign applications. Denote S = {S1, S2, ..., SK} where K = M + B
as follows:

$$\Psi = \frac{1}{K} \sum_{n=1}^{K} V_n \qquad (1)$$

5) Obtain the matrix $A = [V'_1, V'_2, \ldots, V'_K]$ where $V'_i = V_i - \Psi$, for $i = 1 \ldots K$.

6) Next, find the eigenvectors of the covariance matrix $C$, where $C = AA^T$. Note that $C$ is an $N \times N$ matrix, and since $N < K$ (the number of applications incorporated into the training set is larger than the number of features), the eigenvectors are computed directly from $C$.

7) Suppose $\mu_i$ is an eigenvector of $C$; then

$$C \mu_i = \lambda_i \mu_i \qquad (2)$$

where $\lambda_i$ is the corresponding eigenvalue.

8) Thus the set of eigenvectors $\mu = \{\mu_1, \mu_2, \ldots, \mu_N\}$ is computed and sorted in descending order of magnitude of the corresponding eigenvalues, with those having higher values being more important in describing the applications. Hence, a number $N'$ of eigenvectors are chosen to describe the eigenspace (i.e. the linear space spanned by the selected eigenvectors).

9) Each sample in the training set S is projected into the eigenspace defined by these eigenvectors by representing each one as a linear combination of eigenvectors and weights:

$$V'_i = \sum_{j=1}^{N'} w_j \mu_j, \quad \text{where } N' \le N \qquad (3)$$

10) The weights of each application in the training set can be calculated from

$$w_j = \mu_j^T V'_i, \quad j = 1, 2, \ldots, N' \qquad (4)$$

11) The weights of the application can be combined into a vector $W$, where $W^T = [w_1, w_2, \ldots, w_{N'}]$.

12) Let $D = [W_1, W_2, \ldots, W_K]$.

13) $D$ can be considered a model trained from the K samples in S, which consists of applications from both malware and benign classes.

Hence, to determine how close the new application is to an application in the training set with weight vector $W_i$, the Euclidean distance between the weight vectors is calculated to yield a score given by:

$$\text{Score} = \|W - W_i\| = \sqrt{(w_1 - w_{i,1})^2 + (w_2 - w_{i,2})^2 + \cdots + (w_{N'} - w_{i,N'})^2} \qquad (5)$$

where $w_{i,j}$ denotes the j-th component of $W_i$. By scoring the new application against each application in the training set S, represented by their respective weight vectors in $D$, we can predict the class of the application by:

$$P = \arg\min_i \|W - W_i\|, \quad i = 1 \ldots K \qquad (6)$$

If P belongs to the (labelled) malware class then the new application is predicted to be malicious; otherwise it is predicted to be benign.

IV. METHODOLOGY AND EXPERIMENTS

This section describes the methodology of the experiments that were conducted to evaluate the Android malware detection approach proposed in this paper.

A. Dataset pre-processing

The experiments were undertaken using 6,860 applications; of these, 2,925 were malware while the remaining 3,935 were benign applications. The samples were provided by McAfee (part of Intel Security). In order to extract the defining features from the applications, a bespoke Android package analysis tool was developed using Java and Python. The tool enables automated reverse engineering of Android applications to allow for the construction of feature vectors, which are subsequently arranged into a matrix of column vectors characterizing each application as explained in the previous section.

Initially, 175 features based on API calls, permissions, intents and commands related keywords were extracted for each of the applications. The features were ranked in order of relevance using the Gain Ratio criterion in WEKA [21]. Subsequently, the 100 top ranked features were selected for training the eigenspace model. The 100 features are given in the appendix in their ranking order. Thus, according to our model, we have N = 100. This pre-processing stage resulted in a 100 × 6860 matrix of feature vectors that were further processed using various Python scripts and MATLAB in order to build the eigenspace model as described in section III.
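The model-building steps of Section III and the nearest-neighbour scoring of equations (5) and (6) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' Python/MATLAB implementation: the function names `train_eigenspace` and `classify`, the binary (present/absent) feature encoding, and the toy matrix `V_toy` are all hypothetical and chosen only for the example.

```python
import numpy as np

def train_eigenspace(V, n_components):
    """Build the model from an N x K matrix V (one column per training app)."""
    psi = V.mean(axis=1, keepdims=True)      # Eq. (1): mean feature vector
    A = V - psi                              # step 5: mean-centred columns
    C = A @ A.T                              # step 6: N x N matrix C = A A^T
    eigvals, eigvecs = np.linalg.eigh(C)     # C is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1][:n_components]  # step 8: largest first
    U = eigvecs[:, order]                    # N x N' basis of the eigenspace
    D = U.T @ A                              # steps 10-12: N' x K weight matrix
    return psi, U, D

def classify(x, psi, U, D, labels):
    """Project a new app and return the label of its nearest training app."""
    w = U.T @ (np.asarray(x, dtype=float).reshape(-1, 1) - psi)
    scores = np.linalg.norm(D - w, axis=0)   # Eq. (5): one score per training app
    p = int(np.argmin(scores))               # Eq. (6): index of the closest app
    return labels[p], float(scores[p])

# Toy data invented for illustration: 4 binary features (rows), 6 apps (columns).
V_toy = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)
labels_toy = ["malware"] * 3 + ["benign"] * 3

psi, U, D = train_eigenspace(V_toy, n_components=3)
print(classify([1, 1, 0, 1], psi, U, D, labels_toy))
```

Because the class of a new application is simply the class of its nearest training sample, adding or removing a training application only changes the columns of `V` and a retrain; the choice of `n_components` (N' in the text) trades dimensionality against how faithfully distances in the original feature space are preserved.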
app in the training set using the Euclidean distances (of their projections into eigenspace represented by the weight vectors W) to derive a score for each app in the training set. The class
malware from existing families. Furthermore, eigenspace models can readily be improved over time by a self-improving mechanism that can automatically eliminate false positive clusters from the training set or include new true negative samples into the training set (to further enrich the eigenspace). The ability to dynamically control the FPR and TNR in this manner is another advantage of the eigenspace approach over other machine learning methods. The ease with which the model can be incrementally trained/re-trained to improve accuracy by including novel training samples makes it an attractive machine learning based classification technique for detecting new Android malware in practice.

VI. CONCLUSIONS

This paper presented an effective eigenspace analysis approach to Android malware detection. The proposed approach was investigated using 2,925 real malware samples and 3,935 clean samples, employing a standard cross-validation method. The eigenspace approach is based on the eigenfaces technique, which has its origins in face recognition applications. The results obtained from extensive empirical evaluations show that it is a promising scheme for Android malware detection, with 96.4% accuracy and only 3.5% false positives observed. We have also found that, compared to several popular machine learning techniques, the eigenspace approach performs quite well and at the same time can enable better usability in practical systems. Moreover, it is easily applicable in many scenarios where inference of additional knowledge, such as related malware families, may be useful.

The model proposed in this paper can be improved further. Hence, future work will investigate means of reducing detection errors, such as applying more effective filters at the feature extraction phase to improve the application characterization, or deriving and experimenting with different sets of features which could be more discriminative for classification. Another aspect for further investigation is streamlining and optimizing the eigenspace whilst still retaining high accuracy.

ACKNOWLEDGMENT

The authors are grateful to McAfee (part of Intel Security) for providing the sample applications that were used in this work.

REFERENCES

[1] Sophos mobile security threat report 2014. Available online: http://www.sophos.com/en-us/medialibrary/PDFs/other/sophos-mobile-security-threat-report.pdf?la=en [Accessed Nov. 2014]
[2] M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cognitive Neurosci., vol. 3, pp. 71-86, 1991.
[3] A. Apvrille and T. Strazzere, "Reducing the window of opportunity for Android malware: Gotta catch 'em all," Journal in Computer Virology, vol. 8, no. 1-2, pp. 61-71, 2012.
[4] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, "RiskRanker: scalable and accurate zero-day android malware detection," in Proc. of the 10th international conference on Mobile systems, applications, and services (MobiSys '12). ACM, New York, NY, USA, 2012, pp. 281-294.
[5] E. Chin, A. P. Felt, K. Greenwood, and D. Wagner, "Analyzing inter-application communication in Android," in Proc. of the 9th international conference on Mobile systems, applications, and services (MobiSys '11). ACM, New York, NY, USA, 2011, pp. 239-252.
[6] P. P. F. Chan, L. C. K. Hui, and S. M. Yiu, "DroidChecker: analyzing android applications for capability leak," in Proc. of the fifth ACM conference on Security and Privacy in Wireless and Mobile Networks (WISEC '12). ACM, New York, NY, USA, 2012, pp. 125-136.
[7] W. Dong-Jie, M. Ching-Hao, W. Te-En, L. Hahn-Ming, and W. Kuo-Ping, "DroidMat: Android malware detection through manifest and API calls tracing," in Proc. Seventh Asia Joint Conference on Information Security (Asia JCIS), 2012, pp. 62-69.
[8] S. Y. Yerima, S. Sezer, G. McWilliams, and I. Muttik, "A new Android malware detection approach using Bayesian classification," in Proc. 27th IEEE Int. Conf. on Advanced Inf. Networking and Applications (AINA 2013), Barcelona, Spain, 2013.
[9] S. Y. Yerima, S. Sezer, and G. McWilliams, "Analysis of Bayesian Classification Approaches for Android Malware Detection," IET Information Security, vol. 8, issue 1, January 2014.
[10] H. Peng, C. Gates, B. Sarma, N. Li, A. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy, "Using Probabilistic Generative Models For Ranking Risks of Android Apps," in Proceedings of the 19th ACM Conference on Computer and Communications Security (CCS 2012), Oct. 2012.
[11] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, P. G. Bringas, and G. Alvarez, "PUMA: Permission usage to detect malware in Android," International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions, in Advances in Intelligent Systems and Computing, vol. 189, pp. 289-298.
[12] X. Liu and J. Liu, "A two-layered permission-based Android malware detection scheme," in Proc. 2nd IEEE International Conference on Mobile Cloud Computing, Services and Engineering, Oxford, UK, 8-11 April 2014.
[13] J. Sahs and L. Khan, "A Machine Learning Approach to Android Malware Detection," 2012 European Intelligence and Security Informatics Conference, 2012.
[14] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, and K. Rieck, "DREBIN: Effective and explainable detection of Android malware in your pocket," NDSS 2014, February 2014, San Diego, CA, USA.
[15] A. Sharma and S. K. Dash, "Mining API Calls and Permissions for Android Malware Detection," Cryptology and Network Security. Springer International Publishing, 2014, pp. 191-205.
[16] S. Y. Yerima, S. Sezer, and I. Muttik, "Android malware detection using parallel machine learning classifiers," in Proc. 8th International Conference on Next Generation Mobile Applications, Services and Technologies (NGMAST 2014), Oxford, UK, Sept. 2014.
[17] B. Kang, B-J. Kang, J. Kim, and E. G. Im, "Android malware classification method: Dalvik bytecode frequency analysis," RACS 2013, October 1-4, 2013, Montreal, QC, Canada.
[18] M. E. Saleh, A. B. Mohamed, and A. A. Nabi, "Eigenviruses for metamorphic virus recognition," IET Inf. Secur., vol. 5, iss. 4, pp. 191-198, 2011.
[19] S. Deshpande, Y. Park, and M. Stamp, "Eigenvalue analysis for metamorphic detection," J. Comput. Virol. Hack. Tech., October 2013.
[20] McAfee threat report June 2014. Available online: http://www.mcafee.com/us/resources/reports/rp-quarterly-threat-q1-2014.pdf [Accessed Nov. 2014]
[21] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations, vol. 11, no. 1, pp. 10-18, June 2009.
APPENDIX
TABLE II. 100 TOP GAIN RATIO RANKED FEATURES USED FOR THE EIGENSPACE BASED ANDROID MALWARE DETECTION MODEL
Feature | Type | Feature | Type
SEND SMS | P | BROADCAST SMS | P
createSubprocess | API | KILL BACKGROUND PROCESSES | P
remount | CR | READ SYNC STATS | P
/system/bin/sh | CR | CAMERA | P
chown | CR | res | CR
RECEIVE SMS | P | KeySpec | API
/system/app | CR | DELETE PACKAGES | P
abortBroadcast | API | MODIFY PHONE STATE | P
pm install | CR | Ljavax crypto Cipher | API
READ SMS | P | WRITE CONTACTS | P
WRITE SMS | P | BIND INPUT METHOD | P
mount | CR | PROCESS OUTGOING CALLS | P
FACTORY TEST | P | SET WALLPAPER HINTS | P
WRITE APN SETTINGS | P | READ LOGS | P
RESTART PACKAGES | P | CALL PHONE | P
CHANGE COMPONENT ENABLED STATE | P | INTERNAL SYSTEM WINDOW | P
getSubscriberId | API | BLUETOOTH ADMIN | P
BIND REMOTEVIEWS | P | CHANGE WIFI MULTICAST STATE | P
DISABLE KEYGUARD | P | UPDATE DEVICE STATS | P
CHANGE WIFI STATE | P | RECEIVE BOOT COMPLETED | P
CLEAR APP CACHE | P | SecretKey | API
READ PHONE STATE | P | getLine1Number | API
TelephonyManager | API | BLUETOOTH | P
FindClass | API | DEVICE POWER | P
AUTHENTICATE ACCOUNTS | P | READ EXTERNAL STORAGE | P
chmod | CR | BROADCAST WAP PUSH | P
BIND WALLPAPER | P | FLASHLIGHT | P
BIND ACCESSIBILITY SERVICE | P | HARDWARE TEST | P
DELETE CACHE FILES | P | WRITE SECURE SETTINGS | P
GET PACKAGE SIZE | P | Runtime | API
READ CALL LOG | P | INTERNET | P
INSTALL PACKAGES | P | READ CONTACTS | P
GET ACCOUNTS | P | RECORD AUDIO | P
SMSReceiver | API | Intent.action.RUN | Intent
Ljava net URLDecoder | API | REBOOT | P
intent.action.BOOT COMPLETED | Intent | ACCESS LOCATION EXTRA CS | P
GLOBAL SEARCH | P | READ HISTORY BOOKMARKS | P
MANAGE ACCOUNTS | P | getNetworkOperator | API
ACCESS NETWORK STATE | P | EXPAND STATUS BAR | P
SET ORIENTATION | P | jar | CR
/system/bin | CR | DexClassLoader | API
USE CREDENTIALS | P | WRITE HISTORY BOOKMARKS | P
RECEIVE WAP PUSH | P | CHANGE NETWORK STATE | P
bindService | API | getDeviceId | API
NFC | P | STATUS BAR | P
RECEIVE MMS | P | SET WALLPAPER | P
BIND APPWIDGET | P | HttpGet init | API
Ljavax crypto spec SecretKeySpec | API | getPackageManager | API
exec | API | getCallState | API
getSimSerialNumber | API | apk | CR

P: permission
CR: Command related
API: API call related
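The Gain Ratio criterion used to rank the features in Table II can be illustrated with a short, self-contained sketch. It follows the textbook definition (information gain about the class divided by the entropy of the feature itself) and is not WEKA's code; the helper names `entropy`, `gain_ratio` and `rank_features`, as well as the four-app toy dataset, are hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, normalised by H(feature)."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == v]
        h_cond += (len(subset) / n) * entropy(subset)   # H(class | feature)
    split_info = entropy(feature)
    if split_info == 0:          # constant feature carries no information
        return 0.0
    return (entropy(labels) - h_cond) / split_info

def rank_features(columns, labels, top_k):
    """Return the names of the top_k features by descending gain ratio."""
    scored = {name: gain_ratio(col, labels) for name, col in columns.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

# Toy data invented for illustration: 1 = keyword present in the app, 0 = absent.
toy_labels = ["malware", "malware", "benign", "benign"]
toy_columns = {
    "INTERNET": [1, 0, 1, 0],
    "chmod":    [1, 1, 1, 0],
    "SEND SMS": [1, 1, 0, 0],
}
print(rank_features(toy_columns, toy_labels, top_k=2))
```

In the paper's setting the same ranking would be applied to the 175 extracted features, keeping the 100 highest scoring ones as the rows of the training matrix.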