Comparative Analysis of Feature Extraction Methods of PXC
International Journal of Computer Applications (0975 8887)
Volume 120 - No. 5, June 2015
framework of malware detection using machine learning approach, followed by malware analysis in Section 4. In this vein, we mainly focus on feature categorization based on malware analysis. These features are described in Section 5. Further, Section 6 analyzes the performance of existing malware detection systems based on feature extraction techniques on a standard dataset, which is summarized in Table II. Finally, Section 7 gives concluding remarks.

2. MOTIVATION

Network security has always been a major concern for everyone involved with the internet and for everyone using a computer system. According to the Sophos Security Threat Report 2014 [5], malware and related IT security threats have grown and matured, and the developers of malicious software have become far more creative in camouflaging their work. In 2013 there was a rise of vicious new versions of trojans and spyware. McAfee Security Labs catalogues nearly 100,000 malware versions every day, i.e. approximately one new threat per second. It is therefore urgent to know how to circumvent malware propagation. Most of the previous surveys described malware types and detection techniques. In [6], Saeed et al. gave an overview of malware and its detection systems, while in [1] Shabtai et al. presented a state-of-the-art survey on machine learning techniques employing static features. Hence, here we give an abstract view of recently formulated malware detection systems. The prime motivation of our survey is to summarize the types of widely used features for malware detection.

using machine learning exhibits three distinct stages: feature extraction, feature selection (sometimes followed by dimensionality reduction techniques), and then classification using a machine learning algorithm. This flow of the malware detection process is shown in Fig. 1. Each stage involves different measures and methods used in previously existing systems. First, a dataset consisting of malware and benign executables is prepared. These files are preprocessed depending on the feature extraction (FE) method. Next, feature selection is done to quantify the correlation of features, which improves performance and reduces the number of computations so that learning is faster. After that, the classifier is trained on the filtered results of feature selection. Researchers have adopted supervised machine learning approaches, which use classifiers such as decision trees, support vector machines, Naive Bayes, Bayesian networks, and the KNN algorithm, as mentioned in [6, 7, 1]. The classifier chosen is the one that gives the clearest margin and reduces interference and misclassification between malicious and benign executables. The dataset is then tested against the trained classifier, and each file is reported as malicious or benign software. The obtained outcomes are evaluated with the corresponding performance metrics.

4. MALWARE ANALYSIS

Malware analysis is a technique to study malware behavior and structure by extracting features that describe its malevolent intention. Several techniques have been introduced to detect unseen variants of malware. The domain of features is characterized by the way executables are analyzed. Traditionally, features are categorized on the basis of static and dynamic analysis of program files. To attain efficiency and robustness, a system adheres to the feature type that best explores a meaningful corpus of malware. In static analysis, the expected behavior of a program is determined from observations of its binary code or the internal structure of its files, without actually executing it [6]. A static feature uniquely identifies the signature of a malware sample or malware family. Static analysis is vulnerable to code obfuscation techniques. Dynamic analysis tests the program in real time by actually executing it in a controlled environment. In dynamic analysis, the behavior of malicious software is monitored in an emulated environment, and traces are obtained from the reports generated by the sandbox. It can deal with code evasion techniques [8]. However, it is resource consuming and time intensive. Further, malware detection systems have utilized a hybrid approach, which integrates static and dynamic analysis. A variety of features is derived by combining the static and dynamic approaches. This taxonomy of features based on malware analysis is depicted in Fig. 2.
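
The three-stage flow described earlier (feature extraction, feature selection, classification) can be illustrated with a small sketch. It is not taken from any of the surveyed systems. The token lists are hypothetical, the selection score (difference in how often a token appears in malicious versus benign samples) is a simple stand-in for the correlation measures mentioned in the text, and a nearest-centroid rule stands in for the classifiers named above (decision trees, SVM, Naive Bayes, KNN).

```python
from collections import Counter

def extract_features(tokens):
    """Stage 1 (feature extraction): relative frequency of each token in one sample."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def select_features(feature_dicts, labels, k):
    """Stage 2 (feature selection): keep the k tokens whose presence rate differs
    most between malicious (1) and benign (0) samples. Assumes both classes occur."""
    n_mal = sum(1 for y in labels if y == 1)
    n_ben = len(labels) - n_mal
    mal_df, ben_df = Counter(), Counter()
    for feats, y in zip(feature_dicts, labels):
        (mal_df if y == 1 else ben_df).update(feats.keys())
    vocab = set(mal_df) | set(ben_df)

    def score(tok):
        return abs(mal_df[tok] / n_mal - ben_df[tok] / n_ben)

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score(t), t))[:k]

def to_vector(feats, selected):
    """Project one sample onto the selected features."""
    return [feats.get(tok, 0.0) for tok in selected]

def train_centroids(vectors, labels):
    """Stage 3a (training): mean feature vector per class."""
    centroids = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def classify(vector, centroids):
    """Stage 3b (testing): label of the nearest centroid (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(vector, centroids[label]))

# Hypothetical training set: (token list, label), 1 = malicious, 0 = benign.
train = [
    (["push", "call", "xor", "xor", "jmp"], 1),
    (["xor", "xor", "call", "jmp", "jmp"], 1),
    (["mov", "add", "mov", "ret", "call"], 0),
    (["mov", "mov", "add", "ret", "ret"], 0),
]
labels = [y for _, y in train]
feature_dicts = [extract_features(tokens) for tokens, _ in train]
selected = select_features(feature_dicts, labels, k=4)
centroids = train_centroids([to_vector(f, selected) for f in feature_dicts], labels)

def predict(tokens):
    return classify(to_vector(extract_features(tokens), selected), centroids)

print(selected, predict(["xor", "jmp", "call", "jmp"]), predict(["mov", "add", "ret"]))
```

Keeping the stages as separate functions mirrors the flow in the text: any one stage (for example the selection score or the classifier) can be swapped without touching the others.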
the frequencies of features extracted. Features are chosen such that they attain maximum classification accuracy. The time required to obtain features from the input dataset also depends on the feature extraction method.

The feature extraction method affects the performance of the system in terms of efficiency, robustness, and accuracy. Schultz et al. [9] first introduced the notion of applying machine learning techniques to the detection of malware based on the respective representation of files from the dataset. They employed three FE methods, and later researchers extended this idea of feature extraction to improve the performance and accuracy of the system. Following the aforementioned research background, the features are described as follows:

malicious behavior. First, all executable files in the dataset are disassembled and their opcodes are extracted. An opcode (short for operational code) is the assembly language instruction that describes the operation to be performed. An instruction contains an opcode and, optionally, operands upon which the operation acts. Depending on the CPU architecture, the operands on which an opcode operates may be registers, values stored in memory or on the stack, etc. Opcodes perform arithmetic operations, logical operations, and data manipulation. Opcode statistics are capable of capturing the variability between malicious and legitimate software.

Moskovitch et al. [16] presented the mean accuracy of combinations of n-gram opcode sequences. They stated that the 2-gram opcode sequence gave the best classification accuracy among the n-gram sequences, while accuracy decreased for sequences longer than bigrams. Santos et al. [13, 15] used opcode sequences for categorizing malicious and benign files with different feature selection and classification algorithms. In [13, 28], opcode sequences of 1-grams and 2-grams were used for detecting new variants of malware families. They used histograms of each n-gram sequence, calculating the frequency of similarity ratio for each malware instance. Sekar et al. [29] compared the n-gram approach with a Finite State Automaton (FSA) approach. They evaluated the two approaches on the httpd, ftpd, and nfsd server programs, where the FSA approach resulted in a lower false positive rate than the n-gram approach.
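
As a minimal sketch of the opcode n-gram features discussed above, the following code slides a window of length n over an opcode sequence and computes the relative frequency of each n-gram. The opcode list is hypothetical and stands in for the output of a disassembler; the surveyed papers may normalize or weight the counts differently.

```python
from collections import Counter

def opcode_ngrams(opcodes, n):
    """Return all contiguous n-grams of an opcode sequence as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

def ngram_frequencies(opcodes, n):
    """Relative frequency of each opcode n-gram (empty dict if the sequence is too short)."""
    grams = opcode_ngrams(opcodes, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}

# Hypothetical opcode trace of one disassembled executable.
sample = ["push", "mov", "push", "mov", "call", "ret"]

unigram_tf = ngram_frequencies(sample, 1)
bigram_tf = ngram_frequencies(sample, 2)
print(bigram_tf)
```

Each executable then becomes a sparse vector indexed by n-gram, which is the form the feature selection and classification stages consume. Larger n grows the vocabulary quickly, which is one plausible reason the reported accuracy drops beyond bigrams.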
classifies malware using SVM based on interpretable string features. It outperformed existing antivirus software, achieving better accuracy and efficiency using string features.

(5) Function Based Features
Function based features are extracted from the runtime behavior of the program file. They take the functions that reside in a file for execution and use them to produce various attributes representing the file. Dynamically analyzed function calls include system calls, Windows application programming interface (API) calls, their parameter passing, information flow tracking, instruction sets, etc. These functions increase code reusability and maintainability. They are a semantically richer representation. Any malicious software that executes or replicates invokes some kernel-level system call to communicate with the operating system, and this is a sign of malicious activity. The works in [22, 21, 25] addressed automatic behavior analysis, using Windows API calls, instruction sets, control flow graphs, function parameter analysis, and system calls as features.

The work in [31] presented an automated malware detection system that classifies malware into families by monitoring network behavior. It creates a behavior graph from the network traces obtained, which represents network activities and their network flow dependencies. The graph structure, the in-degree and out-degree of nodes, and the root denote the features of malware activity. As per [31, 24], J48 decision trees gave better TPR, FPR, and accuracy results in comparison with other classifiers. Firdausi et al. [24] proposed a malware detection system that monitors the behavior of malicious files in a controlled environment using a free online dynamic analysis tool named Anubis. The generated reports are then parsed into a vector model for classification by the trained classifier. The performance was tested on a small dataset of benign and malicious files, with and without feature selection. The accuracies of 92.3% and 96.8%, with and without feature selection respectively, achieved by the J48 classifier were better than those of the other classifiers: SVM, KNN, and Naive Bayes. In [20], Tian et al. presented an automated classification system that uses API call sequences as features to discriminate malware from cleanware, achieving an accuracy of 97% over a dataset of malware and cleanware.

Biley et al. [26] investigated an antivirus (AV) technique that eliminates the drawbacks of earlier AV products and achieves consistency, conciseness, and completeness across malware. System state changes describe the malware behavior fingerprint in terms of files, registry, process creation, network flows, etc. It uses clustering and classification of malware samples. However, the virtualized environment was static. Rieck et al. [20] proposed an automatic behavior analysis system that gives an incremental and timely defense method for clustering and classifying malware binaries with similar behavior and for identifying novel classes of malware using machine learning. It avoids runtime overhead and gives accurate discrimination of novel malware.

Park et al. [23] presented a malware detection system that uses system calls and their parameter values as features and generates a directed subgraph for each program's behavior during execution. It creates a maximal behavior subgraph for measuring the similarity between programs and known malware families. They evaluated performance over 6 known malware families and reported fair dissimilarity rates while keeping false positives low; still, the accuracy needed to be improved, as some malware succeeds in gaining kernel privileges. Lee et al. [27] proposed a similar technique of clustering malware families using a supervised machine learning technique. It also analyzes the behavior of sample datasets according to system calls and parameters in a virtual environment and generates a behavior profile for network activities. They then computed similarities between those profiles and grouped the samples by applying k-medoids clustering.

(6) Hybrid Analysis Features
These features are obtained by combining both techniques
                                                                            Performance Metrics
                                                                            (High Accuracy, TPR, & Low FPR is better)
Feature Type  Feature Signature                                             TPR    Accuracy (%)  FPR
------------  ------------------------------------------------------------  -----  ------------  -----
Static        Opcode n-gram + Byte Code n-gram [16]                         -      95            0.06
              Opcode n-gram [11]                                            -      99            0.03
              Opcode n-gram [13]                                            0.95   92            0.03
              Byte code n-gram + Opcode n-gram [14]                         0.95   96            0.1
              Portable Executable Header [17]                               -      99            0.05
Hybrid        Opcode n-gram + Application Programming Interface             0.97   96.22         0.07
                Function calls [2]
              Function Length Frequency + Printable String Information      0.98   97.05         0.055
                + Application Programming Interface calls [3]
              Application Programming Interface Function calls +            -      93.7          0.15
                Portable Executable Header + String [19]
              Function Length Frequency + Printable String Information [4]  -      98.86         -
Dynamic       System Call [24]                                              0.95   96.8          0.04
              System state change [26]                                      -      91.6          -
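
The TPR, accuracy, and FPR columns in the table above follow the usual confusion-count definitions given in the performance evaluation section. As a minimal sketch, assuming labels are encoded as 1 for malware and 0 for benign (an encoding chosen here for illustration, not taken from the surveyed papers), they can be computed as follows. The function returns fractions, so accuracy must be multiplied by 100 to match the percentage column.

```python
def count_outcomes(y_true, y_pred):
    """Count true/false positives and negatives, treating malware (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, tn, fn

def detection_metrics(tp, fp, tn, fn):
    """Return (TPR, FPR, accuracy) as fractions in [0, 1]."""
    malware_total = tp + fn   # all malware files in the test set
    benign_total = fp + tn    # all benign files in the test set
    total = malware_total + benign_total
    tpr = tp / malware_total if malware_total else 0.0
    fpr = fp / benign_total if benign_total else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return tpr, fpr, accuracy

# Hypothetical test set: 4 malware files followed by 4 benign files.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tpr, fpr, accuracy = detection_metrics(*count_outcomes(y_true, y_pred))
print(tpr, fpr, accuracy)
```

Reporting all three values together matters because accuracy alone can hide a high FPR when the test set is dominated by one class.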
static analysis as well as dynamic analysis. This reduces the effect of countermeasures against each of the static and dynamic techniques for analyzing malware, and improves performance and detection rates. Islam et al. [3] extracted static features of functions, namely function length frequency (FLF) and printable string information (PSI), based on functions of different lengths and the number of distinct printable strings present in unpacked malware executables. They further extracted Application Programming Interface (API) function calls and parameters by dynamic analysis. They obtained superior accuracy by combining the function based features and string features. Similarly, a combination of string and function features is used for the classification of malware in [4]. They used different function length frequency ranges and printable string information, which performed better over the seen malware set. Santos et al. [2] introduced a hybrid approach that eliminates the need for separate static and dynamic malware analysis, using both emulation (Qemu) and simulation (Wine) techniques to attain transparency without interference to the system. They extracted opcode sequences statically and Windows API calls dynamically, characterizing behavior in groups such as system information, persistence, file creation, process or thread creation, adding registry keys, errors, etc. This method employed classification algorithms such as KNN, SVM, decision trees, Bayesian networks, etc. to discriminate between malware and benign software. It provided more accurate results, leading to a notable increase in performance metrics.

6. PERFORMANCE EVALUATION

Every malware detection system is obliged to provide a timely, high-precision defense against cyber-attacks caused by malware. Performance is evaluated using classical metrics such as classification accuracy, False Positive Rate (FPR), and True Positive Rate (TPR), together with the least processing time. TPR is the ratio of the number of correctly detected malware files to the total number of malware files in the testing set. FPR is the ratio of the number of benign files detected as malware to the total number of benign files in the testing set. The efficiency and robustness of a system are defined by high accuracy, high TPR, and low FPR; such a system is effective in real-life scenarios.

The aforementioned studies evaluated their systems on a standard dataset consisting of two sets of executables: benign and malicious. The malicious executables are downloaded from the VXheavens website [30], which covers malware such as viruses, adware, worms, Trojan horses, etc. Here we provide a comparative assessment of performance measures over the results generated by the systems on the malware dataset. Table II gives an overview of the referenced malware detection systems. Our review yielded the following insights. First, we observed that systems using opcode and PE features adhere to low FPR and high accuracy, i.e. above 95% with some fluctuations [11, 14, 17, 18]. However, they were unable to cope with packed executables, and disassembly of executables is not always feasible.

The PE-miner approach in [17] was robust and reliable against packed executables in real time, with low processing overheads. Behavioral features such as API call and system call tracing are effective on zero-day malware, but they increase the FPR, which can undermine the efficacy of the system. Combining the features in a single method steps up performance and provides accuracy up to 99%, along with high TPR and low FPR. Features based on dynamic analysis are less vulnerable to code evasion techniques. Though features based on dynamic analysis are the best indicators of malware, they are time consuming and resource intensive. Hence, precise and effective results are achieved by the hybrid approach, which eliminates the loopholes of each method. In [2, 3, 4], malware detection systems employing hybrid features showed high accuracy and TPR in comparison with those using static or dynamic features.

7. CONCLUSION

This paper gives an overview of malware detection techniques based on static, dynamic, and hybrid analysis of executables. We presented a comparative assessment of features and illuminated their effect on the performance of the system. We found that high accuracy and TPR can be achieved by selecting an appropriate feature extraction method. Although opcode and PE features enhanced the speed and accuracy of malware detection systems, they give rise to false positives. Hybrid analysis features maintain a low false positive rate and yield precise results in the least processing time. The methods used for malware classification should be able to
deal with the huge number of malware variants emerging daily while preserving the performance and accuracy of the system in real time.

8. REFERENCES

[1] A. Shabtai, R. Moskovitch, Y. Elovici, C. Glezer, Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey, Information Security Technical Report 14, 2009.

[2] Santos, I., Devesa, J., Brezo, F., Nieves, J. and Bringas, P.G. (2013) OPEM: A Static-Dynamic Approach for Machine Learning Based Malware Detection, Proceedings of International Conference CISIS12-ICEUTE12, Special Sessions, Advances in Intelligent Systems and Computing, 189, 271-280.

[3] R. Islam, R. Tian, Lynn M. Batten, S. Versteeg, Classification of malware based on integrated static and dynamic features, Journal of Network and Computer Applications 36, 646-656, 2013.

[4] Islam R, Tian R, Batten L, Versteeg S. Classification of malware based on string and function feature selection, Cybercrime and Trustworthy Computing Workshop (CTC) 2010: 9-17.

[5] Sophos Labs, Security Threat Report 2014.

[6] I.A. Saeed, A. Selamat, Ali M. A. Abuagoub, A Survey on Malware and Malware Detection Systems, International Journal of Computer Applications, Volume 67, No. 16, April 2013.

[7] Mathur, K. and Hiranwai, S. A Survey on Techniques in Detection and Analyzing Malware Executables. International Journal of Advanced Research in Computer Science and Software Engineering, 2013, 3: 422-428.

[8] Ekta Gandotra, Divya Bansal, Sanjeev Sofat, Malware Analysis and Classification: A Survey, Department of Computer Science and Engineering, PEC University of Technology, Chandigarh, India, Journal of Information Security, 2014, 5, 56-64, Published Online April 2014 in SciRes, i.org/10.4236/jis.2014.5-2006.

[9] Schultz, M., Eskin, E., Zadok, F., Stolfo, Data mining methods for detection of new malicious executables. In: Proceedings of the 22nd IEEE Symposium on Security and Privacy. (2001) 38-49.

[10] Tony Abou-Assaleh, Nick Cercone, Vlado Keselj, and Ray Sweidan. Detection of new malicious code using n-grams signatures. In Proceedings of Second Annual Conference on Privacy, Security and Trust, pp. 193-196, 2004.

[11] R. Moskovitch, C. Feher, N. Tzachar, E. Berger, M. Gitelman, S. Dolev and Y. Elovici. Unknown Malcode Detection Using OPCODE Representation. Proc. of the 1st European Conference on Intelligence and Security Informatics (EuroISI08), 2008.

[12] W. Li, K. Wang, S. Stolfo, B. Herzog. Fileprints: Identifying file types by n-gram analysis. Proc. of the IEEE Workshop on Information Assurance and Security, 2005.

[13] Moskovitch R, Stopel D, Feher C, Nissim N, Elovici Y. Unknown malcode detection via text categorization and the imbalance problem. In: IEEE Intelligence and Security Informatics, Taiwan; 2008.

[14] I. Santos, F. Brezo, X. Ugarte-Pedrero, P. G. Bringas, Opcode sequences as representation of executables for data-mining-based unknown malware detection, Information Sciences, vol. 231, pp. 64-82, 2013.

[15] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici, Detecting unknown malicious code by applying classification techniques on opcode patterns, Security Informatics, vol. 1, pp. 1-22, 2012.

[16] I. Santos, F. Brezo, J. Nieves, Y. K. Penya, B. Sanz, C. Laorden, and P. G. Bringas, Opcode-sequence-based malware detection, in Proc. 2nd Int. Symp. Eng. Secure Software and Syst. (ESSoS), Pisa, Italy, vol. LNCS 5965, pp. 35-43, Feb. 3-4, 2010.

[17] M. Z. Shafiq, S. M. Tabish, F. Mirza, and M. Farooq, Pe-miner: Mining structural information to detect malicious executables in realtime, in Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection, ser. RAID 09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 121-141.

[18] Mikhail Zolotukhin, Timo Hamalainen, Support Vector Machine Integrated with Game-Theoretic Approach and Genetic Algorithm for the Detection and Classification of Malware, Globecom 2013 IEEE Workshop - First International Workshop on Security and Privacy in Big Data.

[19] Y. Ye, L. Chen, D. Wang, T. Li, Q. Jiang, and M. Zhao, Sbmds: an interpretable string based malware detection system using svm ensemble with bagging, Journal in Computer Virology, vol. 5, no. 4, pp. 283-293, 2009.

[20] Rieck, K., Trinius, P., Willems, C. and Holz, T. (2011) Automatic Analysis of Malware Behavior Using Machine Learning. Journal of Computer Security, 19, 639-668.

[21] Tian R, Batten L, Islam R, Versteeg S. An automated classification system based on the strings of Trojan and virus families, In: Proceedings of the 4th international conference on malicious and unwanted software: MALWARE 2009; 2009. p. 23-30.

[22] Tian, R., Islam, M.R., Batten, L. and Versteeg, S. (2010) Differentiating Malware from Cleanwares Using Behavioral Analysis, Proceedings of 5th International Conference on Malicious and Unwanted Software (Malware), Nancy, October 2010, 23-30.

[23] Park, Y., Reeves, D., Mulukutla, V. and Sundaravel, Fast Malware Classification by Automated Behavioral Graph Matching. Proceedings of the 6th Annual Workshop on Cyber Security and Information Intelligence Research, Article No. 45, 2010.

[24] Firdausi, I., Lim, C. and Erwin, Analysis of Machine Learning Techniques Used in Behavior Based Malware Detection, Proceedings of 2nd International Conference on Advances in Computing, Control and Telecommunication Technologies (ACT), Jakarta, 2010, 201-203.

[26] Biley, M., Oberheid, J., Andersen, J., Morley Mao, Z.,
Jahanian, F. and Nazario, Automated Classification and Analysis
of Internet Malware, Proceedings of the 10th International
Conference on Recent Advances in Intrusion Detection, 4637,
178-197.