Detection of Advanced Malware by Machine Learning Techniques
Abstract
Most anti-malware tools in use today are signature based, which makes them ineffective against advanced unknown
malware such as metamorphic malware. In this paper, we study the frequency of opcode occurrence to detect
unknown malware using machine learning techniques. For this purpose, we use the Kaggle Microsoft Malware
Classification Challenge dataset. We compare the top 20 features obtained from the Fisher score, information gain,
gain ratio, chi-square and symmetric uncertainty feature selection methods. We also study multiple classifiers
available in WEKA, a GUI-based machine learning tool, and find that five of them (Random Forest, LMT, NBT,
J48 Graft and REPTree) detect the malware with almost 100% accuracy.
Keywords: Metamorphic, Anti-malware, WEKA, Machine Learning.
1. Introduction
A program or piece of code that is designed to penetrate a system without user authorization and to take inadmissible
actions is known as malicious software, or malware [1]. Malware is an umbrella term covering Trojan horses, spyware,
adware, worms, viruses, ransomware, etc. As cloud computing attracts more users every day, servers store enormous
amounts of user data, which in turn lures malware developers. Threats and attacks have increased along with the
volume of data on cloud servers. Figure 1 shows the top 10 Windows malware reported by Quick Heal [2].
Malware is classified into two categories: first-generation malware and second-generation malware. The category
depends on how the malware affects the system, the functionality of the program and its growth mechanism. In the
former, the structure of the malware remains the same; in the latter, the action stays the same but the structure
changes after every iteration, producing a new structure each time [3]. This dynamic characteristic makes
second-generation malware harder to detect and quarantine. The most important techniques for malware detection are
signature-based, heuristic-based, normalization and machine learning. In recent years, machine learning has become
a popular approach among malware defenders.
In this paper, we investigate machine learning techniques for the classification of malware. Section 2 discusses
related work, Section 3 describes our approach in detail, Section 4 presents the experimental results and Section 5
concludes the paper.
[Figure 1. Top 10 Windows malware reported by Quick Heal [2]. Bar chart with a percentage axis; legend entries
include W32.Sality.U, Trojan.Starter.YY4, Trojan.NSIS.Miner.SD, TrojanDropper.Dexel.A5, Worm.Mofin.A3,
PUA.Mindsparki.Gen, Trojan.EyeStye.A and Trojan.Suloc.A4.]
\[
\text{Accuracy} = \frac{TP + TN}{TM + TB} \times 100
\]
where
TP (True Positive): the number of malware samples correctly detected as malware.
TN (True Negative): the number of benign samples correctly identified as benign.
FP (False Positive): the number of benign samples identified as malware.
FN (False Negative): the number of malware samples identified as benign.
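As a worked illustration of the accuracy formula above, the following minimal Python sketch computes the percentage
from raw counts. It assumes that TM and TB denote the total number of malware and benign samples in the test set;
that reading, the function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def accuracy_percent(tp: int, tn: int, total_malware: int, total_benign: int) -> float:
    """Accuracy (%) = (TP + TN) / (TM + TB) * 100.

    Assumption: TM and TB are the total malware and total benign sample counts.
    """
    total = total_malware + total_benign
    if total <= 0:
        raise ValueError("at least one sample is required")
    if not (0 <= tp <= total_malware and 0 <= tn <= total_benign):
        raise ValueError("TP and TN cannot exceed their class totals")
    # Multiply before dividing so simple cases stay exact in floating point.
    return 100.0 * (tp + tn) / total


# Hypothetical counts, chosen only to illustrate the arithmetic.
print(accuracy_percent(tp=95, tn=90, total_malware=100, total_benign=100))  # 92.5
```

With these counts, 185 of 200 samples are classified correctly, which gives 92.5%.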
Table 2 shows the results obtained by the top five classifiers. The accuracies of the five selected classifiers are
nearly identical.
Table 2. Performance of the Top 5 Classifiers with the Fisher Score Feature Selection Method
Classifier       True Positive   False Negative   False Positive   True Negative   Accuracy
Random Forest    100%            0%               0%               100%            100%
LMT              100%            0%               0%               100%            100%
NBT              100%            0%               0%               100%            100%
J48 Graft        100%            0%               0%               100%            100%
REPTree          99.98%          0.04%            0.05%            99.95%          99.96%
5. Conclusion
In this paper, we have presented an approach based on opcode occurrence to improve the detection accuracy for
unknown advanced malware. Code obfuscation, which advanced malware uses to evade anti-malware tools, is a
challenge for signature-based techniques. The proposed approach uses the Fisher score method for feature selection
and five classifiers to uncover unknown malware. With this approach, Random Forest, LMT, J48 Graft and NBT
detect malware with 100% accuracy, which is better than the accuracy (99.8%) reported by Ahmadi et al. (2016) [15].
In future work, we will apply the proposed approach to different datasets and perform an in-depth analysis of the
classification of advanced malicious software.
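The feature selection step summarized above can be illustrated with a minimal sketch. It ranks opcode-frequency
features by one common definition of the Fisher score (between-class scatter divided by within-class scatter); the
exact variant used in the experiments is not stated in this section, so the formula, the function names and the toy
data below are assumptions for illustration only.

```python
from statistics import fmean, pvariance


def fisher_scores(rows, labels):
    """Return one Fisher score per feature column.

    rows:   list of equal-length lists of opcode frequencies (one list per file)
    labels: class label of each row, e.g. "malware" or "benign"
    """
    classes = sorted(set(labels))
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        overall_mean = fmean(column)
        between = 0.0  # class-size-weighted spread of class means
        within = 0.0   # class-size-weighted variance inside each class
        for c in classes:
            values = [v for v, y in zip(column, labels) if y == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # No variation inside any class: perfectly separating or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(names, scores, k):
    """Return the k feature names with the highest scores."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical opcode frequencies for four files (not from the dataset).
opcodes = ["mov", "push", "xor"]
rows = [[10, 5, 1], [12, 6, 9], [2, 5, 2], [4, 6, 8]]
labels = ["malware", "malware", "benign", "benign"]

scores = fisher_scores(rows, labels)
print(top_k_features(opcodes, scores, k=1))
```

In this toy data only the "mov" frequency differs between the two classes, so it receives the highest score; in the
paper's setting the same ranking would be applied to the full opcode set and the top 20 features kept.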
Acknowledgement
Mr. Sanjay Sharma is thankful to Dr. Lini Methew, Associate Professor, and Dr. Rithula Thakur, Assistant Professor,
Department of Electrical Engineering, for providing computer lab assistance from time to time.
References
1. A. Sharma and S. K. Sahay, “Evolution and Detection of Polymorphic and Metamorphic Malware: A Survey,” International
Journal of Computer Application, vol. 90, no. 2, pp. 7–11, 2014.
2. E. S. Solutions and Q. Heal, “Quick Heal Quarterly Threat Report | Q1 2017,” 2017.
URL: http://www.quickheal.co.in/resources/threat-reports [Accessed: 13 June 2017].
3. A. Govindaraju, “Exhaustive Statistical Analysis for Detection of Metamorphic Malware,” Master’s project report, Department
of Computer Science, San Jose State University, 2010.
4. M. G. Schultz, E. Eskin, and S. J. Stolfo, “Data Mining Methods for Detection of New Malicious Executables,” 2001.
5. D. Bilar, “Opcodes As Predictor for Malware,” International Journal of Electronic Security and Digital Forensics, vol. 1, no. 2,
pp. 156–168, 2007.
6. Y. Elovici, A. Shabtai, R. Moskovitch, G. Tahan, and C. Glezer, “Applying Machine Learning Techniques for Detection of
Malicious Code in Network Traffic,” Annual Conference on Artificial Intelligence. Springer Berlin Heidelberg, pp. 44–50, 2007.
7. R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz, and Y. Elovici, “Unknown malcode detection and the imbalance
problem,” Journal in Computer Virology, vol. 5, no. 4, pp. 295–308, 2009.
8. R. Moskovitch et al., “Unknown malcode detection using OPCODE representation,” Intelligence and Security Informatics.
Springer Berlin Heidelberg, vol. 5376 LNCS, pp. 204–215, 2008.
9. I. Santos, J. Nieves, and P. G. Bringas, “Semi-supervised learning for unknown malware detection,” International Symposium on
Distributed Computing and Artificial Intelligence. Springer Berlin Heidelberg, vol. 91, pp. 415–422, 2011.
10. I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode sequences as representation of executables for data-mining-
based unknown malware detection,” Information Sciences, vol. 231, pp. 64–82, 2013.
11. A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici, “Detecting unknown malicious code by applying classification
techniques on OpCode patterns,” Security Informatics, vol. 1, no. 1, p. 1, 2012.
12. A. Sharma and S. K. Sahay, “An effective approach for classification of advanced malware with high accuracy,” International
Journal of Security and its Applications, vol. 10, no. 4, pp. 249–266, 2016.
13. S. K. Sahay and A. Sharma, “Grouping the Executables to Detect Malwares with High Accuracy,” Procedia Computer Science,
vol. 78, no. June, pp. 667–674, 2016.
14. Kaggle, “Microsoft Malware Classification Challenge (BIG 2015),” Microsoft.
URL: https://www.kaggle.com/c/malware-classification [Accessed: 10 December 2016].
15. M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel Feature Extraction, Selection and Fusion for
Effective Malware Family Classification,” ACM Conference Data Application Security Priv., pp. 183–194, 2016.
16. J. Drew, M. Hahsler, and T. Moore, “Polymorphic malware detection using sequence classification methods and ensembles,”
EURASIP J. Inf. Secur., vol. 2017, no. 1, p. 2, 2017.
17. J. Derrac, S. García, and F. Herrera, “A first study on the use of co evolutionary algorithms for instance and feature selection,”
International Conference on Hybrid Artificial Intelligence Systems. Springer Berlin Heidelberg, pp. 557–564, 2009.
18. A. L. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence,
vol. 97, no. 1–2, pp. 245–271, 1997.
19. T. R. Golub et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,”
Science, vol. 286, no. 5439, pp. 531–537, 1999.
20. T. G. Dietterich, “Machine learning in ecosystem informatics and sustainability,” IJCAI, pp. 8–13, 2009.