Detection of Advanced Malware by Machine Learning Techniques

This document discusses detecting advanced malware through machine learning techniques. It reviews previous research on using opcode frequencies and machine learning classifiers. The methodology presented extracts opcode frequency features from a dataset containing Microsoft malware and benign files. Feature selection methods are applied to the features before training machine learning classifiers like Random Forest, Decision Trees, and Naive Bayes. The goal is to accurately detect unknown malware samples.

Detection of Advanced Malware by Machine Learning Techniques

Sanjay Sharma¹, C. Rama Krishna² and Sanjay K. Sahay³

¹ M.E. Scholar, Department of Computer Science and Engineering,
² Professor and Head, Department of Computer Science and Engineering,
³ Assistant Professor, Department of Computer Science and Information System,
¹,² National Institute of Technical Teachers Training and Research, Chandigarh, India
³ BITS, Pilani, Goa Campus, India
¹ [email protected], ² [email protected], ³ [email protected]

Abstract
In today’s digital world, most anti-malware tools are signature based, which makes them ineffective at detecting advanced unknown malware, viz. metamorphic malware. In this paper, we study the frequency of opcode occurrence to detect unknown malware using machine learning techniques. For this purpose, we use the Kaggle Microsoft Malware Classification Challenge dataset. The top 20 features obtained from the Fisher score, information gain, gain ratio, chi-square and symmetric uncertainty feature selection methods are compared. We also studied multiple classifiers available in WEKA, a GUI-based machine learning tool, and found that five of them (Random Forest, LMT, NBT, J48 Graft and REPTree) detect the malware with almost 100% accuracy.
Keywords: Metamorphic, Anti-malware, WEKA, Machine Learning.
1. Introduction
A program or piece of code that is designed to penetrate a system without user authorization and take inadmissible actions is known as malicious software, or malware [1]. Malware is an umbrella term for Trojan horses, spyware, adware, worms, viruses, ransomware, etc. As cloud computing attracts more users day by day, servers store enormous amounts of user data and thereby lure malware developers. Threats and attacks have also increased with the growth of data on cloud servers. Figure 1 shows the top 10 Windows malware reported by Quick Heal [2].

Malware is classified into two categories: first-generation malware and second-generation malware. The category depends on how the malware affects the system, the functionality of the program and its growing mechanism. In the former, the structure of the malware remains the same, while in the latter the action stays the same but the structure changes after every iteration, resulting in a new structure each time [3]. This dynamic characteristic makes the malware harder to detect and quarantine. The most important techniques for malware detection are signature based, heuristic based, normalization and machine learning. In recent years, machine learning has been a popular approach among malware defenders.
In this paper, we investigate machine learning techniques for the classification of malware. In the next section, we discuss related work; section 3 describes our approach in detail, section 4 presents the experimental results and section 5 concludes the paper.
[Figure 1: bar chart, vertical axis from 0.00% to 30.00%. Legend: W32.Sality.U, Trojan.Starter.YY4, Trojan.NSIS.Miner.SD, TrojanDropper.Dexel.A5, Worm.Mofin.A3, PUA.Mindsparki.Gen, Trojan.EyeStye.A, Trojan.Suloc.A4, Worm.Necast.A3, PUA.Askcom.Gen. Bar labels: 25.09%, 22.79%, 19.71%, 8.84%, 6.39%, 4.03%, 3.91%, 3.75%, 2.80%, 2.69%.]

Figure 1. Top 10 Windows malware


2. Related Work
In 2001, Schultz et al. [4] introduced machine learning for the detection of unknown malware based on static features; for feature extraction the authors used Portable Executable (PE) features, byte n-grams and strings. In 2007, Daniel Bilar [5] introduced opcodes as a malware detector by examining the opcode frequency distribution in malicious and non-malicious files. In 2007, Elovici et al. [6] used Portable Executable (PE) features and the Fisher Score (FS) method for feature selection with an Artificial Neural Network (5-grams, top 300, FS), a Bayesian Network (BN) (5-grams, top 300, FS), a Decision Tree (5-grams, top 300, FS), a BN (using PE) and a Decision Tree (using PE), and obtained 95.8% accuracy. In 2008, Moskovitch et al. [7] used the filter approach for feature selection. They used Gain Ratio (GR) and Fisher Score for feature selection with Artificial Neural Network (ANN), Decision Tree (DT), Naïve Bayes (NB), AdaBoost.M1 (Boosted DT and Boosted NB) and Support Vector Machine (SVM) classifiers, and obtained 94.9% accuracy.
Also in 2008, Moskovitch et al. [8] presented an approach that used opcode n-grams (n = 1 to 6) as features with the Document Frequency (DF), GR and FS feature selection methods. They used the ANN, DT, Boosted DT (BDT), NB and Boosted NB classification algorithms; of these, ANN, DT and BDT performed best while preserving a low false positive rate.
In 2011, Santos et al. [9] observed that supervised learning requires labelled data, so they proposed semi-supervised learning to detect unknown malware. In 2011, Santos et al. [10] also worked with the frequency of appearance of operational codes. They used the information gain method for feature selection and several classifiers, i.e. DT, k-nearest neighbor (KNN), Bayesian Network and Support Vector Machine (SVM); among them, SVM performed best, with 92.92% accuracy for opcode sequences of length one and 95.90% for opcode sequences of length two. In 2012, Shabtai et al. [11] used opcode n-gram patterns as features, and to identify the best features they used the Document Frequency (DF), G-mean and Fisher Score methods. Of the many classifiers they used, Random Forest performed best, with 95.146% accuracy.
In 2016, Sharma and Sahay [12] presented a novel approach to identify unknown malware with high accuracy. They analyzed the occurrence of opcodes and grouped the executables. The authors studied thirteen classifiers found in the WEKA machine learning tool; of these, Random Forest, LMT, NBT, J48 and FT were examined in depth and achieved more than 96.28% malware detection accuracy. In 2016, Sahay and Sharma [13] grouped executables on the basis of malware size using the optimal k-means clustering algorithm and used these groups as promising features for training the classifiers (NBT, J48, LMT, FT and Random Forest) to identify unknown malware. They found that the proposed approach detects unknown malware with accuracy up to 99.11%.
Recently, some authors have worked on the malware dataset released for the Kaggle challenge [14]. In 2016, Ahmadi et al. [15] took the Microsoft malware dataset and used hex dump-based features (n-grams, metadata, entropy, image representation and string length) as well as features extracted from the disassembled files (metadata, symbol frequency, opcodes, registers, etc.) with the XGBoost classification algorithm. They reported ~99.8% detection accuracy. In 2017, Drew et al. [16] used the Super Threaded Reference-Free Alignment-Free N-sequence Decoder (STRAND) classifier to classify polymorphic malware. They presented an ASM sequence model and obtained accuracy greater than 98.59% using 10-fold cross-validation.
3. Methodology
A flow chart of our approach to detecting unknown malware with machine learning is shown in Fig. 2. It includes preprocessing of the dataset, selection of promising features, training of the classifiers and detection of advanced malware.

Figure 2. Flow Chart for Malware Detection


3.1 Building the dataset
Microsoft released approximately half a terabyte of data for the Kaggle Microsoft Malware Classification Challenge (2015) [14], containing malware (21653 assembly codes). We downloaded the malware dataset from Kaggle and collected benign programs (7212 files) for the Windows platform from our college’s lab, checking them with virustotal.com. In our experiment, we found that scalability becomes an issue as the dataset grows: time complexity and storage requirements increase and system performance decreases. To overcome these issues, the dataset must be reduced. Two approaches can be used for data reduction, viz. Instance Selection (IS) and Feature Selection (FS). In our approach, Instance Selection is used to reduce the number of instances (rows) in the dataset by selecting the most appropriate instances, while Feature Selection is used to select the most relevant attributes (features). These two approaches are very effective for data reduction because they filter and clean noisy data, which reduces storage and time complexity and improves the accuracy of the classifiers [17] [18].
3.2 Data Preparation
Earlier studies [12] found that opcodes give a more meaningful representation of the code, so in the proposed approach we use opcodes as features. The malware dataset contains 21653 assembly codes of malware from 9 different families, i.e., Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY and Gatak. The collected benign executables were disassembled using the objdump utility available on Linux systems to get the opcodes.
In the malware dataset, the maximum size of an assembly code file is 147.0 MB, so benign assembly files above 147.0 MB are not considered in the analysis. Earlier studies found 1808 unique opcodes [12], so our approach has 1808 features for machine learning. The frequency of each opcode in every malware and benign file is then calculated, followed by the total opcode weight of every file. We noticed that 91.3% of the malware files and 66% of the benign files have an opcode weight below 40000. So, to maintain the proportion of malware and benign files, all files with a weight under 40000 are selected. After this step, 19771 malware and 4762 benign files are left for analysis.
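The frequency and weight computation described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: it assumes tab-separated `objdump -d` style lines (address, raw bytes, instruction), and it assumes that the opcode weight of a file is the total number of opcode occurrences in it, which the text does not define explicitly. The function names and the sample disassembly are hypothetical.

```python
from collections import Counter

WEIGHT_LIMIT = 40000  # weight threshold quoted in the text


def opcode_frequencies(disassembly):
    """Count opcode mnemonics in objdump-style text (address, bytes, instruction)."""
    counts = Counter()
    for line in disassembly.splitlines():
        fields = line.split("\t")
        # Instruction lines have three tab-separated fields and start with "addr:".
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue
        tokens = fields[2].split()
        if tokens:
            counts[tokens[0]] += 1  # first token of the instruction is the mnemonic
    return counts


def opcode_weight(freqs):
    """Assumed definition: total number of opcode occurrences in one file."""
    return sum(freqs.values())


# Hypothetical four-instruction sample in objdump -d layout.
sample = (
    "0000000000401000 <main>:\n"
    "  401000:\t55                   \tpush   %ebp\n"
    "  401001:\t89 e5                \tmov    %esp,%ebp\n"
    "  401003:\t75 02                \tjne    401007\n"
    "  401005:\t89 e5                \tmov    %esp,%ebp\n"
)
freqs = opcode_frequencies(sample)
keep_file = opcode_weight(freqs) < WEIGHT_LIMIT
```

Each file then becomes one row of opcode counts, and only rows whose weight is below the limit are kept.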
The next step is to remove noisy data from the malware. For that, we counted the malware and benign files in intervals of 500 of opcode weight. In every interval that contains no benign files, the malware files are deleted. In the same way, further intervals of 100, 50, 10 and 2 of opcode weight are created, as shown in Fig. 3, to remove the noise from the malware. Finally, the dataset contains 6010 malware and 4573 benign files.
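The interval-based noise removal can be sketched as below. This is a sketch under stated assumptions: each file is represented only by its opcode weight, intervals are aligned at multiples of the interval width starting from zero, and the widths are applied one after another from coarse to fine. The function name and the toy weights are hypothetical.

```python
def drop_malware_without_benign(malware_weights, benign_weights,
                                widths=(500, 100, 50, 10, 2)):
    """Keep only malware whose weight interval also contains a benign file."""
    kept = list(malware_weights)
    for width in widths:
        # Intervals that contain at least one benign file at this width.
        benign_bins = {weight // width for weight in benign_weights}
        kept = [weight for weight in kept if weight // width in benign_bins]
    return kept


# Toy weights, for illustration only.
malware = [120, 730, 733, 1999]
benign = [100, 732, 1502]
remaining = drop_malware_without_benign(malware, benign)
```

In this toy run only the malware file with weight 733 survives, because it shares an interval with a benign file at every width down to 2.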

Figure 3. Opcode Weight Interval Over Period of 50


3.3 Feature Selection
Feature selection is an important part of machine learning. In the proposed approach there are 1808 features, many of which do not contribute to the accuracy and may even decrease it. For our problem, reducing the number of features is crucial to maintaining accuracy. We therefore first used Fisher Score (FS) [19] for feature selection and later studied four more feature selection techniques. All five feature selection methods employed here follow the filter approach [20]. In this approach, the correlation of each feature with the class (malware or benign) is quantified and its contribution to classification is calculated. Unlike the wrapper approach, the filter approach is independent of any classification algorithm, which allows the performance of different classifiers to be compared. We used Fisher Score (FS), Information Gain (IG), Gain Ratio (GR), Chi-Square (CS) and Symmetrical Uncertainty (US). Based on these feature selection measures, we selected the top 20 features, as shown in Table 1.
Table 1. Top 20 Features
Rank  Information Gain  Gain Ratio  Symmetrical Uncertainty  Fisher Score  Chi-Square
1     jne               jne         jne                      je            jne
2     je                je          je                       jne           je
3     dword             dword       dword                    start         dword
4     retn              retn        retn                     cmpl          retn
5     jnz               jnz         jnz                      retn          jnz
6     jae               jae         jae                      dword         jae
7     offset            offset      offset                   test          offset
8     jz                movl        movl                     cmpb          movl
9     movl              cmpl        jz                       xor           jz
10    cmpl              jz          cmpl                     jae           cmpl
11    int               movzwl      movzwl                   movzwl        int
12    movzwl            movb        movb                     ret           movzwl
13    movb              sete        sete                     jbe           movb
14    sete              int3        int3                     movl          sete
15    int3              testb       testb                    andl          int3
16    testb             setne       cmpb                     lea           testb
17    setne             cmpb        setne                    cmp           cmpb
18    cmpb              andl        andl                     testb         setne
19    andl              incl        incl                     incl          andl
20    incl              movzbl      movzbl                   setne         incl
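A filter-style ranking such as the Fisher Score column can be sketched as follows. The exact variant of the score is not stated in the text; this sketch assumes the two-class form |mean_malware - mean_benign| / (std_malware + std_benign), which is in the spirit of the measure in [19]. The helper names and the three-feature toy data are hypothetical.

```python
from statistics import mean, pstdev


def fisher_score(malware_values, benign_values):
    """Assumed two-class score: |difference of means| / (sum of standard deviations)."""
    gap = abs(mean(malware_values) - mean(benign_values))
    spread = pstdev(malware_values) + pstdev(benign_values)
    if spread == 0:
        # Constant in both classes: perfectly separating if the means differ.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def top_features(malware_rows, benign_rows, names, k=20):
    """Rank features by score, independently of any classifier (filter approach)."""
    scores = {
        name: fisher_score([row[j] for row in malware_rows],
                           [row[j] for row in benign_rows])
        for j, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)[:k]


# Toy opcode-count rows (one row per file), for illustration only.
names = ["jne", "mov", "nop"]
malware_rows = [[10, 5, 1], [12, 6, 1], [14, 4, 1]]
benign_rows = [[2, 5, 1], [4, 6, 1], [3, 4, 1]]
ranking = top_features(malware_rows, benign_rows, names, k=2)
```

Because each feature is scored on its own against the class label, the same ranking can be handed to any of the classifiers compared later.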

3.4 Training of the Classifiers


After feature selection, the next step is to find the best classifier for the detection of advanced malware by comparing different classifiers on FS, IG, GR, CS and US using the top 20 features. We studied nine classifiers available in WEKA, viz. Decision Stump, Logistic Model Tree (LMT), Random Forest, J48, REPTree, Naïve Bayes Tree (NBT), J48 Graft, Random Tree and Simple CART. WEKA is an open-source GUI-based machine learning tool. We ran all these classifiers on each feature selection technique, using 10-fold cross-validation to train the classifiers. Fig. 4 shows the accuracy of each classifier for each feature selection method. From Fig. 4 it is clear that the Fisher Score method is the best of all, giving 100% accuracy with Random Forest, LMT, NBT and Random Tree. So in our approach, Fisher Score performs better than the other methods, viz. Information Gain (IG), Gain Ratio (GR), Symmetrical Uncertainty and Chi-Square.
[Figure 4: bar chart. Horizontal axis: Classifier. Vertical axis: Accuracy, from 99.82 to 100.02. Series: InfoGain, GainRatio, Symmetrical Uncertainty, FisherScore, Chi Square.]

Figure 4. Accuracy of Classifiers concerning different Feature Selection Methods
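The 10-fold cross-validation used for this comparison can be illustrated with a small index-splitting sketch. It is not WEKA's implementation: the stratification by class, the fixed seed and the function names are assumptions made for illustration, and the classifier itself is left out.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly the same class mix."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices); each fold is held out exactly once."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Toy label vector, for illustration only.
labels = ["malware"] * 60 + ["benign"] * 40
splits = list(cross_validation_splits(labels, k=10))
```

A classifier would be trained on each `train` list and scored on the matching `test` list, and the ten scores averaged to give one accuracy figure per classifier and feature selection method.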

3.5 Unknown Malware Detection


In the previous section, we noted that Random Forest, LMT, NBT, J48 Graft and Random Tree achieved the highest accuracy, so we selected these five classifiers for in-depth analysis. We randomly selected 3005 malware and 2286 benign programs, which is 50% of the overall dataset. Table 2 shows the results of the top 5 classifiers.
4. Experimental Results
As mentioned in section 3, the malware is already in assembly code, so only the benign programs are disassembled. The opcode occurrence is then calculated for all malware and benign programs. Next, noise is removed from the malware data by creating opcode weight intervals, i.e. 500, 100, 50, 10, 5 and 2, for malware and benign files. In any interval with no benign files, the malware files are deleted. To find the dominant features and remove irrelevant ones, we used five feature selection methods and found that 20 features dominate the classification process. Fig. 4 shows that Fisher Score outperforms the other four feature selection methods.
The top five classifiers, viz. Random Forest, LMT, Random Tree, J48 Graft and REPTree, were analyzed in WEKA to find their effectiveness in terms of True Positive Ratio (TPR), True Negative Ratio (TNR), False Positive Ratio (FPR), False Negative Ratio (FNR) and Accuracy, defined as
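$$\mathrm{TPR} = \frac{\text{True Positive}}{\text{Total Malware}}, \qquad \mathrm{TNR} = \frac{\text{True Negative}}{\text{Total Benign}}, \qquad \mathrm{FPR} = \frac{\text{False Positive}}{\text{Total Benign}}, \qquad \mathrm{FNR} = \frac{\text{False Negative}}{\text{Total Malware}}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TM + TB} \times 100$$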

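Where
True Positive (TP): the number of malware files correctly detected.
True Negative (TN): the number of benign files correctly detected.
False Positive: the number of benign files identified as malware.
False Negative: the number of malware files identified as benign.
TM and TB: the total number of malware and benign files, respectively.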
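The ratios defined above follow directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name and the example counts are hypothetical and are not results from this study.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute TPR, TNR, FPR, FNR (as fractions) and Accuracy (in percent)."""
    total_malware = tp + fn  # every malware file is either detected or missed
    total_benign = tn + fp   # every benign file is either cleared or flagged
    return {
        "TPR": tp / total_malware,
        "TNR": tn / total_benign,
        "FPR": fp / total_benign,
        "FNR": fn / total_malware,
        "Accuracy": (tp + tn) / (total_malware + total_benign) * 100,
    }


# Hypothetical counts: 100 malware and 100 benign test files.
metrics = detection_metrics(tp=95, tn=90, fp=10, fn=5)
```

With these definitions TPR + FNR = 1 and TNR + FPR = 1, since both pairs share a denominator.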
Table 2 shows the results obtained by the top 5 classifiers. The study shows that the accuracy of the five selected classifiers is nearly the same.
Table 2. Performance of Top 5 Classifiers with Fisher Score Feature Selection Method
Classifiers    True Positive  False Negative  False Positive  True Negative  Accuracy
Random Forest  100%           0               0               100%           100%
LMT            100%           0               0               100%           100%
NBT            100%           0               0               100%           100%
J48 Graft      100%           0               0               100%           100%
REPTree        99.98%         0.04%           0.05%           99.95%         99.96%

5. Conclusion
In this paper, we have presented an approach based on opcode occurrence to improve the detection accuracy of unknown advanced malware. Code obfuscation, which advanced malware uses to evade anti-malware tools, is a challenge for signature-based techniques. The proposed approach uses the Fisher Score method for feature selection and five classifiers to uncover unknown malware. In the proposed approach, Random Forest, LMT, J48 Graft and NBT detect malware with 100% accuracy, which is better than the accuracy (99.8%) reported by Ahmadi et al. [15]. In future, we will apply the proposed approach to different datasets and perform an in-depth analysis for the classification of advanced malicious software.
Acknowledgement
Mr. Sanjay Sharma is thankful to Dr. Lini Methew, Associate Professor, and Dr. Rithula Thakur, Assistant Professor, Department of Electrical Engineering, for providing computer lab assistance from time to time.
References
1. A. Sharma and S. K. Sahay, “Evolution and Detection of Polymorphic and Metamorphic Malware: A Survey,” International
Journal of Computer Application, vol. 90, no. 2, pp. 7–11, 2014.
2. E. S. Solutions and Q. Heal, “Quick Heal Quarterly Threat Report | Q1 2017,” 2017. URL: http://www.quickheal.co.in/resources/threat-reports [Accessed: 13-June-2017].
3. A. Govindaraju, “Exhaustive Statistical Analysis for Detection of Metamorphic Malware,” Master’s project report, Department
of Computer Science, San Jose State University, 2010.
4. M. G. Schultz, E. Eskin, and S. J. Stolfo, “Data Mining Methods for Detection of New Malicious Executables,” 2001.
5. D. Bilar, “Opcodes As Predictor for Malware,” International Journal of Electronic Security and Digital Forensics, vol. 1, no. 2,
pp. 156–168, 2007.
6. Y. Elovici, A. Shabtai, R. Moskovitch, G. Tahan, and C. Glezer, “Applying Machine Learning Techniques for Detection of
Malicious Code in Network Traffic,” Annual Conference on Artificial Intelligence. Springer Berlin Heidelberg, pp. 44–50, 2007.
7. R. Moskovitch, D. Stopel, C. Feher, N. Nissim, N. Japkowicz, and Y. Elovici, “Unknown malcode detection and the imbalance
problem,” Journal in Computer Virology, vol. 5, no. 4, pp. 295–308, 2009.
8. R. Moskovitch et al., “Unknown malcode detection using OPCODE representation,” Intelligence and Security Informatics. Springer Berlin Heidelberg, vol. 5376 LNCS, pp. 204–215, 2008.
9. I. Santos, J. Nieves, and P. G. Bringas, “Semi-supervised learning for unknown malware detection,” International Symposium on
Distributed Computing and Artificial Intelligence. Springer Berlin Heidelberg, vol. 91, pp. 415–422, 2011.

10. I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode sequences as representation of executables for data-mining-based unknown malware detection,” Information Sciences, vol. 231, pp. 64–82, 2013.
11. A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici, “Detecting unknown malicious code by applying classification
techniques on OpCode patterns,” Security Informatics, vol. 1, no. 1, p. 1, 2012.
12. A. Sharma and S. K. Sahay, “An effective approach for classification of advanced malware with high accuracy,” International
Journal of Security and its Applications, vol. 10, no. 4, pp. 249–266, 2016.
13. S. K. Sahay and A. Sharma, “Grouping the Executables to Detect Malwares with High Accuracy,” Procedia Computer Science,
vol. 78, no. June, pp. 667–674, 2016.
14. Kaggle, “Microsoft Malware Classification Challenge (BIG 2015),” Microsoft. URL: https://www.kaggle.com/c/malware-classification [Accessed: 10-December-2016].
15. M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel Feature Extraction, Selection and Fusion for
Effective Malware Family Classification,” ACM Conference Data Application Security Priv., pp. 183–194, 2016.
16. J. Drew, M. Hahsler, and T. Moore, “Polymorphic malware detection using sequence classification methods and ensembles,”
EURASIP J. Inf. Secur., vol. 2017, no. 1, p. 2, 2017.
17. J. Derrac, S. García, and F. Herrera, “A first study on the use of co evolutionary algorithms for instance and feature selection,”
International Conference on Hybrid Artificial Intelligence Systems. Springer Berlin Heidelberg, pp. 557–564, 2009.
18. A. L. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence, vol. 97, no. 1–2, pp. 245–271, 1997.
19. T. R. Golub et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,”
Science, vol. 286, no. 5439, pp. 531–537, 1999.
20. T. G. Dietterich, “Machine learning in ecosystem informatics and sustainability,” IJCAI, pp. 8–13, 2009.
