
A Comparative Analysis of Malware Anomaly Detection
Priynka Sharma, Kaylash Chaudhary, Michael Wagner, M.G.M Khan

School of Computing, Information and Mathematical Sciences,


The University of the South Pacific, Suva, Fiji.

[email protected], [email protected],
[email protected], [email protected]

Abstract - We propose a classification model with various machine learning algorithms to reliably distinguish malware files from clean (non-infected) files, with the objective of minimising the number of false positives. Malware anomaly detection systems are security components that monitor network and system activities for malicious behaviour, and they are becoming essential for keeping information systems protected with high reliability. The objective of malware anomaly detection is to model the normal behaviour of applications so that attacks can be recognised through their deviations from it. In this paper, we present machine learning strategies for malware detection that distinguish normal from harmful activity on a system. The data analysis was carried out with the WEKA tool on a figshare dataset, applying the four most successful algorithms to the preprocessed data through cross-validation [1]. Garrett's Ranking Technique was used to rank the classifiers by their performance. The results suggest that the Instance-Based Learner (IBK) classifier is the most successful.

Keywords: Anomaly, malware, data mining, machine learning, detection, analysis

1 Introduction

Malware has continued to grow in volume and sophistication, posing remarkable dangers to the security of computing machines and services. This has motivated the increased use of machine learning to improve malware anomaly detection. The history of malware shows that this malicious threat has been with us since the beginning of computing itself [2]. The concept of a computer virus goes back to 1949, when the eminent computer scientist John von Neumann wrote a paper on how a computer program could reproduce itself [3]. During the 1950s, workers at Bell Labs gave life to von Neumann's idea when they created a game called "Core Wars", in which developers released software "life forms" that vied for control of the computer system. From these basic and benign beginnings, a gigantic and malicious industry
was born [3]. Today, malware has infected a third of the world's computers [4]. Cybersecurity Ventures reports that losses due to cybercrime, including malware, are forecast to reach $6 trillion per year by 2021 [4]. Malware is software designed to infiltrate or damage a computer system without the owner's informed consent. Many strategies have been developed to safeguard against different kinds of malware.
Among these, Malware Anomaly Detection (MAD) is the most promising strategy for protecting against dynamic anomalous behaviour. A MAD system classifies information into categories known as normal and abnormal [5]. Different classification algorithms have been proposed to design a powerful detection model [6] [7]. The performance of the classifier is a significant factor influencing the performance of the MAD model; hence, choosing an accurate classifier improves the performance of the malware detection framework. In this work, classification algorithms have been assessed using the WEKA tool. Four different classifiers have been evaluated in terms of Accuracy, Receiver Operating Characteristic (ROC) area, Kappa, training time, False Positive Rate (FPR) and Recall. Ranks have additionally been assigned to these algorithms by applying Garrett's ranking technique [8].
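For reference, Garrett's ranking technique is commonly stated as follows (this is the standard formulation in the literature; the exact presentation in [8] may differ). Each rank is first converted to a percent position,

  \text{Percent position}_{ij} = \frac{100\,(R_{ij} - 0.5)}{N_j}

where R_{ij} is the rank given to the i-th classifier under the j-th criterion and N_j is the number of classifiers ranked under that criterion. The percent positions are then mapped to Garrett scores using Garrett's conversion table, and the classifiers are ordered by their mean Garrett scores.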
The rest of the paper is organised as follows. The next two subsections discuss anomaly detection and classification algorithms. Section 2 discusses related work, whereas Section 3 presents the chosen dataset and describes the WEKA tool and the classification algorithms. Results and discussion are presented in Section 4. Finally, Section 5 concludes this research.

1.1 Anomaly Detection


Anomaly detection is a technology that uses artificial intelligence to recognise unusual behaviour in a data set. Datch frameworks characterise anomaly detection as 'a strategy used to distinguish irregular instances in a complex domain'. Ultimately, anomaly detection spots patterns in a way that a human reader cannot, bridging the gap between metrics and business processes to provide more efficiency [9]. Ever since the rise of big data, enterprises of all sizes have been in a state of vulnerability.
There are two phases in anomaly-based detection: Phase 1 is training and Phase 2 is detecting [10] [11]. In the first phase, the machine learns attack behaviour, and in the second phase it detects abnormal behaviour. A key advantage of anomaly-based detection is its ability to detect zero-day attacks [12]. Its limitations are a high false alarm rate and the difficulty of deciding which features to use for detection during the training phase [13]. Anomaly detection tackles these problems in numerous diverse ways, as depicted in Figure 1 below:
Figure 1. Anomaly detection on various problem-solving domains

Anomaly detection can reach down to the level of fine-grained data, catching small irregularities that users observing datasets on a dashboard cannot see. Therefore, the best way to obtain continuous awareness of new data patterns is to apply a machine learning technique.
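To make the two-phase idea above concrete, the following minimal Java sketch trains a profile on known-normal feature vectors and then flags instances that deviate from it beyond a threshold. It is purely illustrative (the class name, features and threshold are invented for this example) and is not the classifier pipeline evaluated later in this paper.

  public class SimpleAnomalyDetector {
      private double[] mean;     // per-feature mean learned in the training phase
      private double threshold;  // distance beyond which an instance is flagged

      // Phase 1: training - learn a profile of normal behaviour.
      public void train(double[][] normalSamples, double threshold) {
          int nFeatures = normalSamples[0].length;
          mean = new double[nFeatures];
          for (double[] sample : normalSamples)
              for (int j = 0; j < nFeatures; j++)
                  mean[j] += sample[j] / normalSamples.length;
          this.threshold = threshold;
      }

      // Phase 2: detecting - flag instances far from the learned profile.
      public boolean isAnomalous(double[] sample) {
          double dist = 0.0;
          for (int j = 0; j < sample.length; j++)
              dist += Math.pow(sample[j] - mean[j], 2);
          return Math.sqrt(dist) > threshold;
      }

      public static void main(String[] args) {
          double[][] normal = { {0.1, 0.2}, {0.0, 0.3}, {0.2, 0.1} };
          SimpleAnomalyDetector d = new SimpleAnomalyDetector();
          d.train(normal, 1.0);
          System.out.println(d.isAnomalous(new double[] {5.0, 4.0})); // prints true
      }
  }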

1.2 Classification Algorithms


Classification algorithms in data mining are widely applied in anomaly detection systems to separate attacks from ordinary activity in the system. Classification algorithms take a supervised learning approach; that is, they require labelled classes for training before making a forecast. WEKA organises classifiers into essentially eight categories, each containing diverse machine learning algorithms (a minimal sketch showing how representatives of these categories are instantiated follows the descriptions below). These categories are:

Bayes Classifier: Also known as Belief Networks, these belong to the family of probabilistic Graphical Models (GMs), which are used to represent knowledge about uncertain domains. In the graph, nodes denote random variables and edges denote probabilistic dependencies. A Bayes classifier predicts the class from the values of the features [14].

Function Classifier: Develops the ideas of neural networks and regression. Eighteen classifiers fall under this category. The Radial Basis Function (RBF) Network and Sequential Minimal Optimization (SMO) are two classifiers which perform well on the dataset used in this paper. RBF classifiers can represent any nonlinear function effectively and do not use raw information; the drawback of RBF is its tendency to overtrain the model [15].

Lazy Classifier: Requires the complete training data to be stored. While the model is being built, new examples are not incorporated into the training set by these classifiers. It is mostly used for classification on data streams [16].

Meta Classifier: Locates the ideal set of attributes for training the base classifier; the parameters tuned for the base classifier are then used for predictions. There are twenty-six classifiers in this category [8].

MI Classifier: There are twelve Multi-Instance classifiers, none of which fits the dataset used in this paper. This classifier is a variation of the supervised learning procedure. These kinds of classifiers were originally made available through a separate software package [17].

Misc (Miscellaneous) Classifier: Three classifiers fall under this category. Two of them, HyperPipes and Voting Feature Intervals (VFI), are compatible with our dataset [8].

Rules Classifier: Association rules are used to make correct class predictions among all the behaviours and are linked with a level of accuracy. They may predict more than one outcome. Rules are mutually exclusive and are learnt one at a time [17].

Trees: A popular classification technique in which a flow-chart-like tree structure is created, where every node signifies a test on an attribute value and each branch expresses an outcome of the test. Also known as Decision Trees; the leaves correspond to the predicted classes. Sixteen classifiers fall under this category [8].
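As a brief illustration of these categories, the sketch below instantiates one representative classifier from several of them using the Weka Java API (class names as found in the standard Weka 3.x distribution; the misc and multi-instance classifiers are omitted because some of them ship as optional packages in newer releases):

  import weka.classifiers.Classifier;
  import weka.classifiers.bayes.NaiveBayes;
  import weka.classifiers.functions.SMO;
  import weka.classifiers.lazy.IBk;
  import weka.classifiers.meta.AdaBoostM1;
  import weka.classifiers.rules.JRip;
  import weka.classifiers.trees.J48;

  public class ClassifierCategories {
      // One representative per category; all implement weka.classifiers.Classifier.
      public static Classifier[] candidates() {
          return new Classifier[] {
              new NaiveBayes(),   // Bayes
              new SMO(),          // Function
              new IBk(),          // Lazy (Instance-Based Learner)
              new AdaBoostM1(),   // Meta
              new JRip(),         // Rules
              new J48()           // Trees
          };
      }
  }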

2 Related Work

Numerous researchers have proposed different strategies and algorithms for anomaly detection based on data mining classification methods.
Lei Li et al. [19] present a rule-based technique which exploits comprehended patterns to recognise malicious attacks [18]. Fu et al. [20] discuss the use of data mining in an anomaly detection framework, a significant direction in Intrusion Detection System (IDS) research; the paper presents improved association-based anomaly detection built on Frequent-Pattern Growth and Fuzzy C-Means (FCM) network anomaly detection. Wenguang et al. proposed an intelligent anomaly detection framework based on web data mining and contrasted it with other conventional anomaly detection frameworks [21]. However, for a complete detection framework, some work is still left, such as improving the data mining algorithms, better handling the connection between the data mining module and other modules, improving the framework's adaptive capacity, achieving visualisation of test outcomes, and improving the real-time efficiency and precision of the framework. Likewise, Panda M. & Patra M. [22] present a study of certain data mining techniques, for instance machine learning, feature selection, neural networks, fuzzy logic, genetic algorithms, support vector machines, statistical methods and immunology-based strategies.
Table 1 presents an overview of papers that applied machine learning techniques for Android malware detection.

Table 1. An Overview of Publications That Applied ML Classifiers for Malware Detection

Publication/Year | ML Algorithms | Sample Size | Optimal Classifier
Evaluation of machine learning classifiers for mobile malware detection, 2016 | BN, MLP, J48, KNN, RF | 3,450 | RF
Using Spatio-Temporal Information in API Calls with ML Algorithms for Malware Detection, 2009 | IBK, J48, NB, Ripper, SMO | 516 | SMO
DroidFusion: A Novel Multilevel Classifier Fusion Approach for Android Malware Detection, 2018 | RT, J48, Rep Tree, VP, RF, R. Comm., R. Sub., Adaboost, DroidFusion | 3,799; 15,036; 36,183; 36,183 | DroidFusion

3 Data Set and Tool Description

The Android malware dataset from figshare consists of 215-attribute feature vectors extracted from 15,036 applications (5,560 malware applications from the Drebin project and 9,476 benign applications). This dataset has also been used to develop a multilevel classifier fusion approach [1]. Table 2 shows that the dataset contains two classes, Malware and Benign, with 5,560 instances of Malware and 9,476 instances of Benign.

Table 2. Data-Set Used for Malware Anomaly Detection

Data-Set | Features | Samples | Class: Malware | Class: Benign
Drebin | 215 | 15,036 | 5,560 | 9,476

3.1 Waikato Environment for Knowledge Analysis (WEKA)

WEKA is a data analysis tool developed at the University of Waikato, New Zealand, in 1997 [22]. It consists of several machine learning algorithms that can be used to mine data and extract meaningful information. The tool is written in Java and provides a graphical user interface for working with data files. It contains 49 data preprocessing tools, 76 classification algorithms, 15 attribute evaluators and 10 search algorithms for feature selection. It has three graphical user interfaces: the Explorer, the Experimenter and the Knowledge Flow. WEKA supports data stored in the Attribute-Relation File Format (ARFF). It offers a set of panels that can be used to perform specific tasks, and it provides the ability to develop and integrate new machine learning algorithms.
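As a minimal sketch of how such an ARFF file can be loaded programmatically with the Weka Java API (the file path is a placeholder, and we assume the class attribute is the last column, as in the Drebin ARFF file):

  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class LoadArff {
      public static Instances load(String path) throws Exception {
          DataSource source = new DataSource(path);           // reads ARFF (and other supported formats)
          Instances data = source.getDataSet();
          if (data.classIndex() == -1)
              data.setClassIndex(data.numAttributes() - 1);   // assume class (Malware/Benign) is the last attribute
          return data;
      }
  }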

3.2 Cross-Validation
Cross-validation serves the same purpose as a single holdout validation set: evaluating the model's predictive performance on unseen data. Cross-validation does this more robustly by repeating the trial multiple times, using each of the different fragments of the training set in turn as the validation set. This gives a more accurate indication of how well the model generalises to unseen data and thus helps to avoid overfitting.

Figure 2. Cross-Validation (10 folds) Method Application
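A minimal sketch of the same 10-fold cross-validation performed through the Weka Java API rather than the graphical interface (IBk is used here only as an example classifier; the random seed is arbitrary):

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.lazy.IBk;
  import weka.core.Instances;

  public class CrossValidate {
      public static void run(Instances data) throws Exception {
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(new IBk(), data, 10, new Random(1));   // 10 folds, arbitrary seed
          System.out.printf("Accuracy: %.2f%%  Kappa: %.4f%n", eval.pctCorrect(), eval.kappa());
      }
  }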

3.3 Performance Metrics Used


The performance measurements used to assess the classification strategies are derived from the confusion matrix. It contains information about the test dataset, for which the true values are known. The confusion matrix reports the prediction results as follows:

1. False Positive (FP): The model predicted a malware attack, but the instance was benign.
2. False Negative (FN): A wrong prediction; the model predicted benign, but it was a malware attack.
3. True Positive (TP): The model predicted a malware attack, and it was a malware attack.
4. True Negative (TN): The model predicted benign, and it was benign.

A confusion matrix, as shown in Figure 3, is a method for summarising the performance of a classification algorithm. Classification accuracy alone can be misleading if there is an unequal number of observations in each class or if there are more than two classes in the dataset. Computing a confusion matrix gives a better idea of what the classification model is predicting [17].
Figure 3. Confusion Matrix Application in Machine Learning

- Accuracy = (TP + TN) / n
- True Positive Rate (TPR) = TP / (TP + FN)
- False Positive Rate (FPR) = FP / (TN + FP)
- Recall = TP / (TP + FN)
- Precision = TP / (TP + FP)
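The sketch below simply evaluates these formulas from raw confusion-matrix counts; it is included only to make the definitions concrete (the class and method names are our own):

  public class Metrics {
      public static void report(double tp, double fp, double tn, double fn) {
          double n = tp + fp + tn + fn;
          double accuracy  = (tp + tn) / n;
          double tpr       = tp / (tp + fn);   // equal to Recall
          double fpr       = fp / (fp + tn);
          double precision = tp / (tp + fp);
          System.out.printf("Acc=%.4f TPR=%.4f FPR=%.4f Prec=%.4f%n",
                            accuracy, tpr, fpr, precision);
      }
  }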

4 Results and Discussions

The Android malware dataset was used to evaluate anomaly detection. We used WEKA to implement and evaluate the detection; feature ranking and file conversion to the ARFF format were also completed with the WEKA tool. In every experiment we set K = 4, where K represents the number of base classifiers used in this paper, and we took N = 10 for the cross-validation folds and the weight assignments respectively. The four base classifiers are Instance-Based Learner (IBK), Logistic, Rotation Forest and Sequential Minimal Optimization (SMO). The classifiers were evaluated in the WEKA environment using the 215 attributes extracted from 15,036 applications (5,560 malware applications from the Drebin project and 9,476 benign applications). Garrett's Ranking Technique was used to rank the classifiers according to their performance.
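A sketch of how this evaluation could be reproduced with the Weka Java API is shown below. It assumes the dataset has been loaded as in Section 3.1 and that the positive class value is named "Malware" (as in Table 2); RotationForest is left out because it ships as an optional Weka package in recent releases, and the elapsed time printed here covers the whole cross-validation rather than training alone:

  import java.util.Random;
  import weka.classifiers.Classifier;
  import weka.classifiers.Evaluation;
  import weka.classifiers.functions.Logistic;
  import weka.classifiers.functions.SMO;
  import weka.classifiers.lazy.IBk;
  import weka.core.Instances;

  public class CompareClassifiers {
      public static void compare(Instances data) throws Exception {
          // weka.classifiers.meta.RotationForest can be added here once the Rotation Forest package is installed.
          Classifier[] base = { new IBk(), new Logistic(), new SMO() };
          int positive = data.classAttribute().indexOfValue("Malware");  // assumed positive class label
          for (Classifier c : base) {
              long start = System.currentTimeMillis();
              Evaluation eval = new Evaluation(data);
              eval.crossValidateModel(c, data, 10, new Random(1));       // N = 10 folds
              System.out.printf("%s: ROC=%.3f FPR=%.3f Acc=%.2f Kappa=%.4f MAE=%.4f Recall=%.3f Prec=%.3f (%.1f s)%n",
                  c.getClass().getSimpleName(),
                  eval.weightedAreaUnderROC(), eval.falsePositiveRate(positive),
                  eval.pctCorrect(), eval.kappa(), eval.meanAbsoluteError(),
                  eval.recall(positive), eval.precision(positive),
                  (System.currentTimeMillis() - start) / 1000.0);
          }
      }
  }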

Figure 4 shows the predictive model evaluation using the Knowledge Flow interface. The ARFF loader was used to load the dataset and was connected to the "ClassAssigner" component (which allows choosing which column is the class) from the toolbar, which was then placed on the layout. The class value picker chooses a class value to be treated as the "positive" class. Next came the "CrossValidationFoldMaker" component from the Evaluation toolbar, as described in Section 3.2. Upon completion, the outcomes were obtained by selecting "show results" from the pop-up menu of the TextViewer component. Tables 3 and 4 present the evaluation results of the classifiers.

Figure 4. A Predictive Model Evaluation Using Knowledge Flow

Table 3. Results Obtained with Algorithm Based Ranking (1=Highest-Rank)

Classifier | ROC Area | FPR | Accuracy (%) | Kappa | MAE | Recall | Precision | Training Time (sec) | Rank
IBK | 0.994 | 0.013 | 98.76 | 0.9733 | 0.013 | 0.988 | 0.988 | 0.01 | 1
Rotation Forest | 0.997 | 0.020 | 98.51 | 0.9679 | 0.0333 | 0.985 | 0.985 | 98.63 | 2
SMO | 0.976 | 0.027 | 97.84 | 0.9535 | 0.0216 | 0.978 | 0.978 | 34.46 | 3
Logistic | 0.995 | 0.027 | 97.81 | 0.953 | 0.0315 | 0.978 | 0.978 | 22.44 | 4

Table 3, together with the significance results in Table 4, shows how each algorithm compares against the others on the malware detection dataset used in this paper. A "win" means an accuracy superior to that of another algorithm where the difference is statistically significant. The results show that IBK has notable success when compared with Rotation Forest, SMO and Logistic.
Table 4. Mean Accuracy (Standard Deviation) of the Base Classifiers on the Drebin Dataset

Data-Set | IBK | Rotation Forest | SMO | Logistic
Drebin - 215 | 98.76 (0.27) | 98.50 (0.32)* | 97.81 (0.37)* | 97.86 (0.39)*
* difference from IBK significant at the 0.05 level

Each algorithm was executed ten times, and the mean and standard deviation of the accuracy are shown in Table 4. The differences between the accuracy scores of Rotation Forest, SMO and Logistic and that of IBK are significant at the 0.05 level, indicating that these three techniques are statistically different from IBK. Hence, IBK leads in accuracy for malware anomaly detection on the Drebin dataset.
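The paper does not state which significance test was applied (Weka's Experimenter typically uses a corrected paired t-test); purely as an illustration of the kind of comparison behind Table 4, the sketch below computes a plain Welch t-statistic over two hypothetical sets of ten accuracy scores (the numbers are invented, not the actual per-run results):

  public class TTestSketch {
      static double mean(double[] x) {
          double s = 0; for (double v : x) s += v; return s / x.length;
      }
      static double variance(double[] x) {
          double m = mean(x), s = 0;
          for (double v : x) s += (v - m) * (v - m);
          return s / (x.length - 1);
      }
      // Welch two-sample t-statistic for the difference in mean accuracy.
      public static double tStatistic(double[] a, double[] b) {
          return (mean(a) - mean(b)) / Math.sqrt(variance(a) / a.length + variance(b) / b.length);
      }
      public static void main(String[] args) {
          // Hypothetical per-run accuracies for two classifiers (ten runs each).
          double[] ibk = {98.8, 98.7, 98.9, 98.6, 98.8, 98.7, 98.9, 98.7, 98.8, 98.7};
          double[] smo = {97.8, 97.9, 97.7, 97.8, 97.9, 97.8, 97.7, 97.9, 97.8, 97.8};
          // |t| well above roughly 2.1 (the 0.05 critical value for ~18 degrees of freedom)
          // indicates a statistically significant difference in mean accuracy.
          System.out.printf("t = %.2f%n", tStatistic(ibk, smo));
      }
  }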

5 Conclusion and Future Work

In this paper, we proposed a machine learning anomaly-based malware detection approach for an Android malware data collection, which identifies malware attacks with a very low false-positive rate (0.013 for the best classifier) and an accuracy as high as 98.76%. If this system were to become part of a highly competitive commercial product, various deterministic exception mechanisms would have to be added. In our view, malware detection by means of machine learning will not substitute the standard detection strategies used by anti-virus vendors; rather, it will come as an extension to them. Any commercial anti-virus product is subject to certain speed and memory limitations. Within those constraints, the most reliable algorithm among those introduced here is IBK.

6 Bibliography
1. Yerima Y, & Sezer S, et al. (2018) DroidFusion: A Novel Multilevel Classifier Fusion Approach
for Android Malware Detection. Journal IEEE Transactions on Cybernetics 49: 453 – 466.
2. You I, & Yim K. (2010) Malware Obfuscation Techniques: A Brief Survey. In: Proceedings of
the 5th International Conference on Broadband, Wireless Computing, Communication and Appli-
cations, Fukuoka, Japan, 4-6 November.
3. Grcar J. (2011) John von Neumann’s Analysis of Gaussian Elimination and the Origins of Modern
Numerical Analysis. Journal Society for Industrial and Applied Mathematics 53: 607–682.
4. John P & Mello J. (2014) Report: Malware Poisons One-Third of World's Computers. Retrieved
June 6, 2019, from Tech News World: https://www.technewsworld.com/story/80707.html.
5. Guofei G, & Porras A, et al. (2015) Method and Apparatus for Detecting Malware Infections.
Patent Application Publication, United States. 1-6.
6. Shamili A, & Bauckhage C, et al. (2010) Malware Detection on Mobile Devices using Distributed
Machine Learning. In: Proceedings of the 20th International Conference on Pattern Recognition,
Istanbul, Turkey, 4348-4351.
7. Hamed Y, & AbdulKader S, et al. (2019) Mobile Malware Detection: A Survey. Journal of Com-
puter Science and Information Security 17: 1-65.
8. India B, & Khurana S. (2014) Comparison of classification techniques for intrusion detection
dataset using WEKA. In: Proceedings of the International Conference on Recent Advances and
Innovations in Engineering, Jaipur, India, 9-11 May.
9. Goldstein M, & Uchida S. (2016) A Comparative Evaluation of Unsupervised Anomaly Detection
Algorithms for Multivariate Data. Journal of PLOS ONE 11: 1-31.
10. Ruff L, & Vandermeulen R, et al. (2019) Deep Semi-Supervised Anomaly Detection. ArXiv 20:
1-22.
11. Schlegl T, & Seeböck P, et al. (2017) Unsupervised Anomaly Detection with
12. Generative Adversarial Networks to Guide Marker Discovery. In: Proceedings of the International
Conference on Information Processing in Medical Imaging, Boone, United States, 25-30 June.
13. Patch A, & Park J. (2007) An Overview of Anomaly Detection Techniques: Existing Solutions
and Latest Technological Trends. The International Journal of Computer and Telecommunica-
tions Networking 51: 3448-3470.
14. Chandola V, & Banerjee A. (2009) Anomaly Detection: A Survey. Journal of ACM Computing
Surveys 50: 1557-7341.
15. Bouckaert R. (2004). Bayesian network classifiers in Weka. (Working paper series. University of
Waikato, Department of Computer Science. No. 14/2004). Hamilton, New Zealand: University of
Waikato: https://researchcommons.waikato.ac.nz/handle/10289/85.
16. Mehata R, & Bath S, et al. (2018) An Analysis of Hybrid Layered Classification Algorithms for
Object Recognition. Journal of Computer Engineering 20: 57-64.
17. Kalmegh S. (2019) Effective classification of Indian News using Lazy Classifier IB1And IBk
from weka. Journal of information and computing science 6: 160-168.
18. Pak I, & Teh P. (2016) Machine Learning Classifiers: Evaluation of the Performance in Online
Reviews. Journal of Science and Technology 45: 1-9.
19. Li L, & Yang D, et al. A novel rule-based Intrusion Detection System using data mining. In: Pro-
ceedings of the International Conference on Computer Science and Information Technology,
Chengdu, China, 9-11 July.
20. Fu D, & Zhou S, et al. (2009) The Design and Implementation of a Distributed Network Intrusion
Detection System Based on Data Mining. In: Proceedings of the WRI World Congress on Software
Engineering, Xiamen, China, 19-21 May.
21. Chai W, & Tan C, et al. (2011) Research of Intelligent Intrusion Detection System Based on Web
Data Mining Technology. In: Proceedings of the International Conference on Business Intelligence
and Financial Engineering, Wuhan, China, 17-18 October.
22. Panda M, & Patra M. (2009). Evaluating Machine Learning Algorithms for Detecting Network
Intrusions. Journal of Recent Trends in Engineering 1: 472-477.
