Analysis Study of Malware Classification Portable Executable Using Hybrid Machine Learning
2021 International Conference on Intelligent Cybernetics Technology & Applications (ICICyTA)
978-1-6654-1777-8/21/$31.00 ©2021 IEEE
Abstract—Malware is a malicious program that executes destructive functions to destroy the resources in a computer system, gain financial benefits, steal private and confidential data, and use computing resources to make a service unavailable. One way to prevent malware attacks is to detect Portable Executable (PE) malware files using machine learning. However, not all machine learning algorithms perform optimally in detecting a malware PE file, because some have weaknesses that lower their detection performance. These shortcomings can be reduced by combining two or more different individual algorithms into one hybrid machine learning algorithm, so that the advantages of some individual algorithms cover the shortcomings of others. Therefore, this research studies the performance of hybrid machine learning algorithms in detecting malware PE files. The hybrid machine learning algorithm uses the voting classifier method with LightGBM, XGBoost, and Logistic Regression as its base models. This research shows that the hybrid machine learning algorithm produces a higher recall value than the ensemble algorithm LightGBM. The hybrid machine learning algorithm produces the highest recall value, 99.5026%, while the LightGBM algorithm produces a recall value of 99.4480%. The recall values of the other base models are 99.5004% for the XGBoost algorithm and 98.0539% for the Logistic Regression algorithm.
Index Terms—malware detection, hybrid machine learning, voting classifier, ensemble, PE file
I. INTRODUCTION
Malware is a malicious program that executes destructive functions. It can be used to destroy existing resources in computer systems, obtain financial gain, steal private and confidential data, and exploit computing resources to make a service unavailable. Therefore, there is a need for a way to prevent malware attacks. One such way is to detect malware files entering the user's device using machine learning [1]. According to [2], many malware files use the Portable Executable (PE) format because almost all files that Windows can execute are in PE format.
Based on reference [3], machine learning is a subfield of artificial intelligence that enables computer systems to learn from examples, data, and experience without being explicitly programmed. For this reason, many studies use machine learning as an automated technique to detect malware [1]. The first step before detecting malware with machine learning is to determine which method will be used to analyze the malware, so that detection is more accurate. Malware analysis methods fall into two groups: static and dynamic malware analysis. According to [4], [5], static malware analysis is preferable to dynamic malware analysis because it can cover the shortcomings of dynamic analysis and is highly effective at analyzing large-scale data.
Based on several previous studies, individual machine learning algorithms consistently produce different accuracy values in detecting malware [4], [6]. One of the leading causes is that each algorithm has its own strengths and weaknesses, which affect the final accuracy of malware detection. One way to overcome the low accuracy of an individual machine learning algorithm is to combine two or more different machine learning algorithms. This combination of algorithms is called hybrid machine learning. With hybrid machine learning, the weaknesses of a single algorithm can be covered by the advantages of the other algorithms [7].
This research focuses on comparing the performance of an ensemble algorithm and the proposed hybrid machine learning algorithm in detecting malware PE files. The ensemble algorithm used as a reference was the LightGBM algorithm used in the Bodmas dataset research; that dataset contains PE file bytes from 2019 to 2020 [5]. The hybrid machine learning algorithm used in this research was a voting classifier with the soft voting method, composed of the LightGBM, XGBoost, and Logistic Regression algorithms. It is expected that the hybrid machine learning algorithm can produce a higher performance value than the ensemble machine learning algorithm.
II. LITERATURE REVIEW
Reference [5] proposed the use of a gradient boosting decision tree model, LightGBM, to classify the Bodmas dataset and analyzed the results. LightGBM showed an accuracy above 98% on the top-2 and top-3 of known PE file malware families only, while other algorithms
used as a comparison in that research had a lower accuracy value than LightGBM. The study stated that the other algorithms had much worse accuracy than LightGBM because they were not fine-tuned and because their performance in detecting unknown malware families was poor. It showed that LightGBM performed better than the other algorithms used for comparison.
Reference [8] proposed the use of a gradient boosting decision tree model, LightGBM, to classify the EMBER dataset and analyzed the results. LightGBM had a ROC AUC value exceeding 99.911% and a detection rate exceeding 92.99% at a false positive rate below 0.1%, while the J48 algorithm compared against LightGBM performed much worse, with a 53% false positive rate and an 8% false negative rate. The study stated that the J48 algorithm performed worse than LightGBM because of dataset bias, stale training data, or both. It showed that LightGBM performed better than the J48 algorithm.
Reference [9] proposed the use of XGBoost to predict protein interaction sites. IHT-XGBoost showed better performance in handling overlapping data, with the highest accuracy among the machine learning methods compared. The accuracy of IHT-XGBoost in that study was 80.7%. The study stated that the other algorithms performed much worse than IHT-XGBoost because they lack an imbalance treatment strategy that can improve the prediction of interaction between protein sites.
Reference [10] proposed a hybrid model combining a Convolutional Neural Network (CNN) and XGBoost to predict the popularity of social posts. As shown in that study, the proposed hybrid model performed very well on overlapping data. The evaluation values of this hybrid model were 0.7406 for Spearman's Rho, 2.7293 for MSE, and 1.2475 for MAE. The study attributed the hybrid model's advantage over the compared algorithms to its use of the high-level features extracted by the CNN.
Reference [11] proposed an imbalanced data classifier, the confusion-matrix-based kernel logistic regression (CM-KLOGR), which is built on the harmonic mean of evaluation criteria derived from a confusion matrix. The study found that CM-KLOGR outperformed the other classifiers, KLOGR and support vector machine (SVM), and that CM-KLOGR can increase the values of the evaluation criteria.
Reference [12] compared hybrid machine learning with its base models, Random Forest (RF), Multilayer Perceptron (MLP), and Extreme Gradient Boost (XGB), in classifying people as patients or healthy persons. The study shows that the proposed hybrid machine learning achieves an accuracy of 87.33%, while its base classifiers achieve 86.67% for MLP, 84.67% for RF, and 85% for XGB. It shows that the hybrid machine learning algorithm outperforms the other algorithms used for comparison.
A. Malware
Malware is a malicious program that executes destructive functions that can destroy existing resources in computer systems, obtain financial gain, steal private and confidential data, and exploit computing resources to make a service unavailable [1].
B. Portable Executable (PE) File
According to [8], [13], the PE file format is the standard format for executable files on Windows operating systems and describes the dominant formats used by the Windows operating system.
C. Bodmas Dataset
The Bodmas dataset contains the entire byte structure of PE files collected from 2019 to 2020. It consists of 57293 malware samples and 77142 benign samples; thus, the total number of PE files in the Bodmas dataset is 134435 samples [5].
D. Ensemble Machine Learning
According to [14]–[16], ensemble machine learning combines the results of two or more predictions of the same machine learning classification algorithm to improve on the performance of a single machine learning classification algorithm. Ensemble machine learning is also referred to as a meta-learning method.
E. Hybrid Machine Learning
According to [12], [16], [17], a hybrid machine learning algorithm combines two or more different machine learning algorithms. The goal of the combination is to compensate for one algorithm's shortcomings with the benefits of the other individual algorithms.
F. Voting Classifier
A voting classifier is a meta-classifier that combines similar or different machine learning classifiers into one classifier using majority voting. There are two majority voting methods: hard voting and soft voting [18].
Hard Voting
Hard voting is the simplest majority voting method. In this method, the class label $\hat{y}$ is predicted by majority voting over the classifiers $C_j$, as shown in (1).
$$\hat{y} = \operatorname{mode}\{C_1(x), C_2(x), \ldots, C_m(x)\} \tag{1}$$
Soft Voting
In Soft Voting, class labels are predicted based on the
G. XGBoost
XGBoost uses the principle of gradient boosting, which
combines several weak learners into a strong learner. In
XGBoost, however, the construction of each tree is parallelized,
so trees are built faster than in conventional gradient boosting.
XGBoost also tends to predict better than conventional gradient
boosting because it controls model complexity and reduces
overfitting through its built-in regularization [3].
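As an illustration of how this regularization enters training, the regularized objective from the original XGBoost formulation can be written as below. The symbols are not defined in this paper and are introduced here only for illustration: $l$ is a differentiable loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization hyper-parameters.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda\,\lVert w \rVert^{2}
```

The first sum rewards fitting the training labels, while the second penalizes trees with many leaves or large leaf weights, which is the complexity control referred to above.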
H. LightGBM
LightGBM is a Gradient Boosting Decision Tree (GBDT)
algorithm that makes use of Gradient-based One-Side
Sampling (GOSS) and Exclusive Feature Bundling (EFB)
[19]. The GBDT algorithm is a tree ensemble algorithm that
generates one regression tree per iteration by fitting the
residuals of the trees that came before it [20].
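The idea that each new tree is fit to what the previous trees left unexplained (the residuals) can be sketched in a few lines. This is a minimal illustration of plain gradient boosting for squared loss with depth-1 trees, not LightGBM itself: it omits GOSS and EFB, and all function names, data, and hyper-parameter values are hypothetical.

```python
# Minimal gradient boosting sketch for squared loss using depth-1 trees (stumps).
# Illustrative only: names, data, and settings are assumptions, not LightGBM's code.

def fit_stump(xs, residuals):
    """Find the threshold and two leaf values that best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right leaf is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up one-dimensional regression data with a step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

base, stumps, preds = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, preds)
print(baseline_mse, boosted_mse)
```

Each round lowers the training error because the new stump is fit only to the error that remains after the earlier rounds; LightGBM follows the same additive scheme but changes how samples and features are selected when growing each tree.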
models underwent hyper-parameter tuning, and the tuning results for the proposed base models were used to design the hybrid machine learning algorithm.
3) Testing The Hybrid Machine Learning Algorithm That Has Been Designed
At this stage, the performance of the proposed hybrid machine learning algorithm was tested against the compared algorithm, i.e., the LightGBM algorithm. The test compared the accuracy, precision, recall, and F1-score values of the proposed hybrid machine learning algorithm with those of the LightGBM algorithm. If the performance value of the proposed hybrid machine learning algorithm was greater than or equal to that of LightGBM, the proposed design was declared successful.
4) Designing Malware Detection by Applying The Proposed Hybrid Machine Learning Algorithm to The Malware Detection System
At this stage, malware detection was designed by applying the proposed hybrid machine learning algorithm to the malware detection system. This was done by creating the hybrid machine learning algorithm and then applying it to the malware PE file detection system that had been created.
5) Writing The Study Report
At this stage, the author wrote a report on the research that had been carried out, following the scientific writing design method; the result of this stage is the final assignment book.
B. Metrics
The test metrics used in testing this algorithm were metrics used in previous studies, including accuracy, precision, recall, and F1-score [14]. The accuracy metric in (3) was used to evaluate the overall classification model, the precision metric in (4) was used to see how often the model's positive predictions are correct, the recall metric in (5) was used to see whether the model produces a large number of false negatives, and the F1-score metric in (6) was used to measure the harmonic mean of precision and recall [3].
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$\text{precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\text{recall} = \frac{TP}{TP + FN} \tag{5}$$
$$\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{6}$$
TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative [14]. In this research, recall was the test metric used to analyze the performance of an algorithm, because recall measures the proportion of malware PE files that are predicted to be malware out of all malware PE files, including those wrongly predicted to be benign.
IV. EXPERIMENTS
A. Visualization of The Data Distribution
In this research, the data distribution was visualized by plotting the relationship between one feature and other features (Figure 2 shows the ByteHistogram-0 feature against the ByteEntropyHistogram-0 feature) using the dtale library. Dtale combines a Flask back-end and a React front-end to make it easier to visualize and analyze the structure of pandas data [21]. The results of the dtale visualization were used to choose machine learning algorithms suitable for a dataset with that distribution. Based on the results in Figure 2, the Bodmas dataset contains data that overlap each other; this means that a literature study had to be carried out on machine learning algorithms with optimal performance in processing overlapping data. After conducting the literature study, it was found that the LightGBM, XGBoost, and Logistic Regression algorithms produced the highest performance on data distributed as in Figure 2. Therefore, this research used these three algorithms as the base models for the proposed hybrid machine learning model.
Fig. 2. Distribution of The Data Visualization.
B. Feature Selection
Feature selection using random forest feature importance selected 375 out of the 2381 features in the Bodmas dataset.
C. Accuracy, Precision, Recall, F1 Score
To analyze the accuracy, precision, recall, and F1-score results, k-fold cross-validation from scikit-learn was used to validate the metric results produced by each algorithm in this research [15], [22], [23]. The value of k used for k-fold cross-validation in this research was k = 5.
Based on the results of Table I, the hybrid machine learning algorithm does not always produce a higher value than the base model algorithms. It can be seen in Table I that the accuracy, precision, and F1-score values for LGBM + XGB +
TABLE I
METRIC CALCULATION TABLE
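To make the metric definitions in (3)-(6) concrete, the sketch below scores soft-voted predictions on a handful of samples. It is a minimal illustration under stated assumptions: soft voting is taken to be an equal-weight average of the base models' malware probabilities with a 0.5 decision threshold, the probabilities and labels are made up, and the function names are hypothetical. It does not reproduce the values in Table I.

```python
# Sketch: score soft-voted predictions with the metrics in (3)-(6).
# All probabilities and labels below are made-up illustration data.

def soft_vote(malware_probs, threshold=0.5):
    """Average the base models' malware probabilities (equal weights, an
    assumption) and predict 1 (malware) when the mean reaches the threshold."""
    mean_prob = sum(malware_probs) / len(malware_probs)
    return 1 if mean_prob >= threshold else 0


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with malware as the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score as defined in (3)-(6)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# One row per sample: hypothetical malware probabilities from three base models
# (stand-ins for LightGBM, XGBoost, and Logistic Regression).
base_model_probs = [
    (0.9, 0.8, 0.7),
    (0.6, 0.7, 0.3),
    (0.7, 0.6, 0.4),
    (0.4, 0.3, 0.2),
    (0.1, 0.2, 0.3),
    (0.2, 0.1, 0.4),
    (0.6, 0.7, 0.5),
    (0.3, 0.2, 0.1),
]
y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = malware, 0 = benign

y_pred = [soft_vote(row) for row in base_model_probs]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
print(y_pred)
print(counts)
print(scores)
```

In the second sample the third model votes benign on its own, yet the averaged probability still crosses the threshold, which is how soft voting lets confident models outweigh an uncertain one. The fourth sample is a false negative, and it is exactly this kind of miss that lowers recall. The `metrics` function assumes at least one positive prediction and one positive label; a production version would guard against division by zero.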