
Analysis Study of Malware Classification Portable Executable Using Hybrid Machine Learning

The document discusses using hybrid machine learning algorithms to classify malware in portable executable files. It proposes using a voting classifier method combined with LightGBM, XGBoost, and Logistic Regression models. The research found that the hybrid machine learning algorithm produced a higher recall value than using just LightGBM alone, demonstrating the potential benefit of combining different algorithms.


2021 International Conference on Intelligent Cybernetics Technology & Applications (ICICyTA)

Analysis Study of Malware Classification Portable Executable Using Hybrid Machine Learning

Fauzan Hikmah Ramadhan
School of Informatics, Telkom University, Bandung, Indonesia
[email protected]

Vera Suryani
School of Computing, Telkom University, Bandung, Indonesia
[email protected]

Satria Mandala
Human Centric (HUMIC) Engineering & School of Computing, Telkom University, Bandung, Indonesia
[email protected]

978-1-6654-1777-8/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICICYTA53712.2021.9689130

Abstract—Malware is a malicious program that executes destructive functions to destroy the resources in a computer system, gain financial benefits, steal private and confidential data, and use computing resources to make a service unavailable. One way to prevent malware attacks is to detect Portable Executable (PE) malware files using machine learning. However, not all machine learning algorithms perform optimally in detecting a malware PE file, because some have weaknesses that lower their detection performance. These shortcomings can be reduced by combining two or more different individual algorithms into one hybrid machine learning algorithm, so that the advantages of some individual algorithms cover the shortcomings of others. Therefore, this research studies the performance of a hybrid machine learning algorithm in detecting malware PE files. The hybrid algorithm uses the voting classifier method with LightGBM, XGBoost, and Logistic Regression as its base models. This research shows that the hybrid machine learning algorithm produces a higher recall value than the ensemble algorithm LightGBM. The hybrid algorithm produces the highest recall value, 99.5026%, while the LightGBM algorithm produces a recall value of 99.4480%. The recall values of the other base models are 99.5004% for XGBoost and 98.0539% for Logistic Regression.

Index Terms—malware detection, hybrid machine learning, voting classifier, ensemble, PE file

I. INTRODUCTION

Malware is a malicious program that executes destructive functions. It can be used to destroy existing resources in computer systems, obtain financial gain, steal private and confidential data, and take advantage of computing system resources to make a service unavailable. Therefore, there is a need for a way to prevent malware attacks. One such way is to detect malware files entering the user's device using machine learning [1]. According to reference [2], many attackers use the Portable Executable (PE) format for malware files because almost all files that Windows can execute are in PE format.

According to reference [3], machine learning is a subfield of artificial intelligence that enables computer systems to learn from examples, data, and experience without being explicitly programmed. For this reason, many studies use machine learning as an automated technique to detect malware [1]. The first step before detecting malware with machine learning is to determine which method will be used to analyze the malware, so that detection is more accurate. Malware analysis methods fall into two groups: static and dynamic malware analysis. According to references [4], [5], static malware analysis is preferable to dynamic malware analysis, as it can cover the shortcomings of dynamic analysis and is highly effective in analyzing large-scale data.

Several previous studies show that individual machine learning algorithms consistently produce different accuracy values in detecting malware [4], [6]. One of the leading causes is that each algorithm has its own strengths and weaknesses, which affect the final accuracy of malware detection. One way to overcome the low accuracy of an individual machine learning algorithm is to combine two or more different machine learning algorithms. Such a combination is called hybrid machine learning. With hybrid machine learning, the weaknesses of a single algorithm can be covered by the advantages of the other algorithms [7].

This research focuses on comparing the performance of an ensemble algorithm with that of the proposed hybrid machine learning algorithm in detecting a malware PE file. The ensemble algorithm used as a reference in this research was the LightGBM algorithm used in the Bodmas dataset research, whose dataset contains PE file bytes from 2019 to 2020 [5]. The hybrid machine learning algorithm used in this research was a voting classifier with the soft voting method, built from the LightGBM, XGBoost, and Logistic Regression algorithms. It is expected that the hybrid machine learning algorithm can produce a higher performance value than the ensemble machine learning algorithm.

II. LITERATURE REVIEW

Reference [5] proposed the use of one of the gradient boosting decision tree models, LightGBM, to predict the Bodmas dataset and analyzed the results. LightGBM showed an accuracy above 98% on the top-2 and top-3 of known PE file malware families only, while other algorithms
used as a comparison in that study had lower accuracy than LightGBM. That study stated that the other algorithms had much worse accuracy than LightGBM because they were not fine-tuned, and that performance in detecting unknown malware families was not good. It showed that LightGBM performed better than the other algorithms it was compared with.

Reference [8] proposed the use of one of the gradient boosting decision tree models, i.e., LightGBM, to predict the EMBER dataset and analyzed the results. LightGBM achieved a ROC AUC exceeding 0.99911 and a detection rate exceeding 92.99% at a false positive rate below 0.1%, while the J48 algorithm used for comparison performed much worse, with a 53% false positive rate and an 8% false negative rate. That study stated that the J48 algorithm performed worse than LightGBM because of dataset bias, stale training data, or both. It showed that LightGBM performed better than the J48 algorithm.

Reference [9] proposed the use of XGBoost to predict protein interaction sites. IHT-XGBoost handled overlapping data better than the other machine learning algorithms used for comparison and achieved the highest accuracy among them, 80.7%. That study stated that the other algorithms performed much worse than IHT-XGBoost because they lack an imbalance treatment strategy that can improve the prediction of interaction between protein sites.

Reference [10] proposed a hybrid model combining a Convolutional Neural Network (CNN) and XGBoost to predict the popularity of social media posts. As shown in that study, the proposed hybrid model performed very well on overlapping data. The evaluation values of the hybrid model were 0.7406 for Spearman's Rho, 2.7293 for MSE, and 1.2475 for MAE. That study stated that the hybrid model outperformed the other algorithms used for comparison because it exploited the high-level features extracted by the CNN.

Reference [11] proposed an imbalanced data classifier, the confusion-matrix-based kernel logistic regression (CM-KLOGR), which uses the harmonic mean of evaluation criteria derived from a confusion matrix. That study found that CM-KLOGR outperformed the other data classifiers, KLOGR and support vector machine (SVM), and that CM-KLOGR can increase the values of the evaluation criteria.

Reference [12] compared hybrid machine learning with its base models Random Forest (RF), Multilayer Perceptron (MLP), and Extreme Gradient Boost (XGB) in classifying people as patients or healthy persons. That study shows that the proposed hybrid machine learning achieves an accuracy of 87.33%, while its base classifiers achieve accuracies of 86.67% for MLP, 84.67% for RF, and 85% for XGB. It shows that the hybrid machine learning algorithm outperforms the other algorithms used for comparison.

A. Malware

Malware is a malicious program that executes destructive functions that can destroy existing resources in computer systems, obtain financial gain, steal private and confidential data, and take advantage of computing system resources to make a service unavailable [1].

B. Portable Executable (PE) File

According to references [8], [13], a PE file is the standard format for executable files on Windows operating systems and is the dominant format used by the Windows operating system.

C. Bodmas Dataset

The Bodmas dataset contains the full byte structure of PE files collected from 2019 to 2020. It consists of 57293 malware samples and 77142 benign samples, for a total of 134435 PE file samples [5].

D. Ensemble Machine Learning

According to references [14]–[16], ensemble machine learning combines the results of two or more predictions of the same machine learning classification algorithm to improve on the performance of a single machine learning classification algorithm. Ensemble machine learning is also referred to as a meta-learning method.

E. Hybrid Machine Learning

According to references [12], [16], [17], a hybrid machine learning algorithm combines two or more different machine learning algorithms. The goal of the combination is to compensate for one algorithm's shortcomings with the benefits of the other algorithms.

F. Voting Classifier

A voting classifier is a meta-classifier that combines similar or different machine learning classifiers into one classification algorithm using majority voting. There are two majority voting methods: hard voting and soft voting [18].

Hard Voting
Hard voting is the simplest majority voting method. In this method, the class label ŷ is predicted by majority voting over the predictions of the classifiers C_j. The equation can be seen in (1).

$$\hat{y} = \operatorname{mode}\{C_1(x), C_2(x), \dots, C_m(x)\} \tag{1}$$

Soft Voting
In Soft Voting, class labels are predicted based on the
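probability value p from each classifier used in the voting classifier. The equation can be seen in (2).

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \, p_{ij} \tag{2}$$

where w_j is the weight assigned to the j-th classifier and p_{ij} is the probability that the j-th classifier assigns to class i.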
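
The two voting rules in (1) and (2) can be illustrated with a minimal sketch in plain Python. This is not the implementation used in this research; the function names `hard_vote` and `soft_vote`, the uniform default weights, and the probability values are assumptions chosen only for illustration.

```python
from collections import Counter

def hard_vote(labels):
    """Equation (1): return the most frequent label among the classifiers."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    """Equation (2): return the class i that maximises sum_j w_j * p_ij.

    probabilities[j][i] is the probability classifier j assigns to class i.
    """
    m = len(probabilities)
    if weights is None:
        weights = [1.0 / m] * m  # uniform weights (an assumption)
    n_classes = len(probabilities[0])
    scores = [
        sum(w * p[i] for w, p in zip(weights, probabilities))
        for i in range(n_classes)
    ]
    return max(range(n_classes), key=scores.__getitem__)

# Hypothetical probabilities for one file; class 0 = benign, class 1 = malware.
probs = [
    [0.55, 0.45],  # classifier 1 leans benign
    [0.55, 0.45],  # classifier 2 leans benign
    [0.05, 0.95],  # classifier 3 is confident the file is malware
]
hard = hard_vote([max(range(2), key=p.__getitem__) for p in probs])
soft = soft_vote(probs)
```

In this made-up case the two rules disagree: hard voting follows the two weakly confident classifiers, while soft voting follows the summed probabilities, which the confident third classifier dominates.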

G. XGBoost
XGBoost uses the principle of gradient boosting, which
combines several weak learners into a strong learner. In
XGBoost, the construction of each tree is parallelized, so
training is faster than with conventional gradient boosting.
XGBoost also tends to predict better than conventional gradient
boosting, as it controls model complexity and reduces
overfitting through its built-in regularization [3].
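
The idea of combining weak learners into a strong learner can be sketched with plain gradient boosting on squared error. This is a simplified illustration, not XGBoost itself: it has no regularization or parallel split search, and the helper names, the one-dimensional toy data, and the hyper-parameter values are all assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump on a 1-D feature by squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
mse_base = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boost = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each stump alone is a weak predictor, but the training error of the summed model (`mse_boost`) falls below that of the constant baseline (`mse_base`).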

H. LightGBM
LightGBM is a Gradient Boosting Decision Tree (GBDT)
algorithm that makes use of Gradient-based One-Side
Sampling (GOSS) and Exclusive Feature Bundling (EFB)
[19]. The GBDT algorithm is a tree ensemble algorithm that
generates one regression tree per iteration by fitting the
residuals of the trees that came before it [20].

Gradient-based One-Side Sampling (GOSS)

When down-sampling data instances, GOSS preserves the
accuracy of the information gain estimate by keeping the
data instances whose gradients are larger than a
predefined threshold (or that fall among the top
percentiles), and it randomly drops only instances
with small gradients [19].
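
The sampling step described above can be sketched as follows. The sketch follows the procedure in [19]: keep the fraction of instances with the largest absolute gradients, randomly sample a fraction of the remainder, and up-weight the sampled small-gradient instances by (1 - a) / b so the gain estimate stays approximately unbiased. The function name `goss_sample`, the parameter names, and the gradient values are assumptions for illustration; this is not LightGBM's internal code.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the instances kept by one GOSS step."""
    n = len(gradients)
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]    # large-gradient instances are always kept
    rest = order[n_top:]   # small-gradient instances are down-sampled
    sampled = random.Random(seed).sample(rest, n_other)
    amplify = (1.0 - top_rate) / other_rate  # compensates for the dropped instances
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights

# Hypothetical gradients for 100 instances; |gradient| grows with the index.
gradients = [(-1) ** i * i / 100 for i in range(100)]
indices, weights = goss_sample(gradients)
```

With these defaults, 30 of the 100 instances are used to build the next tree: the 20 with the largest gradients plus 10 sampled from the remaining 80.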
Exclusive Feature Bundling (EFB)

By using features as vertices and adding an edge between every two features that are not mutually exclusive, EFB reduces the optimal bundling problem to a graph colouring problem [19].

LightGBM also grows trees leaf-wise rather than level-wise [3]; leaf-wise growth can overfit on small datasets, so the tree depth is typically limited.

I. Logistic Regression

Logistic regression is a linear classifier that uses the logistic (sigmoid) function to predict a probability. The output of the logistic function is a probability value that is used to assign one of two classes [3].

J. Static Malware Analysis

According to reference [1], static malware analysis is a technique that extracts static features from malware files without executing them and then analyzes the extracted features.

III. RESEARCH METHODOLOGY

A. Research Framework

The methodology used in this research is shown in Figure 1. Each stage is explained as follows:

Fig. 1. Research Framework Diagram.

1) Literature Review
At this stage, previous research needed for this research was reviewed and summarized by reading relevant journals and articles.

2) Designing The Proposed Hybrid Machine Learning Algorithm
At this stage, a hybrid machine learning algorithm was designed based on the results of the literature study. The design process started by determining the base model algorithms to be used in the hybrid machine learning algorithm. Then, all chosen base
models underwent hyper-parameter tuning, and the tuned base model algorithms were used to design the hybrid machine learning algorithm.

3) Testing The Hybrid Machine Learning Algorithm that has been Designed
At this stage, the performance of the proposed hybrid machine learning algorithm was tested against the compared algorithm, i.e., the LightGBM algorithm. The test compared the accuracy, precision, recall, and F1 score of the proposed hybrid machine learning algorithm with those of the LightGBM algorithm. If the performance value of the proposed hybrid machine learning algorithm was greater than or equal to that of LightGBM, the design was declared successful.

4) Designing Malware Detection by Applying The Proposed Hybrid Machine Learning Algorithm to The Malware Detection System
At this stage, malware detection was designed by creating the hybrid machine learning algorithm and then applying it to the malware PE file detection system that had been created.

5) Writing The Study Report
At this stage, the author wrote a report on the research that had been carried out, following the scientific writing method. The result of this stage is the final project report.

B. Metrics

The test metrics used in testing this algorithm were metrics used in previous studies: accuracy, precision, recall, and F1-score [14]. The accuracy metric in (3) was used to evaluate the machine learning classification model overall, the precision metric in (4) was used to see how often a model's positive predictions are correct, the recall metric in (5) was used to see whether the model produced a large number of false negatives, and the F1-score metric in (6) was used to measure the harmonic mean of precision and recall [3].

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{5}$$

$$F_1\text{-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{6}$$

TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative [14]. In this research, the main test metric used to analyze the performance of an algorithm was recall, because recall measures the share of malware PE files predicted as malware (TP) relative to all malware PE files, including those predicted as benign (FN).

IV. EXPERIMENTS

A. Visualization of The Data Distribution

In this research, the data distribution was visualized by plotting the relationship between one feature and another (Figure 2 plots the ByteHistogram-0 feature against the ByteEntropyHistogram-0 feature) using the dtale library. Dtale combines a Flask back-end and a React front-end to make pandas data structures easier to view and analyze [21]. The results of the dtale visualization were used to choose machine learning algorithms suitable for processing a dataset with this distribution.

Fig. 2. Distribution of The Data Visualization.

Based on the results from Figure 2, the Bodmas dataset contains data that overlap each other, so a literature study was carried out on machine learning algorithms that perform well on overlapping data. The literature study found that the LightGBM, XGBoost, and Logistic Regression algorithms produced the highest performance on data distributed as in Figure 2. Therefore, this research used these three algorithms as the base models for the proposed hybrid machine learning model.

B. Feature Selection

Feature selection using random forest feature importance selected 375 out of the 2381 features in the Bodmas dataset.

C. Accuracy, Precision, Recall, F1 Score

To analyze the accuracy, precision, recall, and F1-score metrics, k-fold cross-validation from the scikit-learn library was used to validate the test metric results generated by each algorithm in this research [15], [22], [23]. The value of k used for k-fold cross-validation in this research was k = 5.

Based on the results in Table I, the hybrid machine learning algorithm does not always produce a higher value than the base model algorithms. It can be seen in Table I that the accuracy, precision, and F1-score values for LGBM + XGB +
LR were 99.6253%, 99.6177%, and 99.5601%, respectively. One of the base model algorithms, XGB, produced the same accuracy and F1-score values and an even higher precision value (99.6199%) than the LGBM + XGB + LR algorithm. The hybrid machine learning algorithm and a base model algorithm can produce the same metric value because the soft voting classifier selects the class with the highest weighted sum of the probabilities produced by the base models, as in (2). When one base model assigns a much higher probability than the others, it dominates this sum, so the hybrid prediction often coincides with the prediction of that base model.

TABLE I
METRIC CALCULATION TABLE

Algorithm          Accuracy   Precision  Recall     F1-Score
LGBM               0.995509   0.994980   0.994480   0.994730
XGB                0.996253   0.996199   0.995004   0.995601
LR                 0.980055   0.972879   0.980539   0.976693
LGBM + XGB + LR    0.996253   0.996177   0.995026   0.995601

Fig. 3. Confusion Matrix Machine Learning LGBM.
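
The 5-fold cross-validation behind Table I can be sketched in plain Python as follows. This is only an illustration of how the folds partition the data and how per-fold scores are averaged; the research used scikit-learn, and the function names here, the contiguous unshuffled folds, and the per-fold scores in the usage line are assumptions.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k (train, test) index pairs.

    Folds are contiguous and unshuffled (an assumption of this sketch).
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds

def mean_score(fold_scores):
    """Average a metric over the folds to obtain one reported value."""
    return sum(fold_scores) / len(fold_scores)

# The Bodmas dataset has 134435 samples, so each of the 5 test folds holds 26887.
folds = k_fold_indices(134435, k=5)
test_sizes = [len(test) for _, test in folds]

# Hypothetical per-fold recall values, averaged into a single figure.
average_recall = mean_score([0.9950, 0.9949, 0.9952, 0.9951, 0.9948])
```

Every sample appears in exactly one test fold, so each reported metric is an average over five models, each evaluated on data it was not trained on.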
Recall is a metric that measures the proportion of TP results among all TP and FN results [14]. In this research, recall measured the proportion of malware PE files predicted as malware among all malware PE files, including those predicted as benign. Because predicting a malware PE file as benign is more dangerous than predicting a benign PE file as malware, this research focused on the recall metric to assess the performance of a machine learning algorithm in detecting a malware PE file. As seen in Table I, the LGBM + XGB + LR hybrid machine learning algorithm produced a recall value of 99.5026%, which is higher than the recall value of every base model algorithm. Thus, the hybrid machine learning algorithm succeeded in increasing the recall value in detecting a malware PE file.

D. Confusion Matrix

With a confusion matrix, the predictions of an algorithm can be laid out in a matrix and used to analyze the algorithm's performance. The confusion matrices in Figures 3 and 4 are divided into four cells. The cell in the first row and first column is the True Negative (TN) count, meaning that negative data is predicted as negative. The cell in the second row and first column is the False Negative (FN) count, meaning that positive data is predicted as negative. The cell in the first row and second column is the False Positive (FP) count, meaning that negative data is predicted as positive, and the cell in the second row and second column is the True Positive (TP) count, meaning that positive data is predicted as positive.

Fig. 4. Confusion Matrix Hybrid Machine Learning (LGBM + XGB + LR).

In this research, it is more dangerous for the proposed hybrid machine learning algorithm to predict malware files as benign than to predict benign files as malware, so the analysis focused on the FN cell. Based on the results from Figure 4, the proposed hybrid machine learning algorithm produced 11408 TP predictions and 51 FN predictions, while the LGBM algorithm produced 11404 TP predictions and 55 FN predictions. This indicates that the proposed hybrid machine learning algorithm detects malware PE files better than the LGBM algorithm.

E. ROC-Curve

Figure 5 shows the ROC curves used to assess the ability of the proposed machine learning algorithm to determine whether a PE file is malware. Random prediction is drawn on the ROC graph as a straight line for comparison with all proposed algorithms. If a proposed algorithm had an AUROC value below 0.500 and a curve below the random prediction line, it could not be used on the Bodmas dataset used in this research. As shown in Figure 5, the proposed hybrid machine learning algorithm, i.e., LGBM + XGB + LR, produced an AUROC value equal to 1 and a curve above the random prediction line. Because the AUROC value was greater than that of random prediction and the curve was above the random prediction line, the hybrid machine learning algorithm proposed in this research can be used to process the Bodmas dataset used in this research.
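
The metric definitions in (3) to (6) can be applied directly to confusion matrix counts. The sketch below is a minimal illustration in plain Python; the function name `classification_metrics` is an assumption, and the TN and FP counts in the last example are hypothetical, since the text reports only the TP and FN counts of Figures 3 and 4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute equations (3)-(6) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def recall_from_counts(tp, fn):
    """Equation (5) alone, for cases where only TP and FN are known."""
    return tp / (tp + fn)

# TP and FN counts reported in the text for Figures 4 and 3.
recall_hybrid = recall_from_counts(11408, 51)
recall_lgbm = recall_from_counts(11404, 55)

# Full metric set; the TN and FP values here are hypothetical placeholders.
example = classification_metrics(tp=11408, tn=15370, fp=58, fn=51)
```

On this single test split the hybrid model finds four more malware files than LightGBM (51 misses instead of 55). These single-split recalls need not equal the cross-validated recall values in Table I, which are averages over five folds.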

Fig. 5. ROC Curve of All Algorithms used in this Research.

V. CONCLUSION

Based on the research that has been done, the proposed hybrid machine learning algorithm achieves a higher recall in detecting malware than any of the base models used to build it. Although the other metrics have the same or even lower values than one of the base models, the hybrid algorithm still improves recall, which is the main focus of this research. The recall value of the proposed hybrid machine learning algorithm was 99.5026%, which was higher than the recall of each base model (LGBM, XGB, and LR), which had 99.4480%, 99.5004%, and 98.0539%, respectively. Although the proposed hybrid machine learning algorithm performed better than LightGBM, some improvements can still be made. Future research is suggested to use more and newer datasets, to use other algorithms as base models, to combine more than three algorithms into a hybrid machine learning algorithm, and to try other machine learning methods that may produce a higher accuracy value than the hybrid machine learning algorithm proposed in this research.

REFERENCES

[1] J. Singh and J. Singh, “A survey on machine learning-based malware detection in executable files,” Journal of Systems Architecture, p. 101861, 2020.
[2] A. Kumar, K. Kuppusamy, and G. Aghila, “A learning model to detect maliciousness of portable executable using integrated feature set,” Journal of King Saud University-Computer and Information Sciences, vol. 31, no. 2, pp. 252–265, 2019.
[3] B. Quinto, Next-Generation Machine Learning with Spark: Covers XGBoost, LightGBM, Spark NLP, Distributed Deep Learning with Keras, and More, 1st ed. Apress, 2020.
[4] A. Shalaginov, S. Banin, A. Dehghantanha, and K. Franke, “Machine learning aided static malware analysis: A survey and tutorial,” in Cyber Threat Intelligence. Springer, 2018, pp. 7–45.
[5] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas: An open dataset for learning based temporal analysis of pe malware,” in 4th Deep Learning and Security Workshop, 2021.
[6] P. Agrawal and B. Trivedi, “Machine learning classifiers for android malware detection,” Data Management, Analytics and Innovation, vol. 1, p. 311, 2020.
[7] S. Ardabili, A. Mosavi, and A. R. Várkonyi-Kóczy, “Advances in machine learning modeling reviewing hybrid and ensemble methods,” in International Conference on Global Research and Education. Springer, 2019, pp. 215–227.
[8] H. S. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,” ArXiv e-prints, Apr. 2018.
[9] A. Deng, H. Zhang, W. Wang, J. Zhang, D. Fan, P. Chen, and B. Wang, “Developing computational model to predict protein-protein interaction sites based on the xgboost algorithm,” International journal of molecular sciences, vol. 21, no. 7, p. 2274, 2020.
[10] L. Li, R. Situ, J. Gao, Z. Yang, and W. Liu, “A hybrid model combining convolutional neural network with xgboost for predicting social media popularity,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1912–1917.
[11] M. Ohsaki, P. Wang, K. Matsuda, S. Katagiri, H. Watanabe, and A. Ralescu, “Confusion-matrix-based kernel logistic regression for imbalanced data classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 9, pp. 1806–1819, 2017.
[12] P. S. Rajawat, D. K. Gupta, S. S. Rathore, and A. Singh, “Predictive analysis of medical data using a hybrid machine learning technique,” in 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC). IEEE, 2018, pp. 228–233.
[13] L. Binxiang, Z. Gang, and S. Ruoying, “A deep reinforcement learning malware detection method based on pe feature distribution,” in 2019 6th International Conference on Information Science and Control Engineering (ICISCE). IEEE, 2019, pp. 23–27.
[14] J. Kang and Y. Won, “A study on variant malware detection techniques using static and dynamic features,” Journal of Information Processing Systems, vol. 16, no. 4, pp. 882–895, 2020.
[15] A. Burkov, The hundred-page machine learning book, 2019, OCLC: 1089445188.
[16] P. Sornsuwit and S. Jaiyen, “A new hybrid machine learning for cybersecurity threat detection based on adaptive boosting,” Applied Artificial Intelligence, vol. 33, no. 5, pp. 462–482, 2019.
[17] B. T. Pham, I. Prakash, S. K. Singh, A. Shirzadi, H. Shahabi, D. T. Bui et al., “Landslide susceptibility modeling using reduced error pruning trees and different ensemble techniques: Hybrid machine learning approaches,” Catena, vol. 175, pp. 203–218, 2019.
[18] S. Raschka, “Mlxtend: Providing machine learning and data science utilities and extensions to python’s scientific computing stack,” The Journal of Open Source Software, vol. 3, no. 24, Apr. 2018. [Online]. Available: http://joss.theoj.org/papers/10.21105/joss.00638
[19] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” Advances in neural information processing systems, vol. 30, pp. 3146–3154, 2017.
[20] H. Zhang, S. Si, and C.-J. Hsieh, “Gpu-acceleration for large-scale tree boosting,” arXiv preprint arXiv:1706.08359, 2017.
[21] A. Schonfeld, “man-group/dtale: Visualizer for pandas data structures,” Oct 2021. [Online]. Available: https://github.com/man-group/dtale
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[23] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013, pp. 108–122.