Tuning The K Value in K-Nearest Neighbors For Malware Detection
Corresponding Author:
Mosleh M. Abualhaj
Department of Networks and Cybersecurity, Faculty of Information Technology
Al-Ahliyya Amman University
Amman, 19328, Jordan
Email: [email protected]
1. INTRODUCTION
Businesses and individuals now depend more on technology and are more networked, which presents
a variety of cyber risks. Cyber risks are potential dangers and weaknesses in digital systems, networks, and
information that could result in unauthorized access, data breaches, monetary loss, reputational harm, or
business interruption [1]. Weak or stolen passwords, phishing, denial-of-service (DoS) attacks, and malicious
software (malware) are a few typical instances of cyber risks [2], [3]. Malware is any software intended to
damage, exploit, or gain unauthorized access to computer systems, networks, or devices. It comes in a variety of
forms, each with unique traits and attack strategies; viruses, worms, Trojan horses, and ransomware are
common examples [3], [4]. The total number of malware attacks in 2022 was 5.5 billion
[5]. Attacks on this scale inevitably result in significant data breaches, data loss, or data corruption.
It is crucial to adopt security measures such as updating operating systems and apps with security
patches, training users, and deploying reliable anti-malware software in order to safeguard against malware
[6], [7]. However, attackers use advanced, evasive malware to circumvent these traditional security
measures. Such techniques demonstrate how quickly cyber risks change and how urgently businesses need
security measures that go beyond typical anti-malware solutions [8], [9]. Machine learning (ML) approaches
can be used to protect against malware by improving the detection and prevention capabilities of security
systems [9], [10].
The goal of ML is to create methods and models that let computers learn and make predictions or judgments
without having to be explicitly programmed. It entails building and training mathematical models on data,
which the models can then use to predict the future, spot patterns, or obtain new knowledge. Malware detection,
behavioral analysis, dynamic analysis, and feature extraction are just a few of the numerous ways that ML can
be used in the context of malware. ML methods can be broadly divided into three categories: reinforcement
learning, unsupervised learning, and supervised learning (SL) [9], [10].
SL is an ML technique where a model is trained on a labeled dataset that pairs input data with
corresponding target labels or outcomes. To correctly predict or classify new, unseen data, the model
must learn the relationship between the input features and the target variable. Because it can identify patterns
and traits of malware from labeled data, SL is a widely used approach in malware detection. The quality of the
labeled dataset, the feature selection, and the choice of method all affect how well malware can be detected
using SL [10]–[12]. The type of data, the complexity of the malware, and the required trade-offs between accuracy,
performance, and interpretability all play a role in choosing the best method for malware detection. In the area
of malware detection, decision trees (DT), random forests (RF), support vector machines (SVM), naive Bayes
(NB), and K-nearest neighbors (KNN) are some of the commonly employed methods. KNN is a straightforward
and interpretable method that classifies a data point by majority vote among the points closest to it in the
feature space [13]–[17]. It is appropriate for scenarios in which neighborhood and local patterns
play a significant role in malware identification. In this study, an ML model called malware-KNN (MW-KNN)
that uses an optimized KNN algorithm is built for malware detection.
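As a minimal sketch of the KNN idea described above, the following pure-Python function classifies a query point by majority vote among its k nearest training points. The function name `knn_predict`, the Euclidean distance metric, and the toy feature vectors are assumptions made for illustration; they are not taken from the MW-KNN implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point (assumed metric).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up feature vectors (not taken from any real dataset).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
                (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_labels = ["benign", "benign", "benign",
                "malware", "malware", "malware"]

prediction = knn_predict(train_points, train_labels, query=(0.7, 0.8), k=3)
print(prediction)
```

With an odd k and two classes a tie cannot occur; with more classes or an even k, this sketch simply returns whichever tied label was counted first.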
2. RELATED WORKS
Goyal and Kumar [13] discuss the difficulty of identifying malware, particularly in light of the volume
of malware that is produced and spread on a daily basis. The major goal of the study is to minimize harm
through early malware detection. The pipeline procedure for both signature-based and behavior-based malware
detection algorithms is thoroughly explained by Goyal and Kumar [13]. Using a dataset of 1,494 malware and
1,347 benign samples, the authors ran an experiment. These samples were used to extract two different types
of features: non-repetitive consecutive application programming interface (API) calls for dynamic analysis and
string features for static analysis. After that, they applied several ML methods (Gaussian NB, multi NB, DT,
RF, KNN, and SVM) to these features with training/testing ratios of 80:20, 70:30, and 60:40.
The findings indicated that dynamic features are more promising than static features, as the
accuracy with the API calls feature was higher than the accuracy with the string feature. With an accuracy of
97.53%, the RF algorithm on API calls produced the best results. The behavior-based approach is more
promising, according to the authors, for identifying new malware.
Malware program classification is a challenge that Davuluru et al. [14] address. The purpose of this
paper is to investigate the performance of different convolutional neural network (CNN)-based architectures
(AlexNet, ResNet, and very deep convolutional [VGG16]) as feature extractors and classification tools after
the visualization of malware programs. The authors suggest a fusion strategy that combines CNN, which has
been producing cutting-edge results for image-based classification, with the pattern recognition approach,
which proved successful for classifying malware. They use classic ML methods like SVM and KNN to classify
by extracting features from the suggested CNN architectures. For a set of 2,174 test samples taken from the
BIG 2015 dataset, the suggested algorithm achieves an overall accuracy of 99.4%. The findings unambiguously
show that CNN is useful for categorizing malware programs as a feature extractor as well as a classification
tool. The performance of algorithms is studied to aid subject-matter experts in selecting the best algorithm for
their purposes.
According to Narayanan et al. [15], existing malicious groups and classes are polymorphic, which
makes it challenging for conventional malware detection techniques to work properly. By visualizing malware
in an image format that captures minute changes while preserving the global structure, the study aims to
improve malware categorization and to show that malware classification can be enhanced when
approached as an image classification problem. Principal component analysis (PCA) is used for
feature extraction. The performance of various artificial neural network (ANN) algorithms, along with KNN
and SVM methods, is studied for assigning malware samples to their respective classes. The findings
imply that each malware family has a characteristic pattern. These patterns are easily distinguishable between
families and are relatively similar within one family. Because image patterns for malware programs from the
same family tend to be similar, the authors discovered that the KNN classifier performs well. The outcomes
also show that PCA transformation is the best option in this case.
Hegedus et al. [16] discuss malware detection and pay particular attention to the drawbacks of
signature-based malware detection. Because of the growth of polymorphic and metamorphic malware, the
authors emphasize the necessity of execution-level identification. The study's goal is to enhance malware
detection by offering a two-stage process that makes use of the KNN method's random projections. According
to the study, a set of samples is first pruned, and only the samples that satisfy a particular requirement are kept.
They are then regarded as "unpredictable" and perhaps "clean" samples. To find potential false negatives, the
authors next employ the KNN method and Jaccard similarity measure. They also go through how the confusion
matrix is affected by the random projection dimension and how to leverage it to get better outcomes. The results
demonstrate that raising the projected vectors' dimensions improves the outcomes: the proportion of
unpredictable samples declines, while the true positives rise and the false positives fall. The authors also point
out that when dimensions increase, the true and false negatives seldom alter.
Şahın et al. [17] draw attention to the rising incidence of malware on Android devices. The study's
goal is to make Android malware detection better by advocating a permission weight strategy. The authors
provide a weighting mechanism that uses the KNN and NB methods for malware identification and gives each
permission a unique score. The relevance frequency method, a successful weighting method in text
categorization, is also covered by the authors as it relates to Android malware detection. The results
demonstrate that the suggested strategy outperforms earlier ones. Both accuracy
and F-score showed an average improvement of 2% with the KNN method. The accuracy and F-score metrics
revealed an average improvement of 4% and 7% with the NB method, respectively. When comparing the
classification methods, the KNN method produced the greatest results for the accuracy metric, whereas
Gaussian NB produced the best results for the F-score metric.
3. CIC-MALMEM-2022 DATASET
A malware dataset called CIC-MalMem-2022 is used to evaluate malware detection techniques in this
study. Using malware that is common in the real world, the dataset was developed to replicate a situation as
closely as possible to the real world. The dataset is balanced with 50% malicious memory dumps and 50%
benign memory dumps. There are 58,596 records total in the dataset, 29,298 of which are benign and 29,298
of which are malicious. The dataset includes the three primary malware categories of Trojan Horse,
Ransomware, and Spyware. There are five subcategories within each of these categories. The subcategories of
Trojan Horse are Zeus (1,950 samples), Emotet (1,967 samples), Refroso (2,000 samples), Scar (2,000
samples), and Reconyc (1,570 samples). The subcategories of Ransomware are Conti (1,988 samples), MAZE
(1,958 samples), Pysa (1,717 samples), Ako (2,000 samples), and Shade (2,128 samples). The subcategories
of Spyware are 180Solutions (2,000 samples), Coolwebsearch (2,000 samples), Gator (2,200 samples),
Transponder (2,410 samples), and TIBS (1,410 samples). The dataset also includes 55 attributes that were
utilized to differentiate the various malware groups [18].
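The sample counts quoted above can be checked with a few lines of arithmetic. The sketch below only re-adds the figures stated in this section; the dictionary layout and variable names are illustrative choices.

```python
# Subcategory sample counts exactly as listed in the text above.
subcategory_counts = {
    "Trojan Horse": {"Zeus": 1950, "Emotet": 1967, "Refroso": 2000,
                     "Scar": 2000, "Reconyc": 1570},
    "Ransomware": {"Conti": 1988, "MAZE": 1958, "Pysa": 1717,
                   "Ako": 2000, "Shade": 2128},
    "Spyware": {"180Solutions": 2000, "Coolwebsearch": 2000, "Gator": 2200,
                "Transponder": 2410, "TIBS": 1410},
}

# Total per main category, then across all malicious samples.
category_totals = {
    category: sum(subs.values()) for category, subs in subcategory_counts.items()
}
malicious_total = sum(category_totals.values())

benign_total = 29298  # stated in the text
dataset_total = malicious_total + benign_total

print(category_totals)   # {'Trojan Horse': 9487, 'Ransomware': 9791, 'Spyware': 10020}
print(malicious_total)   # 29298
print(dataset_total)     # 58596
```

The subcategory counts sum to 29,298 malicious records, which matches the stated 50/50 balance and the overall total of 58,596 records.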
4. METHOD
This section discusses the proposed model used to detect malware. The model
comprises data preparation for the classification algorithm and the KNN algorithm itself. Two main steps are
performed to prepare the data: transformation and normalization, as discussed in sections 4.1 and 4.2,
respectively. Section 4.3 discusses the operation and parameter values of the KNN classifiers.
4.1. Transformation
The term "transformation" refers to converting non-numerical data values into numerical ones. Since
most ML methods work with numerical data, it is frequently necessary to convert data into numerical
representations. One of the most common methods of data transformation is label encoding, which assigns each
category in the variable a unique numerical value. This encoding is often used when the categorical variable
has an intrinsic order or ranking [10]. In the CIC-MalMem-2022 dataset used here, the output field is
textual and contains 4 main-category values and 16 subcategory values [18]. Label encoding
was used to convert these values into numbers: the main categories are numbered 0 to 3, and
the subcategories are numbered 0 to 15.
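A minimal sketch of label encoding is given below. The label strings and the alphabetical assignment of integers are assumptions for illustration; the text does not state which category receives which number, and the exact label spelling in the dataset file may differ.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


# Hypothetical main-category labels (4 distinct values, as in the text).
labels = ["Benign", "Spyware", "Ransomware", "Trojan", "Benign", "Ransomware"]

mapping = fit_label_encoder(labels)
encoded = [mapping[label] for label in labels]

print(mapping)  # {'Benign': 0, 'Ransomware': 1, 'Spyware': 2, 'Trojan': 3}
print(encoded)  # [0, 2, 1, 3, 0, 1]
```

The same function applied to a column with 16 distinct subcategory labels would yield the codes 0 to 15.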
4.2. Normalization
Normalization is used in ML to scale numerical features to a standard range or distribution. It seeks
to level the scale between all features, preventing one feature from monopolizing learning due to its greater
magnitude. The popular data normalizing method known as "min-max scaling" is used in ML to scale
numerical features to a particular range, usually between 0 and 1. It maintains the relative ordering of the data
points while linearly transforming the original values to a normalized scale [10]. The min-max scaling formula
is shown in (1), where val is the original value of the feature, n_val is the new value after normalization, mi_val
is the minimum value in the feature, and ma_val is the maximum value in the feature.

$$n\_val = \frac{val - mi\_val}{ma\_val - mi\_val} \qquad (1)$$

Tuning the K value in K-nearest neighbors for malware detection (Mosleh M. Abualhaj)
2278 ISSN: 2252-8938
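The min-max scaling in (1) can be sketched in a few lines of Python. The variable names follow the symbols in (1); the function name and the handling of a constant feature (where ma_val equals mi_val) are assumptions, since the text does not cover that case.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using n_val = (val - mi_val) / (ma_val - mi_val)."""
    mi_val = min(values)
    ma_val = max(values)
    span = ma_val - mi_val
    if span == 0:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(val - mi_val) / span for val in values]


feature = [10.0, 20.0, 15.0, 30.0]  # made-up feature column
scaled = min_max_scale(feature)
print(scaled)  # [0.0, 0.5, 0.25, 1.0]
```

The relative ordering of the values is preserved, and the smallest and largest values map to 0 and 1 respectively.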
[Figure: Accuracy versus value of K, panels (a) and (b). K takes the values 3, 5, 7, 11, 15, 19, 23, 27, 31, 34, 37, and 217; data labels span 99.582% to 99.974% in one panel and 63.404% to 66.314% in the other.]
In summary, the MW-KNN model achieved its highest accuracy and recall for multiclass
classification when k is equal to 5. In all other cases, across the four metrics (Accuracy,
Recall, Precision, and F1-Score) and the two classification types, the MW-KNN model achieved its
highest results when k is equal to 3. Finally, in general, as the value of k increases, the performance of
the model decreases on all metrics.
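A sweep over candidate k values of the kind summarized above can be sketched as follows. This is not the experimental code behind the reported results: the data are synthetic, the train/test split is arbitrary, and the helper names `knn_predict` and `accuracy_for_k` are hypothetical. The candidate list reuses the k values on the figure axes, except 217, which exceeds the size of this toy training set.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy_for_k(train, test, k):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


# Synthetic two-class data (made up for illustration, not CIC-MalMem-2022).
rng = random.Random(0)
data = [((rng.gauss(0.3, 0.15), rng.gauss(0.3, 0.15)), 0) for _ in range(100)]
data += [((rng.gauss(0.7, 0.15), rng.gauss(0.7, 0.15)), 1) for _ in range(100)]
rng.shuffle(data)
train, test = data[:150], data[150:]

# Evaluate each candidate k on the held-out samples and keep the best one.
k_values = [3, 5, 7, 11, 15, 19, 23, 27, 31, 34, 37]
results = {k: accuracy_for_k(train, test, k) for k in k_values}
best_k = max(results, key=results.get)

for k, acc in results.items():
    print(f"k={k:>2}  accuracy={acc:.3f}")
print("best k:", best_k)
```

On real data, selecting k on a validation split or by cross-validation rather than on the final test set avoids an optimistic estimate of the tuned model's performance.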
[Figure: Recall versus value of K, panels (a) and (b). K takes the values 3, 5, 7, 11, 15, 19, 23, 27, 31, 34, 37, and 217; data labels span 99.582% to 99.974% in one panel and 61.007% to 66.314% in the other.]
[Figure: Precision versus value of K, panels (a) and (b). K takes the values 3, 5, 7, 11, 15, 19, 23, 27, 31, 34, 37, and 217; data labels span 99.582% to 99.974% in one panel and 61.340% to 68.107% in the other.]
[Figure: F1-Score versus value of K, panels (a) and (b). K takes the values 3, 5, 7, 11, 15, 19, 23, 27, 31, 34, 37, and 217; data labels span 99.582% to 99.974% in one panel and 60.401% to 66.411% in the other.]
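For reference, the four plotted metrics can be computed from true and predicted labels as sketched below. The text does not state how precision, recall, and F1-score are averaged across classes in the multiclass case, so macro averaging is an assumption here, as are the function name and the toy label vectors.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score (assumed averaging)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for c in classes:
        # Per-class true positives, false positives, and false negatives.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
# {'accuracy': 0.6667, 'precision': 0.7222, 'recall': 0.6667, 'f1': 0.6556}
```

With a different averaging scheme (for example, weighting each class by its number of samples), the precision, recall, and F1 values would generally differ from the macro averages shown here.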
6. CONCLUSION
This study proposed the MW-KNN model for malware detection, which uses the KNN method with parameter
tuning. The model aims to improve the performance and effectiveness of malware
detection systems by exploiting the advantages of KNN and tuning its parameters. By systematically varying
the number of neighbors (K), we have shown that parameter tuning is a key part of improving the Accuracy,
Recall, Precision, MCC, and F1-Score of the KNN algorithm for malware detection.
The results highlight the importance of carefully selecting parameter values in order to obtain
the best outcomes. Tests were carried out on the CIC-MalMem-2022 dataset. By comparing the performance
obtained with different K values, we found that parameter tuning considerably increases the accuracy of
malware classification. The MW-KNN model has considerable potential for cybersecurity since it addresses the
urgent need for effective malware detection techniques. The model provides a strong foundation for the precise
identification and classification of malicious software by leveraging the KNN method, which is known for
its simplicity and efficacy in classification tasks. To fully assess the MW-KNN model's efficacy in real-world
circumstances, however, more investigation and testing are required. Thorough testing, benchmarking against
current detection systems, and computational efficiency analyses will determine its viability and scalability.
REFERENCES
[1] P. Lau, L. Wang, W. Wei, Z. Liu, and C.-W. Ten, “A novel mutual insurance model for hedging against cyber risks in power systems
deploying smart technologies,” IEEE Transactions on Power Systems, vol. 38, no. 1, pp. 630–642, Jan. 2023, doi:
10.1109/TPWRS.2022.3164628.
[2] O. I. Falowo, S. Popoola, J. Riep, V. A. Adewopo, and J. Koch, “Threat actors’ tenacity to disrupt: examination of major
cybersecurity incidents,” IEEE Access, vol. 10, pp. 134038–134051, 2022, doi: 10.1109/ACCESS.2022.3231847.
[3] D.-O. Won, Y.-N. Jang, and S.-W. Lee, “PlausMal-GAN: Plausible malware training based on generative adversarial networks for
analogous zero-day malware detection,” IEEE Transactions on Emerging Topics in Computing, vol. 11, no. 1, pp. 82–94, Jan. 2023,
doi: 10.1109/TETC.2022.3170544.
[4] K. A. Dhanya et al., “Obfuscated malware detection in IoT android applications using markov images and CNN,” IEEE Systems
Journal, vol. 17, no. 2, pp. 2756–2766, Jun. 2023, doi: 10.1109/JSYST.2023.3238678.
[5] A. Petrosyan, “Number of malware attacks per year 2022,” Statista, 2022, [Online]. Available:
https://www.statista.com/statistics/873097/malware-attacks-per-year-worldwide/
[6] Y. Zhang et al., “Looking back! Using early versions of android apps as attack vectors,” IEEE Transactions on Dependable and
Secure Computing, vol. 18, no. 2, pp. 652–666, Mar. 2021, doi: 10.1109/TDSC.2019.2914202.
[7] M. Belaoued, A. Derhab, S. Mazouzi, and F. A. Khan, “MACoMal: A multi-agent based collaborative mechanism for anti-malware
assistance,” IEEE Access, vol. 8, pp. 14329–14343, 2020, doi: 10.1109/ACCESS.2020.2966321.
[8] Y. Guo, C.-W. Ten, S. Hu, and W. W. Weaver, “Preventive maintenance for advanced metering infrastructure against malware
propagation,” IEEE Transactions on Smart Grid, vol. 7, no. 3, pp. 1314–1328, May 2016, doi: 10.1109/TSG.2015.2453342.
[9] A. Abusnaina et al., “DL-FHMC: Deep learning-based fine-grained hierarchical learning approach for robust malware classification,”
IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 5, pp. 3432–3447, Sep. 2022, doi: 10.1109/TDSC.2021.3097296.
[10] M. M. Abualhaj, A. A. Abu-Shareha, M. O. Hiari, Y. Alrabanah, M. Al-Zyoud, and M. A. Alsharaiah, “A paradigm for DoS attack
disclosure using machine learning techniques,” International Journal of Advanced Computer Science and Applications, vol. 13, no.
3, 2022, doi: 10.14569/IJACSA.2022.0130325.
[11] S. D. S.L and J. C.D, “Windows malware detector using convolutional neural network based on visualization images,” IEEE
Transactions on Emerging Topics in Computing, vol. 9, no. 2, pp. 1057–1069, Apr. 2021, doi: 10.1109/TETC.2019.2910086.
[12] M. Kolhar, F. Al-Turjman, A. Alameen, and M. M. Abualhaj, “A three layered decentralized IoT biometric architecture for city
lockdown during COVID-19 outbreak,” IEEE Access, vol. 8, pp. 163608–163617, 2020, doi: 10.1109/ACCESS.2020.3021983.
[13] M. Goyal and R. Kumar, “The pipeline process of signature-based and behavior-based malware detection,” in 2020 IEEE 5th
International Conference on Computing Communication and Automation (ICCCA), Oct. 2020, pp. 497–502. doi:
10.1109/ICCCA49541.2020.9250879.
[14] V. S. P. Davuluru, B. Narayanan Narayanan, and E. J. Balster, “Convolutional neural networks as classification tools and feature
extractors for distinguishing malware programs,” in 2019 IEEE National Aerospace and Electronics Conference (NAECON), Jul.
2019, pp. 273–278. doi: 10.1109/NAECON46414.2019.9058025.
[15] B. N. Narayanan, O. Djaneye-Boundjou, and T. M. Kebede, “Performance analysis of machine learning and pattern recognition
algorithms for malware classification,” in 2016 IEEE National Aerospace and Electronics Conference (NAECON) and Ohio
Innovation Summit (OIS), Jul. 2016, pp. 338–342. doi: 10.1109/NAECON.2016.7856826.
[16] J. Hegedus, Y. Miche, A. Ilin, and A. Lendasse, “Methodology for behavioral-based malware analysis and detection using random
projections and K-nearest neighbors classifiers,” in 2011 Seventh International Conference on Computational Intelligence and
Security, Dec. 2011, pp. 1016–1023. doi: 10.1109/CIS.2011.227.
[17] D. O. Sahin, O. E. Kural, S. Akleylek, and E. Kilic, “New results on permission based static analysis for Android malware,” in 2018
6th International Symposium on Digital Forensic and Security (ISDFS), Mar. 2018, pp. 1–4. doi: 10.1109/ISDFS.2018.8355377.
[18] M. Dener, G. Ok, and A. Orman, “Malware detection using memory analysis data in big data environment,” Applied Sciences, vol.
12, no. 17, Aug. 2022, doi: 10.3390/app12178604.
[19] A. A. Kardan, A. Kavian, and A. Esmaeili, “Simultaneous feature selection and feature weighting with K selection for KNN
classification using BBO algorithm,” in The 5th Conference on Information and Knowledge Technology, May 2013, pp. 349–354.
doi: 10.1109/IKT.2013.6620092.
[20] D. O. Sahin and S. Demirci, “Spam filtering with KNN: Investigation of the effect of k value on classification performance,” in
2020 28th Signal Processing and Communications Applications Conference (SIU), Oct. 2020, pp. 1–4. doi:
10.1109/SIU49456.2020.9302516.
[21] T. Kumar, “Solution of linear and non linear regression problem by K nearest neighbour approach: By using three sigma rule,” in
2015 IEEE International Conference on Computational Intelligence & Communication Technology, Feb. 2015, pp. 197–201. doi:
10.1109/CICT.2015.110.
[22] L. Chen, M. Li, W. Su, M. Wu, K. Hirota, and W. Pedrycz, “Adaptive feature selection-based AdaBoost-KNN with direct
optimization for dynamic emotion recognition in human–robot interaction,” IEEE Transactions on Emerging Topics in
Computational Intelligence, vol. 5, no. 2, pp. 205–213, Apr. 2021, doi: 10.1109/TETCI.2019.2909930.
[23] R. Ghosh, S. Phadikar, N. Deb, N. Sinha, P. Das, and E. Ghaderpour, “Automatic eyeblink and muscular artifact detection and
removal from EEG signals using K-nearest neighbor classifier and long short-term memory networks,” IEEE Sensors Journal, vol.
23, no. 5, pp. 5422–5436, Mar. 2023, doi: 10.1109/JSEN.2023.3237383.
[24] H. Al-Mimi, N. A. Hamad, M. M. Abualhaj, M. S. Daoud, A. Al-dahoud, and M. Rasmi, “An enhanced intrusion detection system
for protecting HTTP services from attacks,” International Journal of Advances in Soft Computing & Its Applications, vol. 15, no.
2, pp. 67–84, 2023.
[25] M. A. Alsharaiah et al., “A new phishing-website detection framework using ensemble classification and clustering,” International
Journal of Data and Network Science, vol. 7, no. 2, pp. 857–864, 2023, doi: 10.5267/j.ijdns.2023.1.003.
BIOGRAPHIES OF AUTHORS
Dr. Ahmad Adel Abu-Shareha received his first degree in Computer Science
from Al Al-Bayt University, Jordan, in 2004, his Master's degree from Universiti Sains Malaysia
(USM), Malaysia, in 2006, and his Ph.D. degree from USM, Malaysia, in 2012. His research focuses
on data mining, artificial intelligence, and multimedia security. He has investigated many machine
learning algorithms and applied artificial intelligence in a variety of fields, such as networking,
medical information processing, and knowledge construction and extraction. He can be contacted at
email: [email protected].
Dr. Qusai Y. Shambour received the B.Sc. degree in Computer Science from
Yarmouk University, Jordan, in 2001, the M.S. degree in computer networks from University
of Western Sydney, Australia, in 2003, and the Ph.D. degree in software engineering from
the University of Technology Sydney, Australia, in 2012. Currently, he is a Professor at the
Department of Software Engineering, Al-Ahliyya Amman University, Jordan. His research
interests include information filtering, recommender systems, VoIP, machine learning, and
data science. He can be contacted at email: [email protected].