Malware Detection Using Ensemble Learning and File Monitoring
2023 2nd International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN) | 979-8-3503-4800-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICSTSN57873.2023.10151567
Tilak Vignesh Sowhith Reddy Sonit Kumar
Department of CSE Department of CSE Department of CSE
PES University PES University PES University
Bangalore, India Bangalore, India Bangalore, India
Abstract—Malware refers to harmful programs that cybercriminals use to infiltrate a specific machine or an organisation's entire network. It takes advantage of flaws in legitimate software (such as a browser or a plugin for an online application) that can be hijacked. Machine learning (ML) is widely used to mitigate this problem and is an effective solution, but an ML model can misclassify some files, which can lead to system exploits. This paper aims to provide a method to detect malware using ensemble learning and to further monitor files based on a probability value assigned to each file by the model.

Keywords—Malware Detection, Ensemble Learning, Machine Learning, File Monitoring, ML Classifiers

I. INTRODUCTION
Malware is designed to exploit systems. As technology advances, malware is becoming more difficult to detect and reject. With the rise of modern cyber security, machine learning is now being used to detect such advanced malware. Detecting malware at the time of entry is vital to prevent a cyber-attack. But what about malware that is unknown or has not been detected before? What about malware that bypasses the security systems? Our paper works towards solving these two problems by building a model using ensemble learning to predict whether a file is malware, and by monitoring files that enter the system after assigning each file a "likelihood" value.

To address a specific computational intelligence problem, many models, such as classifiers or experts, are strategically developed and merged in a process known as ensemble learning [1]. Ensemble learning is mainly used to improve a model's performance (classification, prediction, utility estimation, and so on) or to decrease the likelihood of unintentionally selecting an inferior model. We use ensemble learning specifically because of its high accuracy, which is very important for malware detection: malware that escapes detection is unacceptable. Another reason is that ensemble learning is more stable and less noisy. File monitoring is a method to tackle a problem shared by all ML-based malware detectors, namely malicious files that bypass the system because the ML model could not detect them. It amounts to keeping track of a file's activities once it enters a system to make sure it does not exploit the system.

II. LITERATURE REVIEW
H. Sayadi et al. [2] proposed a run-time, hardware-based malware detection technique. The hardware performance counter (HPC) data was extracted using the Perf tool, which is available on Linux. The output variable is the application's class (malware vs. benign), whereas the HPCs sampled at intervals of 10 ms from the running programs serve as the input features for the classifiers. The authors began with 44 performance counters. The features are then scored based on their importance and relevance to the target variable using a feature scoring technique. Using a feature reduction technique, the 16 hardware performance counters most closely linked to malware detection were found and ranked.

The majority of ML classifiers perform well before feature reduction (16 HPCs), usually providing accuracy rates above 80 percent. After using ensemble learning, these accuracy values increased to 88 percent. The work shows the effectiveness of ensemble methods in improving the performance of ML classifiers with fewer HPCs, as opposed to extracting 16 or 8 hardware performance counters, which would place a substantial implementation cost on the system in terms of resource use and energy consumption.

The authors observed that running ML algorithms on a large number of HPCs is difficult and time-consuming. Additionally, the classifier's accuracy would decline if irrelevant features were added.

In order to improve resistance to specific evasion techniques, the ensemble detector proposed by M. Ficco [3] makes use of the advantages of the main analysis algorithms published in the literature. To enhance the unpredictability of the analysis process in general and the detection method in particular, the research provides a variety of strategies for combining general and specialised detectors.
Authorized licensed use limited to: VIT University. Downloaded on January 17,2024 at 05:43:09 UTC from IEEE Xplore. Restrictions apply.
The suggested methods further help to increase detection rates when unidentified malware families are present, and they provide better detection performance when the detector is not re-trained on a regular basis to keep up with malware evolution. The performance of the specialised ensemble detector, which includes the four best specialised low detectors, is compared to that of the alpha-count ensemble, which consists of two generic and two specialised low detectors. The results show that the alpha-count ensemble detector performs better than the specialised ensemble detector, particularly in terms of sensitivity and accuracy. Additionally, the number of false positives has significantly decreased, though it is still marginally higher than that of the specialised ensemble detector.

H. Rathore et al. [4] presented work on malware detection using (1) a variety of machine learning methods and (2) deep learning models. According to the authors' findings, random forests beat deep neural networks on opcode frequency features. The deep autoencoder was overkill for feature reduction on the dataset, and a simple method such as a variance threshold outperformed the others. In addition to the proposed approach, the work discusses additional problems and difficulties specific to the field, as well as open research questions, limitations, and future directions. It yields highly accurate models: using a variance threshold and random forest, the authors attained a maximum accuracy of 99.78 percent. However, the model will not work against malware that has not previously been detected or used to train the model, and it may also discard non-malware files.

The ensemble learning technique SMASH, developed by Y. Dai et al. [5], combines software and hardware features: it extracts the API call sequence, hardware performance counters, and memory dump from malware as detector features and produces several types of feature vectors. Hardware features thus compensate for the susceptibility of software features to evasion, while software features make up for the lower detection precision of hardware features. Using existing neural networks with good detection performance, each feature was assigned to a particular detector for malware classification, and all detection results were combined to determine the maliciousness of the tested sample. First, the technique combines low-level hardware features that resist evasion, such as the memory dump grayscale image and hardware performance counters, with software features of high detection precision, such as API call sequences. Second, the authors tried to improve each feature relative to the original research by selecting a more advanced classifier model to raise the detection precision of each single feature. Finally, they arrived at an ensemble learning algorithm composed of multiple classification algorithms for malware detection. This approach does not work for the types of malware that are difficult to detect, which are highly threatening.

Amer et al. [6] provided an ensemble-learning-based detection technique. The file header is mined for the smallest possible set of significant attributes that can be used to train the model. Evaluations show that ensemble models outperform individual classification models by a small margin. According to experimental tests, the model they proposed could predict unknown malware with an accuracy of 0.998 and a false positive rate of 0.002. The adoption of ML-based methodologies to replace traditional signature-based techniques was emphasised. These ML models capitalise on the fast-rising prevalence of undetected malware, which has been a problem for commercial antivirus software. The accuracy, adaptability, tuning, and dependability of the suggested machine learning models improve as additional training is carried out using more malware samples. Compared to signature-based detection, however, the approach uses more processing resources and has a more complicated model.

III. PROPOSED METHODOLOGY

A. Dataset
The dataset used [7] is the BIG 2015 dataset that was made available by Microsoft on the Kaggle platform. The dataset is 0.5 terabytes in size and contains 10868 training samples and 10873 test samples. There are 9 families of malware in the dataset, namely:
• Ramnit: It steals user credentials.
• Lollipop: It is adware and can also monitor user traffic.
• Kelihos_ver1 and Kelihos_ver3: These are Trojans that take full control of a system and propagate via email.
• Vundo: It installs other malicious content and shows pop-up ads.
• Simda: It steals user passwords and creates a backdoor.
• Tracur: The attacker uses it to show fake ads and earn money from them.
• Obfuscator.ACY: This family covers obfuscated malware.
• Gatak: A Trojan that infects systems via malicious code.

B. Malware Detection
This section discusses how the model is trained using ensemble learning to classify a file into any of the 9 types of malware mentioned above.

1) Combined Workflow:
• A file is downloaded. A Python library called "watchdog" is used to detect this file download.
• The file is pre-processed: it is disassembled to obtain the ASM data and the byte matrix.
• The ASM data is further converted into opcode frequencies, which are stored in a CSV file. The byte matrix is stored in a CSV file as well.
• The CSV files are combined.
• The combined CSV is passed to the model which we trained earlier.
• The model outputs 9 values between 0 and 1, each giving the probability of the file being one of the 9 malware types.
• We select the maximum of the 9 values and make a decision based on it.
• If the value is less than 0.4, we conclude that the file is not any of the 9 types of malware.
• If the value is between 0.4 and 0.7, we say it might be malware and monitor the file.
• If the value is greater than 0.7, we discard the file.
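The decision rule that closes the combined workflow (take the largest of the 9 class probabilities and compare it with 0.4 and 0.7) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Decision` and `triage` are hypothetical, and because the paper does not say how scores of exactly 0.4 or 0.7 are handled, the sketch assumes both fall into the "monitor" band.

```python
from enum import Enum


class Decision(Enum):
    SAFE = "safe"        # not any of the 9 malware types
    MONITOR = "monitor"  # might be malware: hand over to the file monitor
    DISCARD = "discard"  # treated as malware: delete the file


def triage(probabilities, monitor_at=0.4, discard_at=0.7):
    """Map the model's per-class probabilities to (decision, class index, score).

    Assumption: scores equal to monitor_at or discard_at are monitored;
    the paper leaves these boundary cases unspecified.
    """
    if not probabilities:
        raise ValueError("expected at least one class probability")
    # Index of the most likely malware class and its probability.
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    score = probabilities[index]
    if score < monitor_at:
        decision = Decision.SAFE
    elif score <= discard_at:
        decision = Decision.MONITOR
    else:
        decision = Decision.DISCARD
    return decision, index, score
```

With the scores reported later in the paper, a maximum probability of 0.25 maps to `Decision.SAFE` and a maximum of 0.895 maps to `Decision.DISCARD`. Keeping the two thresholds as parameters makes it easy to trade missed detections against discarded benign files.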
Fig. 3. Opcode frequency csv
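The opcode-frequency and byte features used in the workflow could be computed roughly as sketched here. Everything specific in the sketch is an assumption for illustration: the opcode vocabulary `OPCODES`, the `op_`/`byte_` column names, and the input layout (disassembly lines with `;` comments, and hex-dump lines of the form `<address> <hex byte> ...` in which unknown bytes appear as `??`). The paper does not state which opcodes are counted or how the byte matrix is encoded.

```python
import re
from collections import Counter

# Hypothetical opcode vocabulary; the paper does not list the opcodes it counts.
OPCODES = ("mov", "push", "pop", "call", "jmp", "add", "sub", "xor", "cmp", "retn")

# A lowercase word that stands alone between whitespace (so "sub_401008" is skipped).
TOKEN = re.compile(r"(?<!\S)[a-z]+(?!\S)")
HEX_BYTE = re.compile(r"[0-9A-Fa-f]{2}")


def opcode_frequencies(asm_text):
    """Count how often each opcode in OPCODES occurs in a disassembly listing."""
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]  # drop trailing disassembler comments
        counts.update(t for t in TOKEN.findall(code) if t in OPCODES)
    return {op: counts[op] for op in OPCODES}


def byte_frequencies(bytes_text):
    """Count byte values in a hex dump, skipping the address column and '??' gaps."""
    counts = Counter()
    for line in bytes_text.splitlines():
        for token in line.split()[1:]:
            if HEX_BYTE.fullmatch(token):
                counts[int(token, 16)] += 1
    return counts


def feature_row(asm_text, bytes_text):
    """Merge opcode and byte counts into one row, as in the 'combined csv' step."""
    row = {f"op_{op}": n for op, n in opcode_frequencies(asm_text).items()}
    byte_counts = byte_frequencies(bytes_text)
    row.update({f"byte_{value:02X}": byte_counts[value] for value in range(256)})
    return row
```

Each call to `feature_row` yields one fixed-width dictionary per file, which can be written out with `csv.DictWriter` to obtain a combined table like the one the workflow passes to the model.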
4) Final Model: Figure 7 shows how the final pre-processed data is converted into a hybrid dataset, on which an artificial neural network (ANN) is finally applied to get the result.
Fig. 7. Flowchart depicting the final model

C. File Monitoring
Once the model has been trained using the implementation explained above, it can be used to classify files. We monitor a file based on its CPU usage and activities, using a Python library called "psutil". If the program detects a file exceeding its utilization limit, the process is killed and the program is discarded.

IV. RESULT

A. Malware Detection
The model classifies a file as one of the 9 types of malware with an accuracy of 97.6 percent and a test loss of 9 percent, as shown in figure 8.
Fig. 10. A comparison of the performance of different ML models for the BIG2015 dataset
As seen in the bar chart in figure 10, which uses the accuracies reported by Hemalatha J et al. [8], the proposed model achieves the highest accuracy, 97.63 percent, compared to the rest of the models. The performance of the proposed methodology compared to other deep learning models is shown in figure 11.
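The file monitoring step of Section III-C (track the PIDs of a suspicious file and kill any process that exceeds its utilization) can be sketched with psutil as follows. This is only a sketch under stated assumptions: the 80 percent CPU limit, the sampling interval, the number of rounds, and the names `select_offenders` and `monitor` are hypothetical, since the paper does not give the limit or the polling scheme it uses. Deleting the file afterwards is left to the caller.

```python
def select_offenders(usage_by_pid, cpu_limit):
    """Return the PIDs whose sampled CPU usage (in percent) is above cpu_limit."""
    return sorted(pid for pid, usage in usage_by_pid.items() if usage > cpu_limit)


def monitor(pids, cpu_limit=80.0, interval=1.0, rounds=10):
    """Poll the given PIDs and kill any process that exceeds cpu_limit.

    Returns the list of PIDs that were killed. The limit, interval, and
    number of rounds are illustrative assumptions, not values from the paper.
    """
    # Third-party dependency; imported here so select_offenders has none.
    import psutil

    tracked = {}
    for pid in pids:
        try:
            tracked[pid] = psutil.Process(pid)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # process already gone or not inspectable

    killed = []
    for _ in range(rounds):
        if not tracked:
            break
        usage = {}
        for pid, proc in list(tracked.items()):
            try:
                # Blocks for `interval` seconds and returns CPU use over that window.
                usage[pid] = proc.cpu_percent(interval=interval)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                del tracked[pid]
        for pid in select_offenders(usage, cpu_limit):
            try:
                tracked.pop(pid).kill()
                killed.append(pid)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
    return killed
```

Separating the threshold check from the psutil calls keeps the policy easy to test in isolation. Note that psutil reports per-process CPU usage relative to a single core, so values above 100 are possible on multi-core machines and the limit should be chosen with that in mind.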
Fig. 12. A screenshot of the Brave executable passed to the application, marked as safe.
In the example in figure 12, brave.exe is passed to the application. The probability of it belonging to one of the malware classes is 0.25, hence the file is marked as safe.

• When the file is sent for monitoring
Fig. 13. A screenshot of the EasyBCD executable passed to the application, marked for monitoring.
In figure 13 we see that the file might be malware, as it has a probability greater than 0.4. Hence its PIDs are monitored in order to track its activity. In the figure, we can see the PIDs stored in an array and constantly tracked.

• When the file is malware
Fig. 14. A screenshot of the EICAR text file passed to the application, flagged as malware and deleted.
In figure 14 we see that the file is malware, as it has a probability of 0.895. Hence the file is deleted. The file is sourced from the EICAR website [9-17], where anti-malware test files can be obtained.

V. CONCLUSION
The paper explained an approach to detect malware using a type of machine learning called ensemble learning. We proposed an architecture to build a model that classifies a file into one of the 9 types of malware mentioned above. The paper also took into account malware not detected by the model and provided a method to tackle it: such files are constantly monitored in the system and discarded once malicious activity is detected.

As seen above, the proposed ML model worked better for malware detection and classification than conventional ML methods and other deep learning methods. The increased accuracy was due to the use of multiple models, and further improvement in accuracy for malware detection can be made by combining different model results as proposed in this paper.

The file monitoring part was crucial in the whole workflow to make the system foolproof. Even in cases where the model was not able to flag a file as malware, the file monitor took care of it. Hence the combination of the file monitor with the model reduces system exploits and substantially increases the security of a system.

REFERENCES
[1] https://en.wikipedia.org/wiki/Ensemble_learning
[2] H. Sayadi, N. Patel, S. M. P.D., A. Sasan, S. Rafatirad and H. Homayoun, "Ensemble Learning for Effective Run-Time Hardware-Based Malware Detection: A Comprehensive Analysis and Classification," 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018, pp. 1-6, doi: 10.1109/DAC.2018.8465828.
[3] M. Ficco, "Malware Analysis By Combining Multiple Detectors and Observation Windows," in IEEE Transactions on Computers, doi: 10.1109/TC.2021.3082002.
[4] H. Rathore and S. K. Sahay, "Towards Robust Android Malware Detection Models using Adversarial Learning," 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), 2021, pp. 424-425, doi: 10.1109/PerComWorkshops51409.2021.9430980.
[5] Y. Dai, H. Li, Y. Qian, R. Yang and M. Zheng, "SMASH: A Malware Detection Method Based on Multi-Feature Ensemble Learning," in IEEE Access, vol. 7, pp. 112588-112597, 2019, doi: 10.1109/ACCESS.2019.2934012.
[6] Amer, Eslam and Zelinka, Ivan. (2019). An Ensemble-Based Malware Detection Model Using Minimum Feature Set. MENDEL. 25. 1-10. 10.13164/mendel.2019.2.001.
[7] https://www.kaggle.com/competitions/malware-classification/overview
[8] Hemalatha J, Roseline SA, Geetha S, Kadry S, Damaševičius R. An Efficient DenseNet-Based Deep Learning Model for Malware Detection. Entropy (Basel). 2021 Mar 15;23(3):344. doi: 10.3390/e23030344. PMID: 33804035; PMCID: PMC7998822
[9] https://www.eicar.org/download-anti-malware-testfile/
[10] Chandrashekhar Pomu Chavan and Pallapa Venkataram. Designing a Routing Protocol for Ubiquitous Networks using ECA Scheme. In Fifth International Conference on Advances in Computing and Information Technology during 25-26, 2015 at Chennai, India.
[11] Chandrashekhar Pomu Chavan. Intelligent dynamic routing decisions in ubiquitous network. In IEEE 2022 7th International Conference for Convergence in Technology (I2CT), Pune, Maharashtra, India, 7-9 April 2022.
[12] Chandrashekhar Pomu Chavan, Srinivas Talabattula. Design and Development of Novel Routing Protocol for Ubiquitous Network. In IEEE 2022 7th International Conference for Convergence in Technology (I2CT), Pune, Maharashtra, India, 7-9 April 2022.
[13] Aratrika Ray, Akhil Khubchandan, Siddhartha Shenoy, Canute Rollin Cardoza, and Chandrashekhar Pomu Chavan. Smart emergency reporting system for animals. In IEEE 2022 7th International Conference for Convergence in Technology (I2CT), Pune, Maharashtra, India, 7-9 April 2022.
[14] Chandrashekhar Pomu Chavan and Pallapa Venkataram. Design and Implementation of Event-based Multicast AODV Routing Protocol for Ubiquitous Network. Elsevier Journal, Volume-14(25900056):100129, 2022. DOI: https://doi.org/10.1016/j.array.2022.100129
[15] Chandrashekhar Pomu Chavan and Pallapa Venkataram. Feasible QOS Routing in Ubiquitous Network. Springer Journal of Wireless Personal Communications, 2022 (In print).
[16] A. S. Alva, A. S. Dinesh and C. P. Chavan, "IoT for Enabling Smart Environment System," 2022 International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON), Bangalore, India, 2022, pp. 1-6, doi: 10.1109/SMARTGENCON56628.2022.10083922.
[17] Vignesh L, Nishanth J C, Hari Prasad H R, Jayanth Kumar A and Chandrashekhar Pomu Chavan. Smart Farm Android Application Using IoT and Machine Learning. In IEEE 2023 8th International Conference for Convergence in Technology (I2CT), Pune, Maharashtra, India, 7-9 April 2023.