
Volume 9, Issue 3, March – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24MAR2188

A Framework for Detection of Malicious Code by Exploiting Machine Learning Techniques on Portable Executables
Yash Gajjar*1; Vaishnavi Sharma*2; Sanskruti Bhatt*3; Dr. Maitri Jhaveri4
1,2,3,4 Department of Computer Science, Gujarat University – 380009

Abstract:- Executable files downloaded from the internet expose computer systems to many potential hazards and vulnerabilities in the form of malware. Executables can take the form of raw binaries, mnemonics, libraries, and function calls/APIs, and they can mislead many conventional malware detection techniques. This paper explores the potential of Machine Learning-based methods for malware detection. The scope of the work is currently limited to static analysis of executable files. Various feature selection techniques are implemented to reduce the size of the training data. Machine learning algorithms such as K-Nearest Neighbors and Random Forest Classifier were trained on the curated feature sets. The best experimental result was obtained by the Random Forest Classifier, with an accuracy of 99.5%. We have developed a framework as a two-step module: in the first step, a list of features is extracted from a given executable file; in the second step, a trained model classifies the file as malicious or legitimate. The framework is demonstrated as a web app developed in Python. Furthermore, the framework is evaluated on a small dataset of 35 portable executable (.exe) files, on which it is observed to retain the accuracy of the trained algorithm.

Keywords:- Portable Executables (PE), Malicious Code, Machine Learning (ML).

I. INTRODUCTION

Computers nowadays are an important part of every sector. In this digital age, the transfer of data, information, software, etc. between computer systems and external networks is a common practice that can introduce malware, vulnerabilities, or other risks. Any program or file that purposefully harms a computer, network, or server is known as malware or malicious software. Computer viruses, trojan horses, ransomware, worms and spyware are a few examples of malware. These malicious programs can steal, alter, encrypt, hijack, or delete sensitive data and core computing functions, and they can even monitor a user's computer activity. The way malware harms users or endpoints varies with its type, with consequences ranging from mild to severe and catastrophic.

Executable files coming from the internet pose the highest risk to any computer system. An executable can consist of raw binaries, mnemonics, libraries, and API/function calls. Such files can mislead many traditional malware detection techniques, such as signature matching, checksumming, reduced masks, known-plaintext cryptanalysis, statistical analysis, heuristics, and sandboxing. The next-generation techniques include AI/Machine-Learning-based static analysis, NLP-based techniques, Application Whitelisting, and Endpoint Detection and Response. Machine Learning algorithms can replace the rule-based approach to detecting malicious code: different algorithms can be trained on a dataset consisting of the features of executable files. Such trained models can distinguish between Legitimate and Malicious files and can reduce the tedious work of analyzing executable files manually. Further, these trained models can be retrained on new datasets for better predictions of malicious files.

This paper focuses on Machine Learning-based detection using Portable Executable (PE) files. Windows (both x86 and x64) uses the PE file format, which serves as a structured data container holding the information the Windows OS loader needs to manage the wrapped executable code. The PE format is a file format for executables, object code, DLLs, FON font files, and core dumps. Malicious code is commonly attached to PE files.

The techniques for identifying malware can be categorized into static and dynamic analysis. In static analysis, executable files are not executed; instead, tools and applications are used to obtain the required forensic information and to extract the values of the file's features. In dynamic analysis, the executable files are executed in a safe environment and then observed and classified. The work here is currently limited to the static analysis of PE files. We have developed a framework that extracts features from portable executable files and then classifies these files as legitimate or malicious. We have applied binary classification algorithms to labelled data in this work.

II. RELATED WORKS

(Kim et al., 2020) proposed a static analysis automation technique using machine learning to classify malicious code. Using a variety of algorithms such as Random Forest Classifier, AdaBoost, Gaussian Naïve Bayes, Logistic Regression and Decision Tree, they extracted and classified several distinctive characteristics, including packer information, PE
metadata, and hash value. (Kumar et al., 2019) proposed a technique that uses static analysis to extract features with lower time and resource requirements than dynamic analysis. By combining raw and derived features based on various PE file header field values, they produced an integrated feature set with a classification accuracy of 98%. (Shijo & Salim, 2015) developed an integrated approach using both static and dynamic features for malware detection, and their results show that the support vector machine (SVM) algorithm is best equipped to classify the data. (Chaudhary, 2021) identified the most suitable features to detect malicious executable files using both static and dynamic analysis techniques. A simpler and faster method to distinguish between malware and legitimate .exe files by analyzing some key features from MS Windows PE headers was proposed by (Liao, 2018). He also performed icon extraction to identify malware through its embedded icons, such as prevalent or misleading ones. (Abdessadki & Lazaar, 2019) extract features from the header of each file, which are then used as input for machine learning algorithms that classify PE files without executing them. (Baldangombo et al., 2013) developed a PE-Miner program to parse the PE format of the Windows executables in their dataset. The PE-Miner extracts all PE header information, DLL names, and API function calls inside each DLL contained in a PE file. They use data mining techniques such as Information Gain and PCA transformation, through which the system extracts valuable features from Windows PE files and achieves a high detection rate. (Schultz et al., 2001) extracted information from PE files and proposed an ML-based framework to detect malicious PE files. The authors' dataset contained 4266 samples, of which 3265 are malicious and 1001 are benign files. They used three Machine Learning algorithms: Ripper, Naïve Bayes, and Multi-Naïve Bayes, of which Multi-Naïve Bayes had the highest accuracy and detection rate, about 97.76%. Their framework automatically detects malicious executables, significantly improving detection rates compared to traditional methods.

III. AVAILABILITY OF DATA AND MATERIALS

The raw data was gathered from the malware security partner of Meraz'18, the annual techno-cultural festival of IIT Bhilai. The raw data (CSV) contains the information extracted from several PE files in the form of 55 features. In our work, we have used two datasets, namely dataset-1 (75,502 Legitimate and 140,848 Malicious) and dataset-2 (41,323 Legitimate and 96,724 Malicious). The data is a mixture of categorical and continuous values.
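The features in the raw data are values read from PE files. As a rough illustration of how such values can be obtained, the following sketch reads a few standard header fields and per-section entropies directly from the bytes of a PE file using only the Python standard library. It is not the extraction code used to build the datasets: the function names, the returned keys, and the synthetic sample image are assumptions made for this example, and the field offsets follow the published PE/COFF layout.

```python
import math
import struct
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def extract_features(raw: bytes) -> dict:
    """Read a few header fields and section entropies from raw PE bytes."""
    if raw[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    (pe_offset,) = struct.unpack_from("<I", raw, 0x3C)        # e_lfanew
    if raw[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing 'PE' signature")

    coff = pe_offset + 4                                      # 20-byte COFF header
    machine, n_sections = struct.unpack_from("<HH", raw, coff)
    size_opt, characteristics = struct.unpack_from("<HH", raw, coff + 16)

    opt = coff + 20                                           # optional header
    (magic,) = struct.unpack_from("<H", raw, opt)
    if magic == 0x10B:                                        # PE32
        (image_base,) = struct.unpack_from("<I", raw, opt + 28)
        stack_reserve, stack_commit = struct.unpack_from("<II", raw, opt + 72)
    elif magic == 0x20B:                                      # PE32+ (64-bit)
        (image_base,) = struct.unpack_from("<Q", raw, opt + 24)
        stack_reserve, stack_commit = struct.unpack_from("<QQ", raw, opt + 72)
    else:
        raise ValueError("unknown optional header magic")
    (major_subsystem,) = struct.unpack_from("<H", raw, opt + 48)
    checksum, subsystem, dll_chars = struct.unpack_from("<IHH", raw, opt + 64)

    # Section table: 40-byte entries that follow the optional header.
    table = opt + size_opt
    entropies = []
    for i in range(n_sections):
        size, pointer = struct.unpack_from("<II", raw, table + 40 * i + 16)
        entropies.append(shannon_entropy(raw[pointer:pointer + size]))

    return {
        "Machine": machine,
        "SizeOfOptionalHeader": size_opt,
        "Characteristics": characteristics,
        "MajorSubsystemVersion": major_subsystem,
        "Subsystem": subsystem,
        "DllCharacteristics": dll_chars,
        "ImageBase": image_base,
        "CheckSum": checksum,
        "SizeOfStackReserve": stack_reserve,
        "SizeOfStackCommit": stack_commit,
        "SectionsMeanEntropy": sum(entropies) / len(entropies) if entropies else 0.0,
        "SectionsMaxEntropy": max(entropies, default=0.0),
    }


def build_sample() -> bytes:
    """Assemble a tiny synthetic PE32 image (made-up values, not a real program)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)                   # PE header at 0x40
    coff = struct.pack("<HHIIIHH", 0x14C, 1, 0, 0, 0, 224, 0x0102)
    opt = bytearray(224)
    struct.pack_into("<H", opt, 0, 0x10B)                     # PE32 magic
    struct.pack_into("<I", opt, 28, 0x400000)                 # ImageBase
    struct.pack_into("<H", opt, 48, 6)                        # MajorSubsystemVersion
    struct.pack_into("<IHH", opt, 64, 0, 3, 0x8140)           # CheckSum, Subsystem, DllCharacteristics
    struct.pack_into("<II", opt, 72, 0x100000, 0x1000)        # stack reserve / commit
    data_offset = 0x40 + 4 + 20 + 224 + 40                    # raw data follows the one section entry
    section = bytearray(40)
    section[:5] = b".text"
    struct.pack_into("<II", section, 16, 256, data_offset)    # SizeOfRawData, PointerToRawData
    return bytes(dos) + b"PE\x00\x00" + coff + bytes(opt) + bytes(section) + bytes(range(256))


features = extract_features(build_sample())
```

For a file on disk, the same function could be called on the bytes returned by `open(path, "rb").read()`. A dedicated parser such as the pefile library handles malformed and unusual files far more robustly than this sketch does.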

Table 1: Preview of both Datasets

IV. RESEARCH METHODOLOGY

We propose a machine learning-based model for detecting malicious executable files. The problem is framed as a classic supervised learning problem that classifies an input file into one of two classes, i.e., Malicious or Legitimate.

A. Preprocessing and Feature Selection
Preprocessing steps include the removal of the string feature 'md5', which does not contribute to the classifier model. Detailed examination of the dataset reveals that, out of 55 features, only a limited number carry information that can differentiate between malicious and legitimate files. Hence, feature selection becomes an important part of the process. We have performed feature selection by the following methods:

• Correlation Coefficient Method
Correlation measures the strength of the linear relationship between variables. Features which are important should be highly correlated with the target variable. Furthermore, if two variables are highly correlated with each other, we can drop the one that has the lower correlation coefficient with the target variable, as the model only needs one of them and the second one does not add any information.

• Chi-Square (χ2) Method
The Chi-square (χ2) test can be used as a feature selection method when the dataset contains categorical features. The chi-square distribution is a sampling distribution and is a family of probability distributions indexed by the number of degrees of freedom (df). A chi-square variable cannot be negative, and the area under each chi-square distribution is equal to 1.00, or 100%. The chi-square value is calculated between each feature and the target variable, and the features with the highest chi-square values are selected, as many as required. The purpose of chi-square analysis is not to identify the exact nature of a relationship between nominal variables but simply to test whether the variables could be independent of each other.
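The correlation and chi-square filters can be sketched in a few lines of standard-library Python. The first helper ranks features by absolute Pearson correlation with the target and skips a feature that is highly correlated with one already kept; the second computes the classic contingency-table chi-square statistic between a categorical feature and the target. The toy columns f1, f2, f3, the 0.9 redundancy threshold, and the helper names are assumptions made for illustration, and the sketch is not necessarily the implementation used in the experiments.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(columns, target, k, redundancy=0.9):
    """Keep up to k features, strongest |r| with the target first.

    A feature is skipped when it is highly correlated with a feature that
    was already kept, since the kept one has the stronger link to the target.
    """
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        redundant = any(abs(pearson(columns[name], columns[other])) >= redundancy
                        for other in kept)
        if not redundant:
            kept.append(name)
        if len(kept) == k:
            break
    return kept


def chi_square(feature, target):
    """Chi-square statistic of the feature/target contingency table."""
    n = len(feature)
    feature_counts, target_counts = Counter(feature), Counter(target)
    observed = Counter(zip(feature, target))
    stat = 0.0
    for f_value, f_count in feature_counts.items():
        for t_value, t_count in target_counts.items():
            expected = f_count * t_count / n  # count expected under independence
            stat += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return stat


# Toy data: 0 = Legitimate, 1 = Malicious (hypothetical values).
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "f1": [1, 2, 1, 2, 8, 9, 8, 9],  # strongly related to the target
    "f2": [2, 3, 2, 3, 7, 9, 7, 9],  # nearly duplicates f1
    "f3": [5, 1, 4, 2, 5, 1, 4, 2],  # unrelated to the target
}
selected = select_by_correlation(columns, target, k=2)       # ['f1', 'f3']
chi_dependent = chi_square([0, 0, 0, 0, 1, 1, 1, 1], target)    # 8.0
chi_independent = chi_square([0, 1, 0, 1, 0, 1, 0, 1], target)  # 0.0
```

In the toy run f2 is dropped because it is almost a copy of f1, and a feature that is independent of the target scores a chi-square value of zero.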

• Gini Impurity-based Method
This approach assesses the reduction in Gini impurity achieved by each feature when it is used to partition the data. The reduction is weighted by the proportion of data points that reach the split. Features leading to greater drops in Gini impurity are considered more significant. The feature importance values are normalized so that they sum to 1. The formula for calculating Gini importance for a given feature Xi involves various parameters related to the nodes and splits in the decision trees (Breiman, 1984). This implementation uses the scikit-learn Python package for training and for extracting feature importance from the Random Forest Classifier.

B. Classification Framework
The classification framework is built as a web application. To build the framework, machine learning models such as Random Forest Classifier, Decision Tree Classifier, K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes are trained on various feature sets curated through feature selection. These algorithms are trained and tested on CSV datasets. The best-performing combinations of feature sets and trained models are then used for integration with the framework. The work is split into two modules: the first module acts as an extractor that extracts the values of features from the portable executable taken as input and passes them to the second module, where the trained model is integrated for classification.

Fig 1: Flowchart of the Classification Framework

C. Experimentation
We have conducted several experiments. In the first experiment, we chose to train five classifiers, namely Decision Tree Classifier, Random Forest Classifier, Gaussian Naïve Bayes, K-Nearest Neighbors, and Support Vector Machine. The data used for training has all the features except md5 (54 features). The train-test split criterion is kept the same for all the experiments conducted on dataset-1, i.e., 70% of the data is used for training and 30% for testing. The accuracy scores obtained by the above-mentioned algorithms in experiment 1 are shown in Table 2.

Table 2: Accuracy Scores of Algorithms from Experiment 1

Algorithm                          Accuracy
Decision Tree Classifier (DTC)     98.21%
Random Forest Classifier (RFC)     98.93%
Gaussian Naïve Bayes (GNB)         65.06%
K-Nearest Neighbors (KNN)          97.71%
Support Vector Machine (SVM)       65.06%

Out of all the trained models, the accuracy of KNN and RFC is very high. Random Forest is a combination of n decision trees (n being a hyperparameter), and hence we do not consider the DTC results separately. Therefore, RFC and KNN are chosen for training in the succeeding experiments.

In experiment 2, the feature set used for training contains 10 features selected by the Correlation method, and training and testing are done on dataset-1. In experiment 3, the feature set used for training has 10 features selected by the χ2-test, which shows how much a nominal feature depends on the target. Here as well, dataset-1 is used for training and testing. In experiment 4, the feature set used for training has 12 features (top 10 from the Correlation method and top 2 from the χ2-test). In experiment 5, the feature set of experiment 4 is used, but training is done on dataset-1 and testing on dataset-2. In experiment 6, the feature set of experiment 4 is used, but dataset-1 is balanced by under-sampling the malicious class for training, and testing is done on dataset-2. Finally, in experiment 7, an updated feature set is curated which has 12 features (top 8 from the Correlation method and top 4 from the χ2-test). Balanced dataset-1 is used for training and dataset-2 for testing. The accuracy scores of both classifiers in the above-mentioned experiments are shown in the figure below.

Fig 2: Accuracy scores of RFC and KNN from Experiment 2 to Experiment 7

V. RESULTS

For the web application, the trained Random Forest Classifier from experiment 7 is selected, which has an accuracy of 99.50%. We selected this model for integration in the web app because the features used for training in this experiment are computationally convenient to extract
from the raw PE file as compared to other features. The features selected for experiment 7, and eventually for the framework, are shown in Table 3.

Table 3: Updated Combined Feature Set Used in Experiment 7 and in our Framework

Feature names
Machine                    SectionsMeanEntropy
SizeOfOptionalHeader       SectionsMaxEntropy
Characteristics            SizeOfStackCommit
MajorSubsystemVersion      SizeOfStackReserve
Subsystem                  ImageBase
DllCharacteristics         CheckSum

VI. DISCUSSION

From the results of all the experiments, it can be concluded that the Random Forest Classifier, an ensemble model, achieves comparatively better accuracy, and it is therefore integrated into our framework. Up until now, the models have been tested on CSV files, and hence the framework needs to be evaluated on its performance on PE files. A total of 35 PE files were downloaded; among them, 16 files are legitimate, obtained from www.exefiles.com [accessed on 12 August 2022], and 19 files are malicious, obtained from www.tekdefense.com [accessed on 12 August 2022]. The web app predicted all 35 files correctly, which indicates that, on this small dataset, the framework retains the accuracy of the trained model (99.50%).

VII. CONCLUSION

We observe that Machine Learning-based classification algorithms are able to classify an executable file as malicious or legitimate. Our best model is the Random Forest Classifier with 12 features (a fusion of features selected by the Correlation and Chi-square methods), having an accuracy of 99.50%, which is integrated into our framework. The developed framework for detecting malicious files is found to be robust. The current work, which focuses on static analysis of executable files, might be applied further to executables of different extensions as well.

FUTURE SCOPE
The authors intend to extend the work using Deep Learning algorithms to improve the efficiency of the developed web application. Furthermore, this work can be extended to multi-class classification of malware and to executables of different extensions.

• Data can be accessed through:
https://www.kaggle.com/competitions/malware-detection/data.

ACKNOWLEDGEMENTS

The work in this paper was done on Google Colab notebooks, and the authors are especially thankful to Ero Carrera for the pefile library (Carrera Ventura, 2022).

REFERENCES
[1]. Abdessadki, I., & Lazaar, S. (2019). A New Classification Based Model for Malicious PE Files Detection. International Journal of Computer Network and Information Security, 11(6), 1–9. https://doi.org/10.5815/ijcnis.2019.06.01
[2]. Baldangombo, U., Jambaljav, N., & Horng, S. (2013). A Static Malware Detection System Using. 4(4), 113–126.
[3]. Breiman, L. (1984). Classification and Regression Trees. Taylor & Francis.
[4]. Carrera Ventura, E. (2022). pefile (2022.5.30). https://github.com/erocarrera/pefile
[5]. Chaudhary, P. (2021). PE File-Based Malware Detection Using Machine Learning. January. https://doi.org/10.1007/978-981-15-4992-2
[6]. Kim, S., Yeom, S., Oh, H., Shin, D., & Shin, D. (2020). Automatic malicious code classification system through static analysis using machine learning. Symmetry, 13(1), 1–11. https://doi.org/10.3390/sym13010035
[7]. Kumar, A., Kuppusamy, K. S., & Aghila, G. (2019). A learning model to detect maliciousness of portable executable using integrated feature set. Journal of King Saud University - Computer and Information Sciences, 31(2), 252–265. https://doi.org/10.1016/j.jksuci.2017.01.003
[8]. Liao, Y. (2018). PE-Header-Based Malware Study and Detection. 4.
[9]. Schultz, M. G., Eskin, E., Zadok, E., & Stolfo, S. J. (2001). Data mining methods for detection of new malicious executables. Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy, February 2001, 38–49. https://doi.org/10.1109/secpri.2001.924286
[10]. Shijo, P. V., & Salim, A. (2015). Integrated static and dynamic analysis for malware detection. Procedia Computer Science, 46 (ICICT 2014), 804–811. https://doi.org/10.1016/j.procs.2015.02.149
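As an illustration of the Gini impurity-based ranking described in Section IV.A, the sketch below computes the weighted impurity decrease for one candidate split per feature and normalizes the results so that they sum to 1. It is a simplified, single-node illustration and not the scikit-learn implementation used in the paper: in a Random Forest the decreases are accumulated over every node that splits on a feature and averaged over the trees. The feature names and label lists are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples reaching it."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


def normalize(scores):
    """Scale importance scores so that they sum to 1."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}


# Labels at the root node: 0 = Legitimate, 1 = Malicious (hypothetical).
parent = [0, 0, 0, 0, 1, 1, 1, 1]
# Left/right child labels produced by splitting on each hypothetical feature.
splits = {
    "feature_a": ([0, 0, 0, 0], [1, 1, 1, 1]),  # perfect separation
    "feature_b": ([0, 0, 1, 1], [0, 0, 1, 1]),  # no separation at all
    "feature_c": ([0, 0, 0, 1], [0, 1, 1, 1]),  # partial separation
}
raw_scores = {
    name: weighted_decrease(parent, left, right, n_total=len(parent))
    for name, (left, right) in splits.items()
}
importance = normalize(raw_scores)
```

In this toy case the perfectly separating feature receives most of the importance, the uninformative one receives none, and the three values add up to 1, mirroring the normalization step described in the text.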
