Malware Detection in PE files using Machine Learning
2022 OPJU International Technology Conference on Emerging Technologies for Sustainable Development (OTCON) | 978-1-6654-9294-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/OTCON56053.2023.10113998

Samarth Tyagi (Dept of CSE), Achintya Baghela (Dept of IT), Kashif Majid Dar (Dept of CSE), Anwesh Patel (Dept of CSE), Sonali Kothari (Dept of CSE), Snehal Bhosale (Dept of CSE)
Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India

Abstract: Malware has become one of the most challenging threats to the computer domain. Malware is malicious code mainly used to gain access and collect confidential information without permission. Internet coverage has grown rapidly, leading people to download various files and install executable files such as .exe, .bat, and .msi files. This leads to many complications, as these files are a vector for malicious code. In this paper, we present a technique to detect malicious executable files by a detailed examination of the Portable Executable (PE) files that come along with the executable files. Our approach uses static analysis to extract features from PE files. We use these features with supervised learning algorithms to classify malware. We also compare the performance of different algorithms to determine the best way to approach our problem.

Keywords: Malware Detection, Static malware analysis, Opcode N-grams, Byte N-grams, PE Header, PE file, Machine Learning

I. INTRODUCTION
Malware has become a household term in the past few years. There has been a substantial rise in news about the damage done by malware to big companies. Multinational companies have lost a fortune due to malware attacks on their systems and networks. Individuals and small businesses have also been affected by malware in terms of data loss. Malware is specially designed software that can self-replicate and hide in plain sight. This makes it very difficult to detect malware without a large infrastructure.

Malware detection is the process of discovering pieces of malicious code in networks or end devices. Two types of malware detection techniques are in use today, namely "signature-based" and "non-signature-based". The non-signature-based technique was developed to tackle the morphing behavior of modern malware. This, along with the diversity of malware, has made the highly efficient signature-based detection obsolete. The non-signature-based technique can be combined with machine learning to achieve higher accuracy. Non-signature-based techniques detect malware by using tools of observation; this makes them harder to train, but it helps to detect zero-day attacks and unknown variants of malware.

Portable Executable (PE) files are executable files used in Windows operating systems. These files are very important for execution, as they provide binary code that can be used by different executable files. All PE files follow a defined structure, which includes a header containing a number of fields. These fields are very important in our study, as they can be used as features for training machine learning algorithms.

II. LITERATURE REVIEW

The authors of [3] (2019) describe a learning model for determining whether portable executables are harmful, investigating machine learning methods to classify a file accurately and efficiently as malicious or benign. To categorize malware, several machine learning techniques were used, including Random Forest, Decision Tree, Logistic Regression, k-NN, Naive Bayes, and Linear Discriminant Analysis. They assessed the performance of each classifier using the existing raw feature set and the proposed integrated feature set. The integrated feature set reached a classification accuracy of 98.4% under 10-fold cross-validation. On a separate test data set, the accuracy for the integrated feature set was 89.23%, a 15% improvement over the accuracy obtained with the raw feature set alone. They also tested how accurate classification could be with just the top N features (N = 5, 10, 15, 20, 25). Using only the top 15 features, accuracy was as high as 98% and 97% on integrated and raw features, respectively.

Another work [4] (2018) discusses obfuscation, the technique that malware programmers use to make malware challenging to read and comprehend. Its primary goal is to mask the malware's destructive activity. Different researchers categorize obfuscation strategies differently. The following categories,


Authorized licensed use limited to: PUNJAB UNIVERSITY. Downloaded on November 18,2024 at 04:23:31 UTC from IEEE Xplore. Restrictions apply.
instruction replacement, subroutine reordering, code transposition, code integration, and dead code insertion, include the most common tactics. The author then categorizes various malware analysis methodologies and malware detection techniques.

Schultz et al. [7] employed three distinct classification models. These used strings extracted from executable files, selected byte sequences from the executable file, and DLLs and functions from the header of a Portable Executable. It is thought to be the first instance of machine learning being used to address malware detection. They achieved a remarkable detection rate, which indicated that malware detection using machine learning approaches would be possible in the future.

An elegant, straightforward statistical method based on hypothesis testing was proposed by Merkel et al. [9]. Each executable received points based on its PE header data to indicate its level of hazard (more points meaning greater maliciousness). The software was then classified as hazardous or benign against a benchmark. Conditional probability was used to establish the distribution of the points.

III. MALWARE

A. What is Malware
A program that is intended to compromise a digital system is referred to as malicious software, or malware. Malware samples were initially created to examine flaws in computer programming and architecture for testing purposes. Malware has since progressed quickly from personal computers to attacking practically any modern device, including ATMs and mobile phones.

Once malware is activated or its code is running on the computer, it becomes very difficult to find, and an ordinary user may not notice it until all their data is stolen. Attackers make malware look appealing in order to get people to install it. When malware is installed, it conceals itself in several folders on the computer and, if it is very sophisticated, can gain direct access to the operating system. After that, it records sensitive data and encrypts files.

According to a number of studies, malware will continue to diversify and become more complex in the future.

B. Types of Malware
Various organizations broadly classify malware into the following categories:
1. Worms: Malware that self-replicates and spreads to other computers through a computer network. Worms can corrupt and steal data, install backdoors for hackers, and consume memory and bandwidth.

2. Virus: Frederick B. Cohen introduced the word "virus" in his Ph.D. thesis in 1986. His definition of the term is: "A program that can infect other programs by modifying them to include a, possibly evolved, version of itself."

3. Trojan horse: Software that disguises itself as a legitimate program and, when downloaded and installed, is used to disrupt or damage data or a network.

4. Spyware: Malware that installs itself on a computer to monitor the user's behavior and steal sensitive information. Additionally, attackers may be given remote access to the infected system through spyware.

5. Adware: This kind of software is designed to help companies generate more revenue by automatically displaying advertisement banners and pop-ups while another program is running.

6. Ransomware: A program that threatens to perpetually block or publish personal user data unless a ransom is paid. The attacker uses a disguised link to trick the user into downloading the malware file, which then encrypts the user's data. The data can then only be unlocked with a secret key, which is usually promised to be revealed upon payment.

IV. MACHINE LEARNING

A. What is machine learning?
A component of the larger field of artificial intelligence, machine learning enables a machine to use real-world data to solve a problem. Combining statistics, applied mathematics, and computer science, machine learning enables computers to automatically improve their performance based on past experience, without the need for new programming. A model training procedure is used to create a model that generalizes well to new data.

B. Types of Machine Learning
Machine learning is generally classified into the following three paradigms:
1. Supervised Learning: This methodology uses real-world datasets consisting of training features and associated labels to generate an inferred function that can then be used to label unseen and unlabelled data.

2. Unsupervised Learning: In this technique, a machine learning model is trained on an unlabeled and uncategorized dataset and must independently find patterns in the data, without labels to guide it.

3. Reinforcement Learning: This approach trains an intelligent agent through reward and punishment of its behavior; the agent takes actions intended to maximize the cumulative reward.

A subset of machine learning known as deep learning systematically extracts features from raw data at ever higher levels of abstraction. The layers of deep learning algorithms are organized in a hierarchy of increasing complexity.

V. PORTABLE EXECUTABLE FILE FORMAT
A. What is a Portable Executable format?

Many fields in the PE format have no mandatory constraints, and the format contains many redundant fields and spaces, creating opportunities for malware propagation and malware attacks. A PE file contains the PE file header, section table, and

section data. These hold many pieces of data that are valuable to malware analysts, including imports, exports, time-date stamps, subsystems, sections, and resources. A PE file's fundamental structure is as follows:

Fig. 1: PE File Format
Source: Basic structure of PE file, oreilly.com

The header portion comprises details about the code, the application type, the required library or kernel functions, and the required amount of memory. The first 64 bytes of each Portable Executable are taken up by the DOS header. This lets DOS validate the program and then run the DOS stub. The PE header holds information including the size and location of the code. The PE part of the file contains the majority of its material.

B. Section Segment Composition
The section segment contains the following components:
1. DOS MZ header: specifies that the file is an executable. It holds the magic number, which is set to 0x5A4D (the ASCII characters "MZ") in all MS-DOS-compatible executable files.
2. DOS stub: merely prints a message such as "program can't run in DOS", which serves as a compatibility check.
3. Image header: contains details about the executable file, such as entry points and subsystems.
4. Section table: describes the properties of each section, including its virtual size, the size of its raw data, a pointer to its raw data, and how it is loaded into memory.
5. .text: the only section that includes code; it holds the instructions that the central processing unit executes.
6. .rdata: stores information that the application can access globally but only in read-only mode. The import and export data are stored in the .idata and .edata sections, which are only sometimes present. If these do not exist separately, their data is kept in the .rdata section.
7. .data: stores the data that the program reads and writes globally. The PE file has no section in which local data is kept.
8. .rsrc: stores the resources that the program uses. Strings are typically kept in this section to support multi-language use.
9. .bss: represents the application's uninitialized data.
10. .edata: contains the export table, which lists the data and functions that the file exports for use by other modules.
11. .idata: contains the import table, which lists the data and functions that the file imports from DLLs.

VI. RESEARCH WORKFLOW
Step 1 Data Collection: Downloading the dataset from VirusShare and Kaggle.

Step 2 Data Analysis and Feature Extraction: Extracting features from the downloaded files.

Step 3 Normalization of the features.

Step 4 Feature reduction: Selecting the best features from the extracted features, as there are so many of them.

Step 5 Merging the extracted features.

Step 6 Dividing the feature set into a training feature set and a test feature set, then training and testing different classifier models.

Step 7 Model validation.

VII. METHODOLOGY
A. Malware Dataset
The dataset is downloaded from Kaggle. It contains PE files of the exe, dll, byte, and asm file types. Our total dataset contains 137444 files, of which 40918 are malware files and 96526 are benign files.

Among the various header fields, feature selection yielded 13 important features:

● DllCharacteristics
● Characteristics
● Machine
● VersionInformationSize
● Subsystem
● ImageBase
● SizeOfOptionalHeader
● MajorSubsystemVersion
● SectionsMaxEntropy
● ResourcesMaxEntropy
● ResourcesMinEntropy
● MajorOperatingSystemVersion
● SectionsMinEntropy

B. Feature Extraction
PE files have many format properties; however, the majority of them are of no use for telling malicious software from safe software. According to our extensive research of the format characteristics of the Portable Executable files and our
characteristics of the Portable Executable files and our

practical investigations, we found 54 elements from the supplied PE files that may distinguish between safe software and malware. The discussion that follows summarizes the features that were extracted for our investigation. Nine significant ones are also described.

● DOS Header: This field designates a file type that is compatible with MS-DOS. Its value is always set to 0x5A4D in MS-DOS-compatible executable files, which corresponds to the ASCII characters MZ. For this reason, MS-DOS headers are also known as MZ headers. The header begins at offset 0 (viewable with a hex editor).

● DOS Stub: The typical output of the DOS stub is only a string, such as the warning "This program cannot be launched in DOS mode." It may also be a complete DOS program. When creating Windows applications, the linker places a stub binary called winstub.exe into the executable file. The DOS header field at offset 0x3c holds the offset of the PE header section that follows.

● PE File Header: A PE file, similar to other executable files, contains several fields that specify how the rest of the file is organized. As discussed previously, the header provides information such as the size and placement of the code. The MS-DOS stub takes up the first few hundred bytes of a normal PE file. The PE file header is found by indexing the e_lfanew field of the MS-DOS header. To obtain the actual memory-mapped location, the offset given by e_lfanew must be added to the file's memory-mapped base address. The header section provides the following list of essential sub-sections:
● Signature
● Machine
● NumberOfSections
● SizeOfOptionalHeader
As we can see, there are many header fields. Since there isn't enough room to go into detail about each one, we discuss only some of the most crucial elements.

C. Characteristics
● Signature: It contains only the signature that lets the Windows loader recognize the file: the letters "PE" followed by two zero bytes.

● NumberOfSections: This field defines the size of the section table, which comes right after the headers.
● SizeOfOptionalHeader: The optional header lies between the file header and the beginning of the section table. This field gives the optional header's size, which is required for executable files. An object file should have this value set to zero.
● Characteristics: These flags indicate attributes of the object or image file. For example, the flag IMAGE_FILE_DLL, which has the value 0x2000, identifies the image as a DLL. The field has additional flags that are not needed here.
● Image_Optional_Header: This optional header contains the majority of the important details about the image, such as the initial stack size, program entry point location, preferred base address, operating system version, and section alignment information. The data is displayed in the image below.
● ImageBase: the image's preferred address after loading into memory. The default address is 0x00400000. An attacker can change this address as required, for example with the linker's "-BASE:" option.
● SectionAlignment: the alignment of the sections after loading into memory. The section alignment cannot be smaller than the page size.
● FileAlignment: the granularity of section alignment within the file. For instance, if the value in this field is 512 (200h), each section must begin at a multiple of 512 bytes. Suppose the first section is 10 bytes long and is located at file offset 200h. The subsequent section must then begin at file offset 400h, and the space between file offsets 522 and 1024 is blank or undefined.
● MajorSubsystemVersion: the major version number of the Windows NT Win32 subsystem, for example the value 3 for Windows NT version 3.10.
● SizeOfImage: the overall size of the image in memory, including all headers. It must be a multiple of SectionAlignment when the image is loaded into memory.

VIII. IMPLEMENTATION

○ Importing Modules
Pip is used to install all the packages required for the project. The modules are then imported into the files.

For learning.py -
For checker.py-

A. Handling Data
Our data includes the file name; the rest consists of PE file header and section values, separated by '|'.

We first load the .csv file with the Pandas module in the VS Code IDE before training on the data with supervised learning.

This way we are now just working with 'Legitimate' files. This is the first step in creating a machine learning model.

B. Algorithm training

We create a dictionary in which each key is the name of an algorithm and each value is the corresponding classifier, and then loop over it: algorithms = { }.
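The dictionary-and-loop pattern described for algorithm training might look like the following minimal sketch. It assumes scikit-learn is the library in use (the paper refers to scikit-learn elsewhere); the synthetic data from make_classification, the hyperparameter values, and the names results and winner are illustrative assumptions, not the authors' code or dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for a table of 13 selected PE features.
X, y = make_classification(
    n_samples=400, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Key = name of the algorithm, value = classifier object.
algorithms = {
    "DecisionTree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "NaiveBayes": GaussianNB(),
}

# Train each classifier and record its accuracy on the held-out test set.
results = {}
for name, clf in algorithms.items():
    clf.fit(X_train, y_train)
    results[name] = clf.score(X_test, y_test)

# The "winner" is the classifier with the highest test accuracy.
winner = max(results, key=results.get)
print(winner, results[winner])
```

Keeping the classifiers in one dictionary means adding or removing a candidate algorithm is a one-line change, and the comparison loop stays the same.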

● Random Forest: a classification algorithm that consists of several decision trees. By employing feature randomness and bagging when building each tree, it produces a group of uncorrelated trees whose collective prediction is more accurate than that of any single tree [10].
● Decision Trees: These train faster than neural network algorithms. The time complexity of decision trees depends on the number of records and features in the input data. The decision tree is a non-parametric, or distribution-free, approach that does not rely on assumptions about a probability distribution. Decision trees are capable of handling high-dimensional data with accuracy.
● Gradient Boosting and AdaBoost: "Ada" stands for "adaptive". Both algorithms are boosting methods, which build one strong learner out of a set of weak ones. Both begin with a weak learner, frequently a decision tree, and then iteratively add further weak learners to the ensemble. They differ in how the weak learners are produced during this iterative procedure.
● Bayes Theorem: Conditional probability is the foundation of Bayes' theorem. Conditional probability helps in calculating the likelihood that a given event will occur.

C. Algorithm testing
We loop iteratively over the dictionary of algorithms.
It is a methodological error to learn the parameters of a prediction function and assess it on the same set of data. A model that just repeats the labels of the samples it has seen would score perfectly, yet it could not make useful predictions on yet-to-be-seen data. This situation is called overfitting. To avoid this problem, it is common when conducting a (supervised) machine learning experiment to set aside a portion of the available data as a test set (X_test, y_test) [11].
It should be emphasised that the term "experiment" does not apply only to academic endeavours, because machine learning experiments can also start in business settings. The following flowchart displays the typical cross-validation process used in model training. The best parameters may be found using grid search techniques.

Fig. 2. Working of Machine Learning Algorithm
Source: Flowchart of typical cross-validation workflow in model training, scikit-learn.org

a. Saving the features list

The feature list and the trained algorithm should be saved for future predictions.
Any Python object can be saved into a single file.
Depending on the type of classifier used, joblib may improve performance and file size, because it works particularly well with NumPy arrays, which scikit-learn uses.
Otherwise, pickle works properly, so regardless of which serialization library is used, saving and loading a trained classifier gives the same results.

b. Entropy calculation
Entropy is generally understood as a measure of the amount of information in data. Similarly, file entropy measures the amount of information contained in a specific file. Several security-related programs can scan a file and extract different pieces of information to see whether it contains malware. Malware prevention also makes use of file entropy [12]. During malware analysis, it is also a good idea to quickly check whether the malware file has been packed with any other software.

We store the resources of the test file in a dictionary res{}; this is the file on which we test the machine learning model developed in learning.py. For this, we call three functions, all using the pefile module:

1. def get_resources(peFile): extracts resources [entropy, size]
2. def get_version_info(pe): returns the version information of the test.exe file
3. def extract_infos(path): collects all the header fields and features in a dictionary called res{}

Now that we have extracted all the features and header fields from test.exe, we run our machine learning model to predict whether test.exe is malicious or legitimate [13].

D. Analyzing alien file
Here, we upload the alien PE file and run malware analysis on it.
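The entropy features named earlier (SectionsMinEntropy, SectionsMaxEntropy, ResourcesMinEntropy, ResourcesMaxEntropy) rest on the Shannon entropy of a block of bytes. The following is a minimal sketch using only the standard library; the function names shannon_entropy and entropy_summary are hypothetical, and the byte blocks stand in for section or resource contents that would in practice be read from the PE file.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_summary(blocks):
    """Hypothetical helper: (min, max) entropy over a list of byte blocks."""
    values = [shannon_entropy(block) for block in blocks]
    return (min(values), max(values)) if values else (0.0, 0.0)


# Illustrative blocks: constant bytes (lowest entropy) and every byte
# value exactly once (highest possible entropy for byte data).
constant_block = b"\x00" * 1024
uniform_block = bytes(range(256))

print(entropy_summary([constant_block, uniform_block]))
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is why entropy is a quick indicator that a file may have been packed.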

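When the algorithms are tested on the held-out data, the false positive and false negative rates can be computed from the true and predicted labels. A minimal sketch, assuming binary labels with malware as the positive class (label 1); the function name error_rates and the sample labels are hypothetical and do not come from the paper's dataset.

```python
def error_rates(y_true, y_pred, positive=1):
    """Return (false_positive_rate, false_negative_rate) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # malware correctly flagged
            else:
                fp += 1  # benign file wrongly flagged
        else:
            if truth == positive:
                fn += 1  # malware missed
            else:
                tn += 1  # benign file correctly passed
    # FPR = FP / (FP + TN); FNR = FN / (FN + TP). Guard against empty classes.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Illustrative labels only: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(error_rates(y_true, y_pred))
```

For a malware detector the two rates have different costs: a false negative lets malware through, while a false positive blocks a legitimate file, so it is useful to report both alongside accuracy.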
Fig. 3 Sample training and testing results
Source: Primary

IX. RESULT

Output of the learning.py file: The learning.py code produces the following output. It shortlists 13 important features from the set of 54 features available in the header of a PE file, tests the machine learning algorithms on the given dataset to determine a winning algorithm, and calculates the false negative and false positive rates. The false negative rate is the rate at which the model incorrectly predicts that an attribute is absent. The false positive rate is the rate at which the model incorrectly predicts that an attribute is present.
Output of the checkpe.py: The following image is the output of the checkpe.py Python file, which tells the user whether the file entered is malicious or not.

Fig. 4 Sample output
Source: Primary

X. CONCLUSION
The aim of this paper is to present a machine learning approach to the malware problem. We require automated approaches to find infected files, since malware has grown so quickly lately. In the first phase of the work, the data set is created from infected and clean executables; to extract the data necessary for creating the data set, we used a script written in Python. The data set must be ready before machine learning algorithms can be created and trained. Decision Trees, Random Forest, Naive Bayes, Gradient Boosting, and AdaBoost are the methods that were used and compared. Among them, the Random Forest algorithm achieved the best accuracy, 99.406012%. This work indicates that, among the algorithms compared, Random Forest is the best for detecting malicious programs. In the future, this accuracy can be improved if we add a much larger number of files to the data set used to train the algorithms. Each algorithm has several parameters that can be tested with different values to increase its accuracy. This project can reach the application level with the help of a library called pickle, which saves what the algorithm has learned. Static analysis has been shown to be safer and free from the overhead of execution time, and we may use it to test a new file to check whether it is clean or infected.

REFERENCES

[1] Malware Analysis: basic static analysis, Aditya Anand, 2019
[2] Deep learning-based Malware detection using PE headers. Arnas Nakrosis, 2022
[3] Learning model to detect maliciousness of portable executables using an integrated feature set, Kumar A, Kuppusamy K, Aghila G, 2019
[4] A Study on Malware and Malware Detection Techniques, Rabia Tahir, 2018
[5] Improved Deep Learning model for static PE files malware detection and classification, Ajay Kumar, Kumar Abhishek, Divy Patel, Yash Jain, Harsh Chheda, Pranav Kumar, 2022
[6] Static analysis and machine learning-based malware detection system using PE header feature values. Chang Keun Yuk, Chang Jin Seo, 2022
[7] Schultz, M. G.; Eskin, E.; et al. Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, IEEE, 2001
[8] Kozachok, A.; Kozachok, V. Construction and evaluation of the new heuristic malware detection mechanism based on executable files static analysis. Journal of Computer Virology and Hacking Techniques, 2017
[9] Merkel, R.; Hoppe, T.; et al. Statistical Detection of Malicious PE-Executables for Fast Offline Analysis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010
[10] Wang, T.-Y.; Wu, C.-H.; et al. Detecting unknown malicious executables using portable executable headers. In INC, IMS and IDC, 2009. NCM'09. Fifth International Joint Conference on, IEEE, 2009, pp. 278–284.
[11] Tina Rezaei and Ali Hamze. An efficient approach for malware detection using PE header specifications. In 2020 6th International Conference on Web Research (ICWR).
[12] Kebede, S.D., Tiwari, B., Tiwari, V. and Chandravanshi, K., "Predictive machine learning-based integrated approach for DDoS detection and prevention". Multimedia Tools and Applications, pp. 1-27, 2021, Springer
[13] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici. "Detecting unknown malicious code by applying classification techniques on opcode patterns." Security Informatics, 1(1), Feb. 2012.
