Malware Analysis Using Memory Forensics
https://doi.org/10.22214/ijraset.2022.47021
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue X Oct 2022- Available at www.ijraset.com
Abstract: Malware remains one of the most dangerous threats facing internet users in today's online environment. Newly created
malware differs from the traditional kind: it has a more dynamic design and typically combines traits from two or more
different malware types. This study compares the available memory acquisition tools, whose performance varies with the
setup, the installed hardware, and the operating system version.
To address the growing malware issue, new methodologies such as machine learning must be employed. This study investigates how
machine learning is used in cybersecurity for malware detection. It examines the PE (portable
executable) headers of malware and non-malware samples in order to build a classifier that can identify whether a file is
malware. It also addresses behavior-based detection methods and the development of behavior-based malware detection and
classification methods using various machine learning approaches.
Keywords: Cybersecurity, detection, malware, machine learning, PE headers, classifier, preparation, memory forensics
I. INTRODUCTION
The internet, computers, and smart devices are now in widespread daily use. Just as there are people with good and bad
intentions everywhere in the physical world, the online world has its share of malicious actors who exploit trusting users
for their own gain [19]. Everyone from individuals to the most secure and confidential corporations in the world relies on
computers and the internet because of the wide range of services offered, including communication, e-learning, entertainment,
e-banking, and shopping [1]. Private information is frequently stored on these systems, which are vulnerable to attack. Such
attacks can harm these systems and result in large fines [2]. Malicious software, also referred to as malware, is one of the
most dangerous risks. Malware is any program that can disrupt a system by removing, adding, or changing data or system
software [3]. As online services are used by an increasing number of people [5], there are more persons on the web with
malicious motives who prey on trusting customers whenever money is involved [9]. Examples of commercial services on the
internet include online banking and advertising. Malware is the tool that lets these individuals carry out their objectives.
Malware attacks have grown in complexity over the past few years, and because malware keeps evolving it remains the most
significant threat to the online world, despite advances in detecting it and classifying it into the correct family.
Malware comes in a variety of forms depending on its purpose, such as ransomware used for financial gain and spyware used
for spying [1]. To analyze malware using machine learning, we need a fundamental understanding of the types of malware and
the methods they employ. The main types, distinguished by their behavior, are described below [9].
1) Virus: A virus is a program like any other. The main difference is that it runs within the system without the prior
permission of the user and replicates itself to infect other programs on the machine.
2) Worm: A worm is an enhanced form of the virus. The main difference is that every machine connected
over the network is in danger of being infected by the worm.
3) Trojan: A Trojan is designed to look like legitimate software, so that users are tricked into treating
malicious software as legitimate.
4) Adware: The main aim of this kind of malware is to display advertisements on the target machine.
5) Spyware: As the name suggests, this malware spies on the target and delivers sensitive information to the attacker.
6) Backdoor: A backdoor is a type of malware used to create a hidden entry point into the target machine. By itself, it
does not badly affect the system.
7) Ransomware: Ransomware is the most popular form of malware nowadays. It encrypts all user data on the machine
and demands a ransom to restore the machine to normal.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 488
A. Memory Forensics
The following is the general approach in memory forensics, regardless of whether it is used for malware investigation or incident
response [7].
Memory acquisition: This entails acquiring (or dumping) a target machine's memory to disk. Depending on whether you are
investigating an infected system or using memory forensics as part of your malware analysis, the target machine is either a
system on your network that you suspect to be infected or an analysis machine in your lab environment where you executed
the malware sample [9].
Memory analysis: This stage entails analyzing the memory after it has been dumped to disk in order to find and extract
forensic artifacts.
One set of suspicious features might include "connection established to an unusual destination," "file or program permission
altered implicitly," "unwanted registry key modifications made," and so on. Any software that triggers more of these
features than a defined threshold can then be deemed hazardous to the machine. Although this approach has some success, it
does not always work, because each feature's influence varies depending on the situation. For instance, "connection established"
typically has more of an effect on the danger or damage to the system than "registry key changed." It should also be noted
that these features may carry a different weight in combination than when considered separately [13]. Machine learning
algorithms can take such associations into account to identify harmful programs more accurately.
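The threshold heuristic described above can be sketched in a few lines of Python. This is a minimal illustration rather than the logic of any real detection product: the feature names, the weights, and the thresholds are hypothetical values chosen for the example. It contrasts a plain count of triggered features with a weighted score in which an unusual connection contributes more than a registry change.

```python
# Hypothetical feature weights; names and numbers are illustrative assumptions.
WEIGHTS = {
    "odd_connection": 3.0,
    "permission_changed": 2.0,
    "registry_modified": 1.0,
}


def count_score(triggered):
    """Score a program by how many known suspicious features it triggered."""
    return sum(1 for name in triggered if name in WEIGHTS)


def weighted_score(triggered):
    """Score a program by the summed weights of the features it triggered."""
    return sum(WEIGHTS.get(name, 0.0) for name in triggered)


def is_suspicious(triggered, threshold, weighted=True):
    """Flag the program when its score exceeds the given threshold."""
    score = weighted_score(triggered) if weighted else count_score(triggered)
    return score > threshold
```

With a weighted threshold of 2.5, a program that only opens an unusual connection is flagged, while one that only changes a registry key is not, even though both trigger exactly one feature. A fixed table of weights still cannot express how features interact, which is the gap that a learned model is meant to close.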
Machine learning has brought drastic change to numerous industries, including cybersecurity, over the last decade. Among
cybersecurity experts, there is a general belief that AI-powered anti-malware tools can help detect modern malware attacks and
improve scanning engines. Evidence of this belief is the number of studies on malware detection strategies that exploit machine
learning reported in the last few years. The number of research papers released in 2019 is 7720, a 95 percent rise over 2018 and
a 476 percent increase over 2020, according to Google Scholar. This rise in the number of studies is the product of several factors,
including but not limited to the increase in publicly labelled malware feeds, the increase in computing capacity at the same time as
its price dropped, and the evolution of the field of machine learning, which has achieved ground-breaking success in a wide range of
tasks such as computer vision and speech recognition [6]. Depending on the type of analysis, conventional machine learning methods
can be grouped into two main categories: static and dynamic approaches. The primary difference between them is that static methods
extract features from static malware analysis, while dynamic methods extract features from dynamic analysis. A third category,
known as hybrid approaches, may also be considered. Hybrid methods incorporate elements of both static and dynamic analysis. In
addition, neural networks have excelled at learning features from raw inputs in different fields, and their performance in the
malware domain mirrors recent developments in machine learning for cybersecurity [6].
3) Third, the field of machine learning has advanced rapidly over the past few decades, reaching ground-breaking
achievements in terms of accuracy and scalability on a variety of tasks, including computer vision, speech recognition, and
natural language processing. A machine learning workflow is an iterative process that includes acquiring available data,
structuring and preparing the data, building models, verifying the models, and integrating the models into the final product. In
classic machine learning approaches, data preparation includes preprocessing the executable to extract a set of features
that give an abstract view of the software, rather than dealing with the raw malware. The features are then used to train a
model to solve the task at hand. Because malware characteristics are so diverse, it is crucial not only to detect malicious
software but also to distinguish between the various types of malware in order to better comprehend their capabilities. The
primary distinction between machine learning for malware detection and for malware classification is the output returned.
VII. IMPLEMENTATION
In this section, the machine learning methods described above are applied to the analysis of malware. According to [17], the
three essential elements of malware analysis are objectives, features, and machine learning algorithms. In malware detection,
the emphasis is on determining whether a particular sample is harmful. In malware similarity analysis, we determine whether a file
is harmful by looking for attributes that categorize it clearly and aid in malware detection. Once the goal of the analysis has
been decided, the second phase is feature extraction according to the requirement. The final phase is applying a machine learning
method to reach the goal [17]. The primary goal of malware analysis is detection. Malware is detected using a
signature created from earlier samples that were correctly identified as harmful or benign [17]. Identifying whether
a particular sample is malicious is always the first and foremost objective, and the majority of review work focuses on detecting
malware [2], [3], [14].
I developed a command-line tool for malware detection from memory that uses Python code to extract the active processes from the
suspected system's live memory. I used Python to construct a machine learning-based antivirus solution. The machine learns from
training datasets and predicts results based on what it has learned. I first import and read the dataset and then clean it by
removing extra features and null values. The dataset is then divided into training and testing sets. I import the required
libraries to create six distinct classifiers, fit them to the data, and generate predictions.
After training and testing the algorithms, I am able to obtain a highly accurate prediction using a combination of them.
A. Dataset
Two datasets were employed: a training dataset and a test dataset. The number of clean files in this dataset (and in the scale-up
dataset) is substantially larger than the number of malware files because, as already mentioned, my major objective is to achieve
malware identification with as few (if any) false positives as feasible. The dataset contains roughly 138,000 entries of legitimate
and malicious PE headers, with 57 columns serving as features. It comprises 96,724 malicious files and 41,323 legitimate ones. These
datasets are used to train and test my model. Files that generate identical values for the selected feature set were counted only once.
It should be noted that there are far fewer clean combinations (that is, combinations of feature values for clean files) in the
datasets than there are malware-specific combinations. Actual malware samples are required to prepare the dataset. Numerous malware
samples have been gathered, largely through VirusShare and from open-source GitHub sources [2]. The majority of the malware
used for supervised learning is already classified in these sources.
1) Data pre-processing: Data pre-processing is one of the crucial steps prior to feature selection for the model. Rows containing
null values are deleted from the dataset, and columns that contain categorical information are removed. The first columns,
Name and md5, which contain string values, are dropped. The last column, legitimate, holds the value that indicates whether
the binary is malicious or not, and is set aside as the label.
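The pre-processing step above can be sketched with the Python standard library alone. This is an assumption-laden illustration, not the author's script: the column names Name, md5, and legitimate come from the text, while the two feature columns, the sample rows, and the helper names are invented for the example.

```python
import csv
import io

# Columns removed from the feature set: two string identifiers and the label.
NON_FEATURE_COLUMNS = ("Name", "md5", "legitimate")

# Invented sample rows; the second row has a missing SizeOfCode value.
SAMPLE = """Name,md5,SizeOfCode,SizeOfImage,legitimate
a.exe,aaa,4096,16384,1
b.exe,bbb,,8192,0
c.exe,ccc,512,2048,0
"""


def is_complete(row):
    """True when the row has no missing, empty, or surplus fields."""
    if None in row:  # surplus fields are stored under the key None
        return False
    return all(v is not None and v.strip() != "" for v in row.values())


def preprocess(csv_text):
    """Return (X, y): numeric feature dicts and integer labels."""
    X, y = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not is_complete(row):
            continue  # drop rows that contain null values
        y.append(int(row["legitimate"]))
        X.append({name: float(value) for name, value in row.items()
                  if name not in NON_FEATURE_COLUMNS})
    return X, y


X, y = preprocess(SAMPLE)
```

On a full-size dataset a data-frame library would normally do the same work in fewer lines; the standard-library version is used here only to keep the sketch self-contained.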
B. Feature Extraction
There are two main methods for extracting features from a malware binary: static analysis and dynamic analysis, each typically
used on its own. In static analysis, the malware file is examined without being executed; the features are extracted mostly from
the PE header or by disassembling the executable file and examining its assembly code. In dynamic analysis, the executable file is run.
To prepare the dataset, we need to extract all the import functions used by the malware. A small Python program
extracts all the imports from all the PE32 files present in a directory. MD5 hashing is used to prevent data duplication.
The program first creates a file to store the imports used by all the executables of the same category. It then extracts the
static information of each program with the help of the pefile Python library and generates an Excel file that stores the
information for every program. According to [14], most malware uses the Portable Executable (PE) file format, so I
propose a system that classifies malware in the portable executable file format. Some of the features of the
datasets and the visualization of imports used by various malware are listed below:
1) SizeOfOptionalHeader - Gets the size of the optional header, which is required for executable files but not for object files. This
value should be zero for an object file.
2) MajorLinkerVersion - Gets the linker major version number.
3) SizeOfCode - Gets the size of the code (text) section, or the sum of all code sections if there are multiple sections.
4) SizeOfInitializedData - Gets the size of the initialized data section, or the sum of all such sections if there are multiple data
sections.
5) SizeOfUninitializedData - Gets the size of the uninitialized data section (BSS), or the sum of all such sections if there are
multiple BSS sections.
6) AddressOfEntryPoint - Gets the address of the entry point relative to the image base when the PE file is loaded into memory.
7) BaseOfCode - Gets the address of the beginning-of-code section relative to the image base when the image is loaded into
memory.
8) BaseOfData - Gets the address of the beginning-of-data section relative to the image base when the image is loaded into
memory.
9) ImageBase - Gets the preferred address of the first byte of the image when it is loaded into memory.
10) SectionAlignment - Gets the alignment (in bytes) of sections when they are loaded into memory.
11) FileAlignment - Gets the alignment factor (in bytes) that is used to align the raw data of sections in the image file.
12) MajorOperatingSystemVersion - Gets the major version number of the required operating system.
13) MinorOperatingSystemVersion - Gets the minor version number of the required operating system.
14) MajorImageVersion - Gets the major version number of the image.
15) MinorImageVersion - Gets the minor version number of the image.
16) MajorSubsystemVersion - Gets the major version number of the subsystem.
17) MinorSubsystemVersion - Gets the minor version number of the subsystem.
18) SizeOfImage - Gets the size (in bytes) of the image, including all headers, as the image is loaded in memory.
19) SizeOfHeaders - Gets the combined size of an MS DOS stub, PE header, and section headers rounded up to a multiple of
FileAlignment.
20) FileAlignment - Gets the alignment factor (in bytes) that is used to align the raw data of sections in the image file.
21) CheckSum - Gets the image file checksum.
22) Subsystem - Gets the name of the subsystem that is required to run this image.
23) DllCharacteristics - Describes the characteristics of a dynamic link library. This enumeration supports a bitwise combination of
its member values.
24) SizeOfStackReserve - Gets the size of the stack to reserve.
25) SizeOfStackCommit - Gets the size of the stack to commit.
26) SizeOfHeapReserve - Gets the size of the local heap space to reserve.
27) SizeOfHeapCommit - Gets the size of the local heap space to commit.
28) NumberOfRvaAndSizes - Gets the number of data-directory entries in the remainder of the PEHeader. Each describes a location
and size.
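To show where several of the fields listed above sit in a file, the following sketch reads them directly with Python's struct module. It is a stand-in for illustration and not the pefile-based extraction used in this work. The offsets follow the standard PE layout (the e_lfanew pointer at 0x3C, the 4-byte PE signature, the 20-byte COFF file header, then the optional header); the helper build_demo_header is hypothetical and only produces a synthetic, non-executable header so that the example is self-contained.

```python
import struct

COFF_FORMAT = "<HHIIIHH"       # 20-byte COFF file header
OPTIONAL_FORMAT = "<HBBIIIII"  # first 24 bytes of the optional header


def parse_pe_header(data):
    """Read a few header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = struct.unpack_from(COFF_FORMAT, data, pe_offset + 4)
    optional = struct.unpack_from(OPTIONAL_FORMAT, data, pe_offset + 24)
    return {
        "SizeOfOptionalHeader": coff[5],
        "MajorLinkerVersion": optional[1],
        "SizeOfCode": optional[3],
        "SizeOfInitializedData": optional[4],
        "SizeOfUninitializedData": optional[5],
        "AddressOfEntryPoint": optional[6],
        "BaseOfCode": optional[7],
    }


def build_demo_header():
    """Hypothetical helper: synthetic header bytes with made-up values."""
    pe_offset = 0x40
    dos = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", pe_offset)
    coff = struct.pack(COFF_FORMAT, 0x14C, 1, 0, 0, 0, 224, 0x0102)
    optional = struct.pack(OPTIONAL_FORMAT, 0x10B, 14, 0,
                           4096, 2048, 0, 0x1000, 0x1000)
    return dos + b"PE\x00\x00" + coff + optional


fields = parse_pe_header(build_demo_header())
```

Only the leading fields shared by PE32 and PE32+ are read, because the layout diverges after BaseOfCode. A dedicated parser such as pefile is the safer choice for real samples, since malformed headers are common in malware.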
In this step, the data is prepared for the learning process. The first columns, which contain string values, are dropped, along
with the last column, which indicates whether the binary is malicious or legitimate. The remaining data frame is stored in the
variable X, and the values from the legitimate column are stored in the variable y.
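The MD5-based de-duplication mentioned earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions: samples are passed in as (name, bytes) pairs, whereas the actual program reads files from a directory, and the function names are hypothetical.

```python
import hashlib


def md5_hex(data):
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def deduplicate(samples):
    """Keep the first sample seen for each distinct MD5 digest.

    `samples` is an iterable of (name, raw_bytes) pairs; the result is a
    list of (name, digest) pairs in first-seen order.
    """
    seen = set()
    unique = []
    for name, data in samples:
        digest = md5_hex(data)
        if digest in seen:
            continue  # same content already recorded under another name
        seen.add(digest)
        unique.append((name, digest))
    return unique


demo = deduplicate([
    ("a.exe", b"abc"),
    ("copy_of_a.exe", b"abc"),
    ("b.exe", b"xyz"),
])
```

MD5 is used here purely as a content fingerprint for spotting identical files. It is not collision-resistant, so a stronger hash such as SHA-256 would be preferable wherever an adversary could craft colliding samples.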
C. Feature Selection
Feature selection keeps the features that are most closely related to the output, that is, most important for predicting or
classifying a given input. In this step, features are selected using the Extra Trees classifier from the sklearn library; it selects
13 of the 57 features. To perform feature selection, the Extra-Trees classifier follows the conventional top-down approach and
generates a collection of unpruned decision trees. In essence, it splits each tree node while strongly randomizing both the
attribute and the cut-point selection. In the extreme case, it generates entirely randomized trees with structures independent of
the training sample's output values. It differs from conventional tree-based ensemble approaches on two counts: first, it splits
nodes by selecting cut-points completely at random; second, it grows the trees using the entire training sample rather than a
bootstrap replica. The predictions of all the trees are combined by majority vote to determine the final prediction. Here,
information gain is the deciding factor. To score the features, first determine the entropy of the data, then calculate the
total information gain obtained for each feature across the decision trees, and choose the features with the highest scores.
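The entropy and information-gain calculation mentioned above can be illustrated for a single split. The sketch below is not the sklearn Extra Trees implementation; it only shows the criterion itself, with hypothetical function names and toy values. For class proportions $p_c$, entropy is $H = -\sum_c p_c \log_2 p_c$, and the gain of a split is the parent entropy minus the size-weighted entropy of the two child nodes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(values, labels, cut_point):
    """Entropy reduction from splitting one feature at cut_point."""
    left = [lab for val, lab in zip(values, labels) if val <= cut_point]
    right = [lab for val, lab in zip(values, labels) if val > cut_point]
    total = len(labels)
    children = ((len(left) / total) * entropy(left)
                + (len(right) / total) * entropy(right))
    return entropy(labels) - children


# Toy data: the first feature separates the classes, the second does not.
toy_labels = [0, 0, 1, 1]
informative_gain = information_gain([1, 2, 10, 11], toy_labels, 5)
uninformative_gain = information_gain([1, 10, 2, 11], toy_labels, 5)
```

A feature whose splits yield high gain across many randomized trees ends up with a high importance score, which is the basis for keeping the top-ranked features.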
D. Training
After choosing the features, the dataset is divided into two parts, one for training and one for testing. Once the dataset has been
prepared, the model is ready to be trained. There are 13 features in my model. The model is trained using four different algorithms:
Decision Tree, Random Forest, Gradient Boost, and Linear Regression. Random Forest has the best accuracy,
with the lowest false positive and false negative rates.
Algorithms           Accuracy
Random Forest        99.40%
Decision Tree        99.06%
Gradient Boost       98.81%
Linear Regression    52.92%
The Random Forest model and all the selected features are then used for later prediction.
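Accuracy alone does not show the false positive and false negative rates referred to in this section. The sketch below computes all three from true and predicted labels. It is illustrative only: the label convention (1 for malware, 0 for benign) and the toy predictions are assumptions for the example and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the malware class."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(y_true, y_pred, positive=1):
    """Accuracy, false positive rate, and false negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy example: 3 malware and 5 benign samples, one error of each kind.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_rates = rates(toy_true, toy_pred)
```

Reporting the false positive rate separately matters when the classes are imbalanced, because a model can reach high accuracy while still flagging an unacceptable share of clean files.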
E. Classification
This is the last step, in which the class of a sample is predicted based on the training done in the previous step. The algorithm
selected for classification predicts whether the given input is a malicious or benign sample. To test the model on an unseen file,
the characteristics of that file must first be extracted. Python's pefile library (pefile.PE) is used to build the above-mentioned
feature vector, and the already trained ML model predicts the class of the given file.
VIII. CONCLUSION
The machine learning approaches considered were evaluated successfully.
The machine learning method used static features retrieved from both benign and malicious executable files. This
method makes it possible to determine very quickly whether a particular file is harmful. In scenarios with high load, it is
effective to apply this method before signature-based ones to lessen the burden on dynamic analysis of executables. It is
recommended, however, that false negatives in such investigations be zero or almost zero; in an industrial context, the model is
useless if the false-negative rate is high. This program only runs on Windows. Additionally, it is unable to determine the malware's
family. My major goal was to develop a machine learning system that identifies as many malware samples as feasible while
adhering to the rigorous requirement of a zero false positive rate. Despite coming very near to this goal, I still have a non-zero
false positive rate. For this system to be included in a highly competitive commercial product, a variety of deterministic
exception mechanisms must be incorporated. In my opinion, machine learning-based malware detection will be an addition to, not a
replacement for, the malware detection techniques currently employed by anti-virus providers. Any commercial anti-virus
program has memory and speed restrictions, so the most precise methods should be employed. In conclusion, this study needs more
research in a few specific areas. As the number of internet users increases and new malware is
created daily, new strategies are required to combat it.
REFERENCES
[1] Maryam Al-Janabi, Ahmad Mousa Altamimi, "A Comparative Analysis of Machine Learning Techniques for Classification and Detection of Malware", 2021
21st International Arab Conference on Information and Technology (AICT)
[2] Sana Qadir, Basirah Noor, "Applications of Machine Learning in Digital Forensics", 2021 International Conference on Digital Futures and Transformative
Technologies (ICoDT2)
[3] Sanket Agarkar, Soma Ghosh, "Malware Detection & Classification using Machine Learning", 2020 IEEE International Symposium on Sustainable Energy,
Signal Processing and Cyber Security (iSSSC) | 978-1-7281-8880-5/20/$31.00 ©2020 IEEE | DOI: 10.1109/iSSSC50941.2020.935883
[4] Adam Duby, Teryl Taylor, Yanyan Zhuang, "Malware Family Classification via Residual Prefetch Artifacts", 2022 IEEE 19th Annual Consumer
Communications & Networking Conference (CCNC)
[5] Rajvardhan Patil, Wei Deng, "Malware Analysis using Machine Learning and Deep Learning techniques", SoutheastCon 2020 | 978-1-7281-6861-6/20/$31.00
©2020 IEEE | DOI:10.1109/SoutheastCon44009.2020.936826
[6] Ashub Bin Asad, Raiyan Mansur, Safir Zawad, Nahian Evan, Muhammad Iqbal Hossain "Analysis of Malware Prediction Based on Infection Rate Using
Machine Learning Techniques", 2020 IEEE Region 10 Symposium (TENSYMP), 5-7 June 2020, Dhaka, Bangladesh
[7] AV-Test, “The AV-test security report,” 2020. [Online]. Available: https://www.av-test.org/.
[8] R. Sihwail, K. Omar and K. A. Z. Ariffin, “A survey on malware analysis techniques: Static, dynamic, hybrid and memory analysis,” International Journal on
Advanced Science, Engineering and Information Technology, vol. 8, no. 4–2, pp. 1662–1671, 2018.
[9] A. Khanan, S. Abdullah, A.H.M.E.D. Mohamed, A. Mehmood and K. A. Z. Ariffin, “Big data security and privacy concerns: A review,” in Smart Technologies
and Innovation for a Sustainable Future, Advances in Science, Technology & Innovation, Cham: Springer, pp. 55–61, 2019.
[10] Y. Ye, T. Li, D. Adjeroh and S. S. Iyengar, “A survey on malware detection using data mining techniques,” ACM Computing Survey, vol. 50, no. 3, pp. 1–40,
2017.
[11] Mouhammd Al-kasassbeh, Mohammad A. Abbadi, Ahmed M. Al-Bustanji. Robust intelligent malware detection using lightgbm algorithm. International
Journal of Innovative Technology and Exploring Engineering (IJITEE), 9, 04 2020.
[12] Edward Raff and Charles Nicholas. An alternative to ncd for large sequences, lempel-ziv jaccard distance. In Proceedings of the 23rd ACM SIGKDD
international conference on knowledge discovery and data mining, pages 1007–1015, 2017.
[13] Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz. Automatic analysis of malware behavior using machine learning. Journal of Computer
Security, 19(4):639–668, 2011.
[14] Igor Santos, Jaime Devesa, Felix Brezo, Javier Nieves, and Pablo Garcia Bringas. Opem: A static-dynamic approach for machine-learning-based malware
detection. In International Joint Conference CISIS’12-ICEUTE 12-SOCO 12 Special Sessions, pages 271–280. Springer, 2013.
[15] Joshua Saxe and Konstantin Berlin. Deep neural network-based malware detection using two-dimensional binary program features. In 2015 10th International
Conference on Malicious and Unwanted Software (MALWARE), pages 11–20. IEEE, 2015.
[16] Michael Sikorski and Andrew Honig. Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software. No Starch Press, USA, 1st edition,
2012.
[17] Daniele Ucci, Leonardo Aniello, and Roberto Baldoni. Survey of machine learning techniques for malware analysis. Computers & Security, 81:123–147, 2019.
[18] Dolly Uppal, Rakhi Sinha, Vishakha Mehra, and Vinesh Jain. Malware detection and classification based on extraction of api sequences. In 2014 International
conference on advances in computing, communications and informatics (ICACCI), pages 2337–2342. IEEE, 2014.
[19] Yanfang Ye, Tao Li, Donald Adjeroh, and S Sitharama Iyengar. A survey on malware detection using data mining techniques. ACM Computing Surveys
(CSUR), 50(3):1–40, 2017.