
OS-Independent Malware Detection Applying Machine Learning and Computer Vision in Memory Forensics

The paper presents a novel OS-independent malware detection system that integrates memory forensics with machine learning and computer vision techniques. It introduces a MemGen system for generating datasets of benign and malicious memory dumps, enabling effective training of machine learning models. The proposed system achieves high accuracy in detecting malware, demonstrating its applicability across various operating systems without requiring specific knowledge of their profiles.


2021 17th International Conference on Computational Intelligence and Security (CIS)

OS-independent Malware Detection: Applying Machine Learning and Computer Vision in Memory Forensics

Anh-Duy Tran, Ngoc-Huy Vo, Quang-Khai Tran, Hai-Dang Nguyen, Minh-Triet Tran
Faculty of Information Technology, University of Science, VNU-HCM
Vietnam National University, Ho Chi Minh City, Vietnam
{taduy,nhdang94,tmtriet}@hcmus.edu.vn, {1712504,1712514}@student.hcmus.edu.vn

978-1-6654-9489-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/CIS54983.2021.00134

Abstract—Malware detection is an essential task in protecting computing systems, and it is crucial to detect potentially malicious code in memory. We therefore use a memory forensics approach to build an OS-independent malware detection system. To accomplish this goal, we integrate fundamental machine learning techniques with memory forensics to build a classification tool and apply computer vision to preprocess the data. Our system needs a large dataset of both benign and malicious memory dumps to build a machine learning model. Therefore, we also build the MemGen system, which simulates arbitrary usage scenarios on a computer and dumps benign or malicious memory snapshots. We use MemGen to create a new dataset of 2750 samples covering 8 recent malware types together with benign memory dumps. Applying SVM with an RBF kernel, Random Forest, and Decision Tree to the dataset generated by MemGen yields accuracies of 93.42%, 93.75%, and 92.83%, respectively. Moreover, we test the trained models on unknown malware and obtain average accuracies of 87.44%, 84.78%, and 80% for Random Forest, Decision Tree, and SVM, respectively, on a dataset of 900 malware samples from 3 types of malware: OskiStealer, RedLineStealer, and SnakeKeylogger.

Keywords-memory forensics; malware detection; machine learning; computer vision

I. INTRODUCTION

Memory forensics is the process of extracting evidence or artifacts from the memory image (or memory dump) of a compromised computer. Memory images contain critical evidence that serves as important clues when analyzing the compromised system. Extracting this evidence is the key phase in digital forensics. A memory image captures the current state of the system. It contains useful information such as running processes, network connections, and open files, which helps identify illegal behavior and reconstruct the crime scenario [1], [2]. The main memory in most systems is volatile, which means it loses all its data when the machine is switched off. Therefore, the “pull the plug” tactic does not work in this situation [3]. Instead, memory images (snapshots or dumps) taken before shutting down the system help tackle this problem. Investigating those snapshots at a later stage plays a primary role in capturing criminal activities. We can identify instances of data structures [4] in the memory images, inspect malicious processes [5], check kernel integrity [6], and perform virtual machine introspection [7]. There are two main phases in memory forensics: memory acquisition and memory analysis [1]. In the memory acquisition phase, hardware or software approaches are used to capture a snapshot of the memory of the running system, forming a memory image. Then, in the memory analysis phase, many techniques are applied to extract the desired artifacts from the memory image.

Most current forensics and introspection tools rely on specific knowledge, such as the kernel source and system information, to reconstruct a profile that describes the structure of a memory image. Based on that knowledge, forensics tools can analyze the memory quickly and precisely. However, in many cases that information is unavailable, for example for closed-source OSs. Moreover, with the rapid development of IoT devices, Android smartphones, and the upcoming Fuchsia, many OS versions exist. Their memory images also play an essential role in modern digital forensics for investigating compromised devices and smartphones. In these cases, sophisticated reverse engineering efforts are needed, and it takes several months to create the development and maintenance tools for each OS or device.

One of the goals of memory forensics is to inspect malicious processes, which means we can leverage it to detect malware in working or compromised machines. Detecting malware is an essential task in cybersecurity, and there are many approaches to it, such as signature-based or heuristic-based detection. However, almost all of these methods cannot detect new kinds of malware or do not work on every operating system. Therefore, it is crucial to build a system that can recognize the behavior and presence of malware on the computer in the current context, and especially one that can adapt to recognize new malware or variants of existing malware.

Today, malware is developed using many sophisticated techniques, such as encryption or packing to hide its presence on the computer, or even fileless execution to reduce the need to store malware on disk. No matter what technique is used, all its behavior and activities will be

visible and exposed in memory when the malware is executed. Therefore, dealing with malware through memory forensics becomes more feasible and practical. However, the biggest challenge when working with memory forensics is building a profile for each operating system whose memory we extract. By combining machine learning and memory forensics, we can build a system that is capable of malware detection and bypasses the need for profiles. Moreover, that system can be applied to many Windows OSs and can detect new kinds of malware or their variants.

In this paper, we propose a novel solution for building an OS-independent malware detector. Our main contributions are as follows:
• We develop the MemGen system, which can simulate virtual environments with provided activities, run the newest malware samples inside virtual machines, and then capture memory for both benign and malicious machines (Section III). Our process memory dump dataset can be used for further memory analysis research, or researchers can build their own datasets from it.
• We preprocess data by converting memory dumps to PNG images using computer vision techniques (Section IV).
• We build machine learning models for detecting malware in memory without knowledge of the OS and conduct our experiments (Section V).

II. RELATED WORK

Since the Digital Forensics Research Workshop (DFRWS, 2008), memory forensics has been an attractive part of the digital investigation field. Since then, many research works have focused on building profiles and analyzing the memory dumps of both Windows and Linux, mostly Windows, because it has far fewer variant versions than Linux. A memory profile is considered the skeleton of a memory dump, and tools can use profiles to map or segment binary data into the desired objects.

Much research has been conducted on malware detection and analysis using memory forensics. Mosli et al. [5] extract registry activity, imported libraries, and API function calls from memory dumps to build a dataset for a machine learning model that detects unknown malware. Case et al. [8] focus on detecting keystroke loggers by upgrading HookTracer, a Volatility plugin. Bajpai and Enbody [9] retrieve cryptographic keys from ransomware process memory during encryption to help recover encrypted data. Hua and Zhang [10] utilize the virtual machine introspection mechanism to inspect the low-level state of the targeted virtual machine, then reconstruct the guest OS data structures to identify missing critical processes and hidden target processes. In those approaches, researchers need strong knowledge of how a process is loaded into memory and of the structure of the operating system and the virtual machine, and the approaches are hard to apply to other operating systems. Bozkir et al. [11] and Petrik et al. [12] proposed systems that integrate machine learning with memory forensics for malware detection. This approach can reduce the need for computer security knowledge, but it did not create a framework for producing datasets.

III. MEMORY DUMP GENERATION SYSTEM - MEMGEN

Our approach to building a malware detection system using memory forensics leverages the power of machine learning techniques. The very first problem for any machine learning methodology is the dataset: no machine learning system can be built without a dataset for training. Moreover, the dataset in memory forensics is the memory dump (memory image or memory snapshot) itself, which is extremely large on modern computers. The memory dump also captures the randomness of the computer's state. Therefore, we propose a memory dump generation system, MemGen. MemGen comprises several parts that help simulate a virtual machine with a specific operating system, run typical applications inside it, deploy targeted malware, and capture its memory as an image file. Hence, researchers can use MemGen to create a dataset with the newest malware, or to generate memory dumps for any scenario they want to simulate in the virtual machine for further research in the memory forensics area.

To build a malware detection system, we have to create two separate datasets: the first is captured from the memory of benign machines, and the second is captured from the memory of compromised machines with malware running inside. Both benign and compromised machines have to simulate typical applications and run common activities that normal users would start or perform. Figure 1 depicts the processing flow of the MemGen system, which performs the simulation in 3 main steps:
• Step 1: the Host machine creates a list of tasks that the virtual machine will run. Those tasks consist of opening common applications and accessing specific websites. Optionally, the Host machine can also pick a random malware sample that belongs to one targeted malware family if the user wants to build a dataset for compromised machines. The task list is stored in storage shared between the Host and the Guest machines.
• Step 2: the Guest machine (virtual machine) is automatically started by the Host machine and runs a script that reads the task list and performs the tasks. It can also run the malware if necessary.
• Step 3: the Guest machine uses ProcDump to dump the memory of a specific running process. MemGen supports three memory extraction modes: extract the whole memory, extract the memory of the running malicious process, and extract the memory of all running processes in the virtual machine.
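Steps 1 and 3 above can be illustrated with a minimal Python sketch. It is not the MemGen source code: the JSON layout, the file name tasks.json, and the function names are assumptions made here for illustration. The only external detail relied on is that ProcDump accepts -ma to write a dump containing all process memory and -accepteula to skip the licence prompt. The dump command is only constructed, never executed.

```python
import json
import random
from pathlib import Path


def build_task_list(apps, urls, malware_samples=None, seed=None):
    """Step 1 (host side): choose benign activities and, optionally, one sample.

    The dictionary layout is hypothetical; the paper does not specify a format.
    """
    rng = random.Random(seed)
    tasks = [{"action": "open_app", "target": app} for app in apps]
    tasks += [{"action": "visit_url", "target": url} for url in urls]
    rng.shuffle(tasks)  # vary the order of benign activities between runs
    if malware_samples:  # only for the compromised-machine dataset
        sample = rng.choice(malware_samples)
        tasks.append({"action": "run_malware", "target": sample})
    return tasks


def write_task_list(tasks, shared_dir):
    """Store the task list in storage visible to both host and guest."""
    path = Path(shared_dir) / "tasks.json"  # hypothetical file name
    path.write_text(json.dumps(tasks, indent=2))
    return path


def procdump_command(pid, out_file):
    """Step 3 (guest side): build a ProcDump call for one process.

    -ma requests a dump with all process memory; -accepteula avoids the
    interactive licence dialog. The list is returned, not run.
    """
    return ["procdump.exe", "-accepteula", "-ma", str(pid), str(out_file)]
```

A guest-side script along these lines would read tasks.json, perform each action, and then pass the result of procdump_command to a process launcher once the target process is running. Whole-memory capture, the first of the three extraction modes, is outside what this sketch covers.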

Figure 1. The workflow of the memory dump generation system - MemGen

To configure MemGen, we use a host machine running Ubuntu 20.04 and a guest machine running Windows 10 version 20H2. In addition, we use an external hard drive to share the to-do activity list, the malware code to execute, and the list of malware, and to store the memory dumps after execution. We can create a memory dump sample of a malware process executing in RAM within approximately 150 seconds. We create a 296 GB dataset including 2750 samples of 8 different malware types (Loki-Bot, Agent Tesla, FormBook, StormKitty, OskiStealer, RedLineStealer, SnakeKeylogger), collected from MalwareBazaar (https://bazaar.abuse.ch/), and benign memory dumps.

IV. PRE-PROCESSING

To create input data for the machine learning model, we design a way to extract features from the raw data, which in this scenario is the memory dump. The features capture the critical meaning of the data in a specific domain. There are two main approaches to the feature extraction phase. If the way the data is generated is known in advance, we can simply apply our domain knowledge to extract features from the data. In digital memory forensics, an expert can analyze the memory dumps to find where artifacts should be retrieved and can understand how kernel objects, processes, network connections, malware, etc. are scattered in the memory dump of the compromised system. The alternative is to feed the data in a simplistic format to a machine learning model and allow the model to learn. In our approach, we leverage our knowledge of computer vision and computer security to reduce the dataset's size and extract its features.

Figure 2 illustrates the overall workflow of the malware detection system. In the preprocessing phase, we utilize the bin2png tool, published at https://github.com/ESultanik/bin2png, to first convert the binary data (memory dumps) of both benign and compromised snapshots to PNG files without losing any data from the binary files. Visualizing the binary data helps bridge the gap between computer vision and the byte sequences of executable files. Each pixel of an output image is composed of three bytes of the given binary file, which represent the values of the three color channels: red, green, and blue. In this research, we leverage RGB encoding to convert each memory dump into a 300x300-pixel image, which helps us visualize more information in every row of the output image. Hence, we can easily recognize the similarity between two images of processes produced by one malware or its variants. For example, the Loki and Tesla images differ from each other because they represent the processes of two different malware, whereas the two images of a single malware's processes in Figure 3 are nearly indistinguishable. We also modify the source code of bin2png to add a function that resizes the output image for use in training the model.

We extract two types of features, HOG and GIST, and feed them into the model during the training process. HOG (histogram of oriented gradients) is a feature descriptor used in computer vision and image processing to identify objects. The essence of the HOG method is to use information about the distribution of gradient intensities or edge directions to describe local features in the image. The GIST descriptor is a low-dimensional representation of an image that contains enough information to identify the scene in the image. Global GIST descriptors allow a minimal-size representation of an image. In our work, we use the features extracted with both GIST and HOG to build the feature vector.

Finally, we use UMAP to reduce the dimensionality of the input data. Due to the structural similarity between benign software and malware samples, it is difficult to separate them in the 2724-dimensional feature space (960 dimensions from the HOG feature descriptor and 1764 dimensions from the GIST feature descriptor). We use UMAP (Uniform Manifold Approximation and Projection) dimension reduction, with its beneficial supervised metric learning property, to overcome this limitation. This improves the model's performance by carrying out both training and prediction in a lower-dimensional feature space that is more discriminative and separable than the original. Moreover, it also helps us visualize the distribution of the data.

V. OS-INDEPENDENT MALWARE DETECTOR

To build an OS-independent malware detector, we use the processed memory dump data and traditional machine learning models: Decision Tree, Random Forest, and Support Vector Machine (SVM). Python libraries and packages

such as scikit-learn, UMAP, GIST, and HOG are utilized for the implementation. We set up a training machine with a Linux OS kernel, an Intel Core i7-7500U, and 8 GB of RAM. The dataset used for building the classification models has 6134 images in PNG format. We take approximately 80% of the data as the training set and 20% as the test set. Then, all the PNG images in both the training and test sets are passed to the feature vector calculation (2724 dimensions for each image) using the GIST and HOG descriptors, and each image is labeled Malware if it is produced by malicious software and Benign if it is produced by benign software.

Figure 2. The workflow of OS-independent Malware Detector

Figure 3. The converted images of process memory dumps of Loki (left), Tesla 1 (middle), and Tesla 2 (right) malware

Next, the training set is fed into UMAP to perform supervised dimension reduction. UMAP's supervised metric learning feature creates a transformer model that has learned the data with the corresponding labels in the training set. The feature vectors in the training set are reduced from 2724 components to 5 components per vector. This converted training set is used to train the SVM with an RBF kernel, Random Forest, and Decision Tree algorithms.

In the next stage, we test our trained models on the test dataset. First, we pass the vectors in the test set through the learned transformer model created by UMAP previously, which maps the test data into the learned structure to perform dimension reduction regardless of the labels. After this process, the vectors in the test set have the same size as those in the training set. Finally, the test set is used to validate the effectiveness of the classification models built with the training set. Experimental results are described in Table I and Table II.

Table I
ACCURACY, PRECISION, RECALL AND F1-SCORES FOR EACH MACHINE LEARNING MODEL

Set    ML method      Accuracy  Precision  Recall  F1-Score
Train  SVM (RBF)      99.32%    0.99       0.99    0.99
Train  Random Forest  99.43%    0.99       0.99    0.99
Train  Decision Tree  100%      1          1       1
Test   SVM (RBF)      93.42%    0.92       0.88    0.90
Test   Random Forest  93.75%    0.94       0.88    0.90
Test   Decision Tree  92.83%    0.93       0.87    0.89

Table II
PRECISION, RECALL AND F1-SCORES OF BENIGN AND MALICIOUS PROCESS PREDICTION

ML method      Class    Precision  Recall  F1-Score  Support
SVM (RBF)      Malware  0.94       0.98    0.96      935
SVM (RBF)      Benign   0.91       0.78    0.84      265
Random Forest  Malware  0.94       0.99    0.96      935
Random Forest  Benign   0.94       0.77    0.84      265
Decision Tree  Malware  0.93       0.98    0.96      935
Decision Tree  Benign   0.92       0.75    0.83      265

In general, the classification accuracy of the algorithms is relatively good. The Decision Tree algorithm gives the best classification results on the training set, with an accuracy of up to 100%. Meanwhile, the algorithm that gives the best results on the test set is Random Forest, with an accuracy of 93.75%. In summary, all algorithms give relatively high results on the test set, with accuracy rates of at least 92.83%.

Table II shows that the coverage (recall) for malware is relatively high, and the highest is the Random Forest algorithm with coverage up to 99%. However, the coverage for benign software is 0.78, 0.77, and 0.75 for SVM with an RBF kernel, Random Forest, and Decision Tree, respectively. These numbers show that the detector easily recognizes malicious software but may also mistake benign software for malicious software.

In fact, new types of malware or a variant of previously

discovered malware evolve very quickly and are created every day. Therefore, timely detection of zero-day attacks is becoming more and more important. For this reason, we conduct an experiment to see whether the models can predict new types of malware on which they have not been trained. We test on three new types of malware, namely OskiStealer, RedLineStealer, and SnakeKeylogger. For each type of malware, we create 300 data samples for testing. The test results are presented in Table III.

Table III
TESTING ON UNTRAINED MALWARE

ML method      Malware           Accuracy  True  False  Total
SVM (RBF)      OskiStealer       90%       270   30     300
SVM (RBF)      RedLineStealer    86.67%    260   40     300
SVM (RBF)      SnakeKeylogger    63.33%    190   110    300
SVM (RBF)      Average Accuracy  80%
Random Forest  OskiStealer       90.33%    271   29     300
Random Forest  RedLineStealer    94%       282   18     300
Random Forest  SnakeKeylogger    78%       234   66     300
Random Forest  Average Accuracy  87.44%
Decision Tree  OskiStealer       90.67%    272   28     300
Decision Tree  RedLineStealer    90.67%    272   28     300
Decision Tree  SnakeKeylogger    73%       219   81     300
Decision Tree  Average Accuracy  84.78%

VI. CONCLUSION

In this paper, we present, to the best of our knowledge, an approach applying memory forensics, several data preprocessing techniques, and machine learning algorithms to build an OS-independent malware detector. We convert memory dumps to PNG images using RGB-based encoding. Then, we extract two types of descriptors, namely GIST and HOG, from the images and apply traditional machine learning techniques to build an OS-independent malware detector that classifies processes as malicious or benign and possibly finds unknown malware. We employ the state-of-the-art dimension reduction technique UMAP to improve the robustness of the classifier. In addition, we build a system called MemGen and utilize it to generate the dataset used in our experiments. MemGen can generate datasets from either the entire memory or the memory dump of a specific process. In particular, we randomly generate a list of typical applications that run on the system alongside the newest targeted malware to simulate the realistic behavior of a real user in possible scenarios. Moreover, we consider the memory dumps generated by MemGen a driving force and an essential resource for our future research in the memory forensics field. The dataset and source code will be made public.

We test the trained models on unknown malware and obtain quite satisfactory results. Specifically, the Random Forest algorithm achieves the highest average accuracy, 87.44%, with accuracies of 90.33%, 94%, and 78% for the three types of malware OskiStealer, RedLineStealer, and SnakeKeylogger, respectively.

ACKNOWLEDGMENT

We would like to thank Dung Nguyen for his valuable support. This research is funded by the University of Science, VNU-HCM, under grant number CNTT 2020-01.

REFERENCES

[1] M. H. Ligh, A. Case, J. Levy, and A. Walters, The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. John Wiley & Sons, 2014.

[2] A. Case and G. G. Richard III, “Memory forensics: The path forward,” Digital Investigation, vol. 20, pp. 23–33, 2017.

[3] G. G. Richard III and V. Roussev, “Next-generation digital forensics,” Communications of the ACM, vol. 49, no. 2, pp. 76–80, 2006.

[4] Z. Lin, J. Rhee, X. Zhang, D. Xu, and X. Jiang, “SigGraph: Brute force scanning of kernel data structure instances using graph-based signatures,” in NDSS, 2011.

[5] R. Mosli, R. Li, B. Yuan, and Y. Pan, “Automated malware detection using artifacts in forensic memory images,” in 2016 IEEE Symposium on Technologies for Homeland Security (HST). IEEE, 2016, pp. 1–6.

[6] A. Srivastava, I. Erete, and J. Giffin, “Kernel data integrity protection via memory access control,” Georgia Institute of Technology, Tech. Rep., 2009.

[7] C.-W. Tien, J.-W. Liao, S.-C. Chang, and S.-Y. Kuo, “Memory forensics using virtual machine introspection for malware analysis,” in 2017 IEEE Conference on Dependable and Secure Computing. IEEE, 2017, pp. 518–519.

[8] A. Case, R. D. Maggio, M. Firoz-Ul-Amin, M. M. Jalalzai, A. Ali-Gombe, M. Sun, and G. G. Richard III, “Hooktracer: Automatic detection and analysis of keystroke loggers using memory forensics,” Computers & Security, vol. 96, p. 101872, 2020.

[9] P. Bajpai and R. Enbody, “Memory forensics against ransomware,” in 2020 International Conference on Cyber Security and Protection of Digital Services (Cyber Security). IEEE, 2020, pp. 1–8.

[10] Q. Hua and Y. Zhang, “Detecting malware and rootkit via memory forensics,” in 2015 International Conference on Computer Science and Mechanical Automation (CSMA). IEEE, 2015, pp. 92–96.

[11] A. S. Bozkir, E. Tahillioglu, M. Aydos, and I. Kara, “Catch them alive: A malware detection approach through memory forensics, manifold learning and computer vision,” Computers & Security, vol. 103, p. 102166, 2021.

[12] R. Petrik, B. Arik, and J. M. Smith, “Towards architecture and OS-independent malware detection via memory forensics,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 2267–2269.
