OS-Independent Malware Detection Applying Machine Learning and Computer Vision in Memory Forensics
Anh-Duy Tran, Ngoc-Huy Vo, Quang-Khai Tran, Hai-Dang Nguyen, Minh-Triet Tran
Faculty of Information Technology, University of Science, VNU-HCM
2021 17th International Conference on Computational Intelligence and Security (CIS) | 978-1-6654-9489-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/CIS54983.2021.00134
Abstract: Malware detection is an essential task in protecting computing systems, and it is crucial to detect potentially malicious code in memory. We therefore use a memory forensics approach to build an OS-independent malware detection system. To accomplish this goal, we integrate fundamental machine learning techniques with memory forensics to build a classification tool, and we apply computer vision to preprocess the data. Training a machine learning model requires a large dataset of both benign and malicious memory dumps. Therefore, we also build a system called MemGen that simulates usage scenarios on a computer and dumps benign or malicious memory snapshots. We use MemGen to create a new dataset of 2750 samples covering 8 different recent malware types and benign memory dumps. Applying SVM with an RBF kernel, Random Forest, and Decision Tree to the dataset generated by MemGen yields accuracies of 93.42%, 93.75%, and 92.83%, respectively. Moreover, we test the trained models on unknown malware and obtain average accuracies of 87.44%, 84.78%, and 80% for Random Forest, Decision Tree, and SVM, respectively, on a dataset of 900 malware samples from 3 types of malware: OskiStealer, RedLineStealer, and SnakeKeylogger.

Keywords: memory forensics; malware detection; machine learning; computer vision

I. INTRODUCTION

Memory forensics is the process of extracting evidence or artifacts from the memory image (or memory dump) of a compromised computer. Memory images contain critical evidence that provides important clues when analyzing the compromised system, and extracting that evidence is the key phase in digital forensics. A memory image captures the current state of the system. It contains useful information such as running processes, network connections, and open files, which helps identify illegal behavior and reconstruct the crime scenario [1], [2]. The main memory in most systems is volatile, which means it loses all its data when the machine is switched off. Therefore, the “pull the plug” tactic does not work in this situation [3]. Instead, memory images (snapshots or dumps) taken before the system is shut down help tackle this problem. Investigating those snapshots at a later stage plays a primary role in uncovering criminal activities. We can identify instances of data structures [4] in the memory images, inspect malicious processes [5], check kernel integrity [6], and perform virtual machine introspection [7]. There are two main phases in memory forensics: memory acquisition and memory analysis [1]. In the memory acquisition phase, hardware or software approaches are used to capture a snapshot of the memory of the running system, forming a memory image. Then, in the memory analysis phase, various techniques are applied to extract the desired artifacts from the memory image.

Most current forensics and introspection tools rely on specific knowledge, such as the kernel source and system information, to reconstruct a profile that describes the structure of a memory image. Based on that knowledge, forensics tools can analyze the memory quickly and precisely. However, in many cases that information is unavailable, for example for closed-source OSs. Moreover, with the rapid development of IoT devices, Android smartphones, and the upcoming Fuchsia, many versions of OS exist. Their memory images also play an essential role in modern digital forensics for investigating compromised devices and smartphones. In these cases, sophisticated reverse engineering efforts are needed, and it takes several months to develop and maintain tools for each OS or device.

One of the goals of memory forensics is to inspect malicious processes, which means we can leverage it to detect malware on working or compromised machines. Detecting malware is an essential task in cybersecurity, and there are many approaches to it, such as signature-based or heuristic-based detection. However, almost all of these methods cannot detect new kinds of malware or do not work on every operating system. Therefore, it is crucial to build a system that can recognize the behavior and presence of malware on a computer in the current context and, in particular, can adapt to recognize new malware or variants of existing malware.

Today, malware is developed using many sophisticated techniques, such as encryption or packing to hide its presence on the computer, or fileless execution to avoid storing the malware on disk. No matter which technique is used, all its behavior and activities will be
uthorized licensed use limited to: AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING. Downloaded on July 26,2023 at 18:57:49 UTC from IEEE Xplore. Restrictions appl
Figure 1. The workflow of the memory dump generation system - MemGen

To configure MemGen, we use a host machine running Ubuntu 20.04 and install a guest machine running Windows 10 version 20H2. In addition, we use an external hard drive to share the to-do activity list, the malware code to execute, and the list of malware, and to store the memory dumps after execution. We can create a memory dump sample of a malware process executing in RAM within approximately 150 seconds. We create a 296 GB dataset that includes 2750 samples of 8 different malware types: Loki-Bot, Agent Tesla, FormBook, StormKitty, OskiStealer, RedLineStealer, SnakeKeylogger, which are collected from MalwareBazaar (https://bazaar.abuse.ch/), and benign memory dumps.

IV. PRE-PROCESSING

To create input data for the machine learning model, we design a way to extract features from the raw data, which is the memory dump in this scenario. The features capture the critical meaning of the data in a specific domain. There are two main approaches to feature extraction. If the way the data is generated is known in advance, we can simply apply domain knowledge to extract features from the data. In digital memory forensics, an expert who analyzes the memory dumps knows where artifacts should be retrieved and understands how kernel objects, processes, network connections, malware, etc., are scattered in the memory dump of the compromised system. Alternatively, we can feed the data in a simple format to a machine learning model and let the model learn. In our approach, we leverage our knowledge of computer vision and computer security to reduce the dataset’s size and extract its features.

Figure 2 illustrates the overall workflow of the malware detection system. In the preprocessing phase, we use the bin2png tool, published at https://github.com/ESultanik/bin2png, to first convert the binary data (memory dumps) of both benign and compromised snapshots to PNG files without losing any of the binary data. Visualizing the binary data helps bridge the gap between computer vision and the byte sequences of executable files. Each pixel of an output image is composed of three bytes of the given binary file, which represent the values of the red, green, and blue color channels. In this research, we leverage the RGB encoding to convert each memory dump into a 300x300 pixel image, which lets us visualize more information in every row of the output image. Hence, we can readily recognize the similarity between two images of processes produced by one malware or its variants. For example, the contents of the Loki and Tesla images differ from each other because they represent the processes of two different malware, whereas two nearly indistinguishable images visualize the processes of one malware in Figure 3. We also modify the source code of bin2png to add a function that resizes the output image for use in model training.

We extract two types of features, HOG and GIST, and feed them into the model during training. HOG (histogram of oriented gradients) is a feature descriptor used in computer vision and image processing to identify objects. The essence of the HOG method is to use the distribution of gradient intensities or edge directions to describe local features in the image. The GIST descriptor is a low-dimensional representation of an image that contains enough information to identify the scene. Global GIST descriptors allow a very compact representation of an image. In our work, we use the features extracted by both GIST and HOG to build the feature vector.

Finally, we use UMAP (Uniform Manifold Approximation and Projection) to reduce the dimensionality of the input data. Due to the structural similarity between the benign software and the malware samples, it is difficult to separate them in the 2724-dimensional feature space (960 dimensions from the HOG descriptor and 1764 dimensions from the GIST descriptor). We use UMAP dimension reduction with its supervised metric learning property to overcome this limitation. This improves the model’s performance, because both training and prediction take place in a lower-dimensional feature space that is more discriminative and separable than the original. Moreover, it also helps us visualize the distribution of the data.

V. OS-INDEPENDENT MALWARE DETECTOR

To build an OS-independent malware detector, we use the processed memory dump data and traditional machine learning models: Decision Tree, Random Forest, and Support Vector Machine (SVM). Python’s libraries and packages
Figure 2. The workflow of OS-independent Malware Detector
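The RGB encoding used in the preprocessing phase, in which every three consecutive bytes of the dump become one pixel, can be illustrated with a minimal sketch. This is not the bin2png source code. The function names, the zero-padding of incomplete pixels and rows, and the explicit `width` parameter are assumptions made only for illustration, and the resizing step to 300x300 pixels is omitted.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def bytes_to_rgb_rows(data: bytes, width: int) -> List[List[Pixel]]:
    """Pack consecutive bytes into (R, G, B) pixels, three bytes per pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Assumption: zero-pad so the byte count is a multiple of three.
    padded = data + b"\x00" * (-len(data) % 3)
    pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
    # Assumption: fill the last row with black pixels so all rows are equal.
    pixels += [(0, 0, 0)] * (-len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def rgb_rows_to_bytes(rows: List[List[Pixel]], original_length: int) -> bytes:
    """Invert the packing; the original length removes the padding."""
    flat = bytes(channel for row in rows for pixel in row for channel in pixel)
    return flat[:original_length]


dump = bytes(range(10))            # stand-in for a memory dump
image = bytes_to_rgb_rows(dump, width=2)
restored = rgb_rows_to_bytes(image, len(dump))
```

Because the packing only regroups bytes, the original dump can be recovered from the pixel rows as long as its length is known, which matches the statement that the conversion does not reduce the data in the binary file.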
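The idea behind the HOG descriptor, describing local image structure by the distribution of gradient orientations, can be sketched in a few lines. The sketch below is a simplified illustration rather than the implementation used in our experiments: it works on a grayscale grid, normalizes each cell on its own, and omits the overlapping block normalization of the full HOG method. The function name, the 8-pixel cell size, and the 9 orientation bins are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def cell_orientation_histograms(
    gray: Sequence[Sequence[float]], cell: int = 8, bins: int = 9
) -> List[float]:
    """Return concatenated, L2-normalized orientation histograms per cell."""
    height, width = len(gray), len(gray[0])
    cells_y, cells_x = height // cell, width // cell
    hist = [[[0.0] * bins for _ in range(cells_x)] for _ in range(cells_y)]
    bin_width = 180.0 / bins  # unsigned orientations in [0, 180)
    for y in range(cells_y * cell):
        for x in range(cells_x * cell):
            # Central differences, clamped at the image border.
            gx = gray[y][min(x + 1, width - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, height - 1)][x] - gray[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle // bin_width), bins - 1)
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[y // cell][x // cell][b] += magnitude
    features: List[float] = []
    for row in hist:
        for h in row:
            norm = math.sqrt(sum(v * v for v in h)) or 1.0
            features.extend(v / norm for v in h)
    return features


# A 16x16 image with a vertical edge: all gradient energy is horizontal.
edge = [[0] * 8 + [255] * 8 for _ in range(16)]
edge_features = cell_orientation_histograms(edge)
flat_features = cell_orientation_histograms([[7] * 16 for _ in range(16)])
```

In this toy input every cell touches the vertical edge, so each cell histogram concentrates in the first orientation bin, while a uniform image yields an all-zero vector. Images of processes from the same malware would be expected to produce similar histograms, which is what makes such descriptors usable as classifier input.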
discovered malware evolve very quickly and are created every day. Therefore, timely detection of zero-day attacks is becoming increasingly important. For this reason, we conduct an experiment to see whether the models can detect new types of malware that they were not trained on. We test on three new types of malware, namely OskiStealer, RedLineStealer, and SnakeKeylogger. For each type of malware, we create 300 data samples for testing. The test results are presented in Table III.

Table III
TESTING ON UNTRAINED MALWARE

ML method       Malware            Accuracy   True   False   Total
SVM (RBF)       OskiStealer        90 %       270    30      300
                RedLineStealer     86.67 %    260    40      300
                SnakeKeylogger     63.33 %    190    110     300
                Average Accuracy   80 %
Random Forest   OskiStealer        90.33 %    271    29      300
                RedLineStealer     94 %       282    18      300
                SnakeKeylogger     78 %       234    66      300
                Average Accuracy   87.44 %
Decision Tree   OskiStealer        90.67 %    272    28      300
                RedLineStealer     90.67 %    272    28      300
                SnakeKeylogger     73 %       219    81      300
                Average Accuracy   84.78 %

VI. CONCLUSION

In this paper, we present an approach that applies memory forensics, several data preprocessing techniques, and machine learning algorithms to build an OS-independent malware detector. We fundamentally convert memory dumps to PNG images using RGB-based encoding. Then, we extract two types of descriptors, GIST and HOG, from the images and apply traditional machine learning techniques to build an OS-independent malware detector that classifies processes as malicious or benign and can possibly find unknown malware. We employ the state-of-the-art dimension reduction technique UMAP to improve the robustness of the classifier. In addition, we build a system called MemGen and use it to generate the dataset used in our experiments. MemGen can generate datasets from either an entire memory or the memory dump of a specific process. In particular, we randomly generate a list of typical applications that run on the system beside the newest targeted malware to simulate realistic usage scenarios of a real user. Moreover, we consider the memory dumps generated by MemGen a driving force and an essential resource for our future research in the memory forensics field. The dataset and source code will be made public.

As a result, we test the trained models with unknown malware and obtain quite satisfactory results. Specifically, the Random Forest algorithm achieves the highest average accuracy, 87.44%, with per-type accuracies of 90.33%, 94%, and 78% for OskiStealer, RedLineStealer, and SnakeKeylogger, respectively.

ACKNOWLEDGMENT

We would like to thank Dung Nguyen for his valuable support. This research is funded by the University of Science, VNU-HCM, under grant number CNTT 2020-01.

REFERENCES

[1] M. H. Ligh, A. Case, J. Levy, and A. Walters, The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. John Wiley & Sons, 2014.

[2] A. Case and G. G. Richard III, “Memory forensics: The path forward,” Digital Investigation, vol. 20, pp. 23–33, 2017.

[4] Z. Lin, J. Rhee, X. Zhang, D. Xu, and X. Jiang, “SigGraph: Brute force scanning of kernel data structure instances using graph-based signatures,” in NDSS, 2011.

[5] R. Mosli, R. Li, B. Yuan, and Y. Pan, “Automated malware detection using artifacts in forensic memory images,” in 2016 IEEE Symposium on Technologies for Homeland Security (HST). IEEE, 2016, pp. 1–6.

[6] A. Srivastava, I. Erete, and J. Giffin, “Kernel data integrity protection via memory access control,” Georgia Institute of Technology, Tech. Rep., 2009.

[7] C.-W. Tien, J.-W. Liao, S.-C. Chang, and S.-Y. Kuo, “Memory forensics using virtual machine introspection for malware analysis,” in 2017 IEEE Conference on Dependable and Secure Computing. IEEE, 2017, pp. 518–519.

[8] A. Case, R. D. Maggio, M. Firoz-Ul-Amin, M. M. Jalalzai, A. Ali-Gombe, M. Sun, and G. G. Richard III, “Hooktracer: Automatic detection and analysis of keystroke loggers using memory forensics,” Computers & Security, vol. 96, p. 101872, 2020.

[9] P. Bajpai and R. Enbody, “Memory forensics against ransomware,” in 2020 International Conference on Cyber Security and Protection of Digital Services (Cyber Security). IEEE, 2020, pp. 1–8.

[10] Q. Hua and Y. Zhang, “Detecting malware and rootkit via memory forensics,” in 2015 International Conference on Computer Science and Mechanical Automation (CSMA). IEEE, 2015, pp. 92–96.

[11] A. S. Bozkir, E. Tahillioglu, M. Aydos, and I. Kara, “Catch them alive: A malware detection approach through memory forensics, manifold learning and computer vision,” Computers & Security, vol. 103, p. 102166, 2021.

[12] R. Petrik, B. Arik, and J. M. Smith, “Towards architecture and OS-independent malware detection via memory forensics,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 2267–2269.