Comparing Transformers and CNN Approaches for Malware Detection: A Comprehensive Analysis
Abstract—Detecting malicious software, known as malware, is crucial in cybersecurity due to constantly evolving threats and the ways in which malware tries to avoid detection. This research investigates the efficacy of deep learning models for multi-class malware classification using the MaleVis image dataset. Research on using transformer architectures for malware detection and classification is limited. The proposed study explores using the Convolutional Vision Transformer (CvT) for detecting malware, comparing its performance with the Vision Transformer (ViT) and the pre-trained Convolutional Neural Network, EfficientNet B0. Each model is fine-tuned on the MaleVis dataset to distinguish between different malware categories and benign samples. A comprehensive assessment using multiple evaluation metrics suggests CvT outperformed the other models, with an F1-Score of 0.96054. ViT followed closely with a score of 0.95821, while EfficientNet B0 scored 0.87386. The research aims to contribute to cybersecurity advancements by leveraging modern deep learning techniques for enhanced malware detection.

Index Terms—Malware, Convolutional Vision Transformer, Vision Transformer, EfficientNet B0, Convolutional Neural Network.

I. INTRODUCTION

Malware can be considered as any malicious software that disrupts the intended operation of a system or gathers sensitive information. It encompasses viruses, worms, ransomware, spyware, trojans, etc., categorized depending on the type of attack and its functioning [1]. Widely known cyber attacks include the Stuxnet worm, WannaCry ransomware, NotPetya malware, and the Equifax data breach [2]. Every year, malware attacks create substantial risks to computer systems, networks, and mobile devices. Security organizations are processing a growing number of malware samples, with some handling over 450,000 malware and potentially unwanted applications (PUAs) per day [3]. As the volume of malware continues to rise, manual analysis becomes impractical. Machine learning and deep learning offer promising solutions to accelerate the analysis process [4][5]. Deep learning methods are capable of determining the prominent features by leveraging deep neural networks rather than relying solely on expert domain knowledge to identify the features for classification [6][7].

Former approaches for malware classification depended on signature-based and heuristics-based methods. Signature-based methods relied on an existing database of known malware to compare and identify malicious software. This method is ineffective in cases of new and unknown malware. Heuristic methods analyze the behaviour of programs to detect malware based on suspicious patterns. In an image-based malware detection approach, the binary data are converted to RGB or greyscale images, and machine learning or deep learning models then analyze these images for patterns characteristic of various malware families. The transformation of binaries into a visual format allows for the application of vision transformers and convolutional neural networks (CNNs) to uncover subtle anomalies and correlations in malware images, providing a dynamic and robust detection method [8].

Previous research works, such as [9], were among the earliest to use malware byteplot images for classification, applying image processing for feature extraction and faster classification. Subsequently, CNN algorithms prevalent in computer vision were adapted for malware detection and classification. The seminal work by Dosovitskiy et al. [10] established the basis for employing transformer architecture in image classification, subsequently facilitating its application in malware detection. More recent studies, such as [11] employing CNNs and [12] utilizing vision transformers, primarily focus on datasets composed exclusively of malware samples, hampering the model’s capability to effectively differentiate between benign and malicious samples within a detection framework. Moreover, the research in [13] achieves a notable accuracy of 97% in binary classification but performs poorly in multiclass classification, with a macro F1-score of 0.497.

This research work investigates the application of the Convolutional Vision Transformer (CvT) for malware detection. CvT, introduced in [14], leverages the combined advantages of CNN and Vision Transformer, demonstrably improving image recognition tasks. Fine-tuning a pre-trained CvT model on the MaleVis dataset, which encompasses 25 malware classes and a benign class, is proposed. Furthermore, a comparative analysis is conducted to evaluate the performance of CvT against the Vision Transformer and EfficientNet B0, a conventional CNN model noted for its effectiveness in [11].

II. RELATED WORK

The continued proliferation of malware necessitates diverse detection methods. In response to this challenge,
researchers have investigated various techniques, particularly the transformation of malware binaries into image data for classification. Nataraj et al. were among the first to use this method [9], observing that visual patterns in malware images could effectively distinguish between families of malware and applying image processing techniques to classify these patterns. This method promises a reduction in feature extraction complexity and potentially faster classification times compared to traditional methods requiring detailed binary or behavioural analysis. Another study by Ben et al. [15] builds upon this by employing the K-Nearest Neighbors (KNN) algorithm along with GIST descriptors for feature extraction from malware images. The study uses a dataset consisting of malware from 25 different families. The model achieved a high classification accuracy of 97%, demonstrating the efficacy of image-based features for malware classification. The authors suggest better performance can be achieved using deep learning techniques. In [16], the authors propose a novel lightweight vision transformer for malware detection applications in resource-constrained devices. The proposed model transforms executable bytecode into images for the transformer model to learn. The results suggested that the achieved accuracy of 94% outperforms traditional CNN models, also confirming that ViTs do not require deeper network layers to achieve similar performance.

Jayasudha et al. [11] explore the effectiveness of six popular CNN architectures across three datasets with varying class imbalances: Malimg (highly imbalanced), MaleVis (balanced), and a blended dataset (intermediately imbalanced). The authors highlight the shortcomings of traditional signature-based methods in modern malware detection and propose transfer learning as a solution. Transfer learning is suggested for its ability to automate feature extraction and recognize patterns indicative of malicious behavior. Results indicate that model performance varies significantly with class imbalance, with fewer training epochs needed for convergence on more imbalanced datasets. ResNet50, EfficientNetB0, and DenseNet169 performed well across all dataset types, while VGG16 and XceptionNet showed sensitivity to imbalances. The study achieved up to 97% precision on the highly imbalanced dataset and 95% on the intermediate and balanced datasets. However, the absence of benign byteplot images in the training data poses a challenge for real-world application, potentially hindering the model’s ability to distinguish between malicious and legitimate software.

Following the successful adoption of transformer architectures in image classification tasks, exemplified by Dosovitskiy et al. [10], researchers began exploring their application in malware classification. Ben et al. [12] compared the vision transformer with CNN models using the Malimg dataset, emphasizing the transformer’s improved performance, especially with large and complex datasets. The study highlights the importance of choosing the optimal model based on specific task requirements and computational resources. However, like previous studies, this research is limited to malware samples, potentially reducing its applicability in real-world settings. In a separate study [13], Sachith et al. introduce SHERLOCK, a novel deep learning model based on the vision transformer architecture and self-supervised learning for malware detection. SHERLOCK demonstrates robust learning from unlabeled data, enhancing generalization and the ability to detect new malware types, achieving 97% accuracy in binary classification. However, its macro F1-scores for classifying up to 47 types and 696 families were 0.497 and 0.491, respectively.

Limited research currently applies transformer architecture specifically for malware detection. Previous studies have predominantly focused on datasets composed exclusively of malware samples. Although these studies provide significant insights into malware classification, the lack of benign samples in the training dataset impedes the model’s ability to distinguish between malware and non-malicious samples effectively. In this study, the MaleVis dataset is utilized, which includes 25 malware classes and one benign class. Moreover, Convolutional Vision Transformers (CvT) are explored; they integrate the strengths of both convolutional neural networks (CNNs) and Vision Transformers (ViTs) to improve the efficiency and effectiveness of image recognition tasks [14]. Additionally, a comparative analysis of CvT, ViT, and a traditional CNN model (EfficientNet B0) is performed to assess their performance in malware detection.

III. METHODOLOGY

A. System Model

The proposed system model is illustrated in figure 1. The research utilizes the MaleVis dataset, partitioned into three subsets for training, validation, and testing. Data preprocessing includes resizing images, applying data augmentation to the training data, and normalizing pixel values. The preprocessed data is subsequently used to fine-tune three pre-trained models: the Vision Transformer, Convolutional Vision Transformer, and EfficientNet-B0. The performance of these fine-tuned models is assessed on the test data using multiple evaluation metrics.

Fig. 1. Proposed System Model

B. Dataset

The MaleVis dataset offers a collection of malware visualizations specifically designed for malware analysis and detection research. The concept behind the dataset is to transform malware binaries into visual image formats, allowing for the examination of distinctive patterns and characteristics
associated with malicious software. The Multimedia Information Lab under the Department of Computer Engineering of Hacettepe University, in collaboration with Comodo Inc., created the MaleVis dataset [17]. It comprises 14,226 RGB images across 26 classes, including 25 malware categories and 1 legitimate class. The binary data sourced from Comodo Inc. were transformed into 3-channel RGB images using the ‘bin2png’ tool and resized into square formats of 224x224 and 300x300 pixels. For the purposes of this study, the 224x224 format is used. The dataset features a broad range of malware types, providing a robust base for the training and testing of machine learning and deep learning models. Each dataset entry consists of an image representing the original binary data of either malware or legitimate software, complete with a label specifying the category. The distribution of the various classes within the dataset is illustrated in figure 2.

Fig. 2. MaleVis Data Distribution

C. Data Preprocessing

Data preprocessing is an essential first step for enhancing the performance of deep learning models. It involves preparing the image data to improve its quality and consistency for effective model training and evaluation. Images are resized to 224x224, as required by the CvT, ViT, and EfficientNet-B0 models used in this study. The normalization process calculates the mean and standard deviation for each RGB channel and transforms the channel values by subtracting the mean and dividing by the standard deviation. This normalizes the pixel values, facilitating better model convergence during training by keeping the scale and distribution uniform. Additionally, data augmentation strategies are applied to the training data to improve the models’ robustness and generalization capabilities. These augmentation techniques include random resized cropping and randomized horizontal flipping. The former randomly crops and resizes images to a consistent size, aiding the model in recognizing patterns and objects at various scales and orientations. The latter flips the image horizontally with a probability of 0.5, adding variability to the training set without needing extra data collection, thus enhancing the model’s resilience to mirrored inputs and its generalization capacity. For validation and test datasets, resizing and center cropping ensure uniformity in the content of the images being analyzed. By implementing these preprocessing techniques, which are integrated using PyTorch’s ‘torchvision.transforms’ module, deep learning models can be trained more effectively, enhancing performance and reliability.

D. Model Fine-Tuning

The research work uses two pre-trained transformer architectures, Vision Transformer (ViT) and Convolutional Vision Transformer (CvT), from the Hugging Face ‘transformers’ library. These transformer models are fine-tuned on the preprocessed MaleVis training data to adapt them to the specialized task of malware byteplot image classification. Initially, the ‘ViTForImageClassification’ and ‘CvtForImageClassification’ models, which the library provides pre-configured for image classification tasks, are loaded. Then, the models’ output layers are adjusted to correspond to the number of unique labels in the dataset, ensuring the models’ predictions are tailored to the specific classification requirements. The training process is configured using the ‘TrainingArguments’ class. Key parameters include a moderate batch size of 16, chosen to balance computational efficiency and model performance, and a low learning rate of 0.0002 to adjust the pre-trained weights only lightly. The training process is set to evaluate the models’ performance at the end of each epoch, saving checkpoints only if there is an improvement in the F1 score, which is used as the primary performance metric due to its relevance in balancing precision and recall, particularly in datasets with uneven class distributions. The fine-tuning process is executed using the ‘Trainer’ class, streamlining the training, validation, and testing of the model.

The EfficientNet-B0 used is a pre-trained CNN model from the TensorFlow Keras applications module. This model is augmented with two additional layers: a global average pooling layer and a subsequent prediction layer equipped with 26 units and a sigmoid activation function, making it suitable for multi-label classification. Prior to fine-tuning, out of the available 238 layers in the model, the first 150 layers are frozen, allowing only the subsequent layers to be updated during training. The model is compiled using the Adam optimizer and categorical cross-entropy loss, well-suited for tasks involving multiclass classification. Finally, the model is fine-tuned for 20 epochs on the training data, with the validation data used to measure the model’s performance during training and to check for overfitting.

This fine-tuning approach enhances the model’s performance on the target image classification task of malware detection and classification from byteplot images.

E. Evaluation Metrics

No single metric captures all aspects of a model’s performance. To comprehensively evaluate the CvT, ViT, and EfficientNet-B0 models in the multiclass classification task, multiple metrics are employed. Accuracy offers a general
overview, but it doesn’t distinguish between error types. Macro-averaging addresses this by providing average precision, recall, and F1-score across all classes, giving each class equal weight and avoiding bias toward more frequent classes. The F1-score, the harmonic mean of precision and recall, is given more emphasis in the study because it balances the trade-off between these two metrics, leading to a more comprehensive evaluation.

IV. RESULTS

The research work is carried out using the MaleVis dataset. Subsets of the dataset are created for training, validation, and testing purposes. This data is utilized to fine-tune two pre-trained transformer models, Vision Transformer and Convolutional Vision Transformer, and a pre-trained CNN model, EfficientNet-B0. The raw image data is resized according to the model requirements. The 3-channel RGB values are also normalized. Data augmentation is performed on the training data, improving the generalization ability and robustness of the models. Resizing and centre cropping are applied to the validation and test data to keep the analyzed images uniform. The models are fine-tuned on the dataset with the help of the Hugging Face ‘transformers’ library and the TensorFlow Keras applications module.

Subsequently, the ViT model’s performance is assessed on unseen test data. It records an F1-score of 0.95821 and a precision of 0.96148. The confusion matrix presented in figure 4 provides a more detailed understanding of the model performance and its errors.
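The macro-averaged precision, recall, and F1-score used to evaluate the models can be illustrated with a minimal pure-Python sketch. It is not the authors' evaluation code: the function name `macro_scores` and the class labels `benign`, `family_a`, and `family_b` are hypothetical placeholders. The sketch assumes that the macro F1-score is the unweighted mean of the per-class F1-scores.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all classes seen.

    Hypothetical helper for illustration; every class gets equal weight.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted, but wrongly
            fn[t] += 1      # class t was the truth, but was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        # F1 is the harmonic mean of precision and recall
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with placeholder class names (not MaleVis labels)
y_true = ["benign", "benign", "family_a", "family_a", "family_b"]
y_pred = ["benign", "family_a", "family_a", "family_a", "family_b"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8889 0.8333 0.8222
```

Because each class contributes equally to the average, a poorly recognized minority class lowers the macro scores as much as a poorly recognized majority class would, which is why macro-averaging suits datasets with uneven class distributions.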
Fig. 3. ViT Training-Validation Loss

Fig. 6. CvT Model Confusion Matrix

C. EfficientNet-B0

Fine-tuning the EfficientNet-B0 architecture resulted in an overall decrease in both training and validation loss, as shown in figure 7. Initially, the validation loss is higher than the training loss, highlighting difficulties in the model’s capability to generalize to the validation data at the onset of training. On continued training, generalization improves, although the persistent fluctuations in validation loss indicate potential areas for optimization. Despite fluctuations, the overall decrease in both training and validation losses indicates the model is able to learn. The significant drops in validation loss at several points suggest the model achieves better generalization periodically. Figure 8 illustrates the training and validation accuracy of the CNN model, with both metrics showing significant improvement early on as the training epochs increase. The initial disparity between training and validation accuracies diminishes as the epochs advance. By the 20th epoch, the accuracy measures begin to stabilize.

Evaluation on unseen test data yielded an F1-score of 0.87386 and a precision score of 0.89240. A detailed

D. Comparative Analysis

Table I presents a detailed performance comparison of the multiclass classification executed using three models: ViT, CvT, and EfficientNet-B0 (CNN). The results show that the CvT model, with an F1-score of 0.96054, slightly surpasses the ViT, which achieved a score of 0.95821, and significantly exceeds the performance of EfficientNet-B0, which recorded an F1-score of 0.87386. Besides achieving impressive performance metrics, the CvT model also provides substantial benefits in terms of training efficiency. On an A100 GPU in Google Colab, CvT required approximately 43 minutes to complete training, less than half of the 91 minutes taken by the ViT model. When evaluating both the F1-score and training duration, the CvT model emerges as the most efficient and effective option among the three examined.

V. CONCLUSION

The research work implements the fine-tuning of a pre-trained Convolutional Vision Transformer (CvT) for malware