Article
Explainable Ensemble Learning Based Detection of Evasive
Malicious PDF Documents
Suleiman Y. Yerima 1, * and Abul Bashar 2
1 Faculty of Computing, Engineering and Media, Cyber Technology Institute, De Montfort University,
Leicester LE1 9BH, UK
2 Department of Computer Engineering, Prince Mohammad bin Fahd University, Khobar 31952, Saudi Arabia;
[email protected]
* Correspondence: [email protected]
Abstract: PDF has become a major attack vector for delivering malware and compromising systems
and networks, due to its popularity and widespread usage across platforms. PDF provides a flexible
file structure that facilitates the embedding of different types of content such as JavaScript, encoded
streams, images, executable files, etc. This enables attackers to embed malicious code as well as to
hide their functionalities within seemingly benign non-executable documents. As a result, a large
proportion of current automated detection systems are unable to effectively detect PDF files with
concealed malicious content. To mitigate this problem, a novel approach is proposed in this paper based
on ensemble learning with enhanced static features, which is used to build an explainable and robust
malicious PDF document detection system. The proposed system is more resilient against reverse mimicry
injection attacks than the existing state-of-the-art learning-based malicious PDF detection systems.
The recently released Evasive-PDFMal2022 dataset was used to investigate the efficacy of the proposed
system. Based on this dataset, an overall classification accuracy greater than 98% was observed with
five ensemble learning classifiers. Furthermore, the proposed system, which employs new anomaly-
based features, was evaluated on a reverse mimicry attack dataset containing three different types of
content injection attacks, i.e., embedded JavaScript, embedded malicious PDF, and embedded malicious
EXE. The experiments conducted on the reverse mimicry dataset showed that the Random Committee
ensemble learning model achieved 100% detection rates for embedded EXE and embedded JavaScript,
and 98% detection rate for embedded PDF, based on our enhanced feature set.
Keywords: malicious PDF detection; PDF malware; feature engineering; reverse mimicry attack;
malicious content injection; Shapley additive explanation; ensemble learning; explainable
machine learning
Citation: Yerima, S.Y.; Bashar, A. Explainable Ensemble Learning Based Detection of Evasive Malicious PDF
Documents. Electronics 2023, 12, 3148. https://doi.org/10.3390/electronics12143148
in targeted attacks and advanced persistent threat (APT) campaigns to accomplish one or more
stages of a multi-stage attack. One example is the MiniDuke APT campaign [4], in which infected
PDF files targeting an Adobe Reader vulnerability (CVE-2013-0640) were used for the
first stage of the attack.
The detection of malicious PDF documents is made more challenging by the fact that
the PDF format is complex and susceptible to a wide range of attacks, many of which take
advantage of legitimate PDF functionality, e.g., the embedding and encoding of a wide variety
of content types. Several static and dynamic analysis tools are available to facilitate manual
analysis of PDF documents for potentially malicious content. Examples of such tools
include PDFiD [5], PeePDF [6], PhoneyPDF [7], and PDF Walker [8]. However, the volume
of malicious PDF files that are constantly emerging makes it infeasible for the security
community to rely on manual analysis alone. While signatures can be utilized to facilitate
automated analysis to detect malicious files, this also comes with its own set of challenges,
including susceptibility to obfuscation and aging of signatures against the appearance of
new types of attacks.
To overcome these limitations, researchers have proposed learning-based systems
built on different types of features. Two popular kinds of features used in
current learning-based PDF malware detection systems are JavaScript-based features
and structural features. Learning-based systems that utilize JavaScript-based features extract
them by analyzing embedded JavaScript code to detect malicious behaviour, for example in
PJScan [9] or Lux0R [10]. Such systems, however, are only effective for detecting malicious
PDF files that contain JavaScript code. Examples of proposed learning-based systems that
rely on structural features of PDF files include PDF Slayer [11], Hidost [12], and PDFRate [13].
The use of structural features with machine learning became more widespread because it
enables fast automated detection of a wide variety of attacks including newly appearing
variants. Recently, detection systems that employ visualization-based features have also been
proposed. For example, the authors of [14] proposed a system in which PDF files are first converted to
grayscale images before visualization-based features are extracted for machine learning.
According to [15], one of the problems with machine learning-based classifiers in the
PDF malware detection domain is that mimicry attacks and reverse mimicry attacks are
quite effective against them. A reverse mimicry attack involves injecting or embedding
malicious content into benign PDF files such that the features of the benign file
mask the presence of the embedded malicious content from detection
systems. It is a form of evasive adversarial attack that can be performed on a large scale
using automated tools. Existing machine learning-based solutions such as PDF Slayer [11],
Hidost [12], and PDFRate [13] have been shown to have limited robustness against reverse
mimicry attacks. Even the more recent attempt at utilizing visualization techniques for
high accuracy PDF malware detection presented in [14] did not show substantial resilience
in the reverse mimicry attack experiments.
Hence, despite the advances that have been made in learning-based PDF malware
detection, resilience against evasive or adversarial attacks remains a significant challenge.
In order to mitigate the problem, this paper proposes a novel approach that uses an enhanced
feature set which extends existing structural features with anomaly-based ones, and utilizes
the power of ensemble learning to provide a high accuracy malicious PDF detection system
that is also resilient against injection-based adversarial attacks. The main contributions of this
paper are as follows:
• The paper proposes an ensemble learning-based system that employs an enhanced
feature set comprising structural and anomaly-based features. This feature set is a
unique one that is designed to enable robust and effective detection of malicious PDF
files including those that employ evasive techniques.
• The novel anomaly-based features that enable robust maldoc detection are described,
and their impact on the performance of the learning-based detection system,
as well as on its resilience to reverse mimicry attacks, is discussed.
Electronics 2023, 12, 3148 3 of 23
3. Related Work
Learning-based methods for detecting malware and malicious content in PDF documents have
proliferated in recent years due to the drive to create new approaches that will enhance or
complement existing anti-malware systems. In [3], the authors proposed an approach to detect
malicious content embedded in PDF documents. They focused on data encoded in the ‘stream’
tag along with other structural information. Their method decrypts encrypted blocks and
decodes encoded blocks within the stream tags and also utilizes other structural features.
These are given to a decision tree for classification. Although the paper claims that the method
is effective against mimicry attacks, no empirical evaluation was presented to support the claim.
In [18], a method for detecting and classifying suspicious PDF files based on YARA scan and
structural scan is presented. Their system inspects PDF documents to search for features
that are important in labelling PDF documents as suspicious. In [10], a system to detect
malicious JavaScript embedded in PDF files was presented. The system was called ‘Lux 0n
discriminant References’ (Lux0R). The authors of [9] presented PJScan, a tool which is designed
to uncover JavaScript from the malicious file and to extract its lexical properties via a tokenizer.
The output, which is a token sequence, is then used to train a machine learning algorithm to
detect malicious JavaScript-bearing PDF files.
Jeong, Woo, and Kang [19] presented a convolutional neural network (CNN) designed
to take the byte sequence of a stream object contained within a PDF file and predict whether
the input sequence contains malicious actions or not. The CNN model achieved superior
performance compared to traditional machine learning classifiers including SVM, Decision
Tree, Naive Bayes, and Random Forest. Albahar et al. [20] presented two learning-based
models for detection of malicious PDFs and experimented on 30,797 infected and benign
documents collected from the Contagio dataset and VirusTotal. Their first model was a
CNN model that used tree-based PDF file structure as features and yielded 99.33% accuracy;
the second model was an ensemble SVM model with different kernels which used n-gram
with object content encoding as features and yielded an accuracy of 97.3%. In [21], Bazzi
and Onozato used LibSVM to build a classification model which utilizes features extracted
from a report generated through dynamic analysis with Cuckoo sandbox. The study used
6000 samples for training and 10,904 samples for testing, obtaining an accuracy of 97.45%.
In [22], a PDF maldoc detection system was proposed based on extracting features with
PDFiD and PeePDF. They used both tools to extract keyword and structural features and
used malicious document heuristics to derive an additional set of features. Through feature
selection, the top 14 important features were selected, which led to an improved accuracy of
up to 97.9% for the ML classifier. Zhang proposed MLPdf in [23] which uses an MLP classifier
to detect PDF malware. Their system extracted a group of high quality features from two
real world datasets that contained 105,000 malicious and benign PDF documents. The MLP
model achieved a detection rate of 95.12% and a low false positive rate of 0.08%. Jiang et al. [24]
applied semi-supervised learning to the problem of malicious PDF document detection
by extracting structural features together with statistical features based on entropy sequences
using wavelet energy spectrum. They then employed a random sub-sampling approach to
train multiple sub-classifiers, with their method achieving an accuracy of 94%.
The authors of [25] compared the performance of machine learning classifiers with that of
traditional AV solutions by experimenting on PDF documents with embedded JavaScript. They
used 995 samples for training, 217 samples for validation, and 500 samples for testing and
obtained 92%, 50%, and 96% accuracy with Random Forest, SVM, and MLP, respectively. In [14],
the authors applied image visualization techniques of byte plot and Markov plot and extracted
various image features from both. They evaluated the performance using the Contagio PDF
dataset, obtaining very good results when testing with samples from the same dataset. They
also evaluated their models on a reverse mimicry attack dataset, with very limited success but
showing slightly improved robustness over the PDF Slayer approach. They experimented with
both Markov plot and byte plot visualization methods, applying various image processing
techniques used in extracting features to train RF, K-Nearest Neighbor (KNN), and Decision
Tree (DT) classifiers. The best method (byte plot + Gabor Filter + Random Forest) achieved an
F1-score of 99.48%.
Al-Haija, Odeh, and Hazem proposed in [26] a detection system for identifying benign
and malicious PDF files. Their proposed system used an optimally-trained AdaBoost decision
tree and their experiments were performed using the Evasive-PDFMal2022 dataset [27] (which
is also used in this paper). Their system achieved 98.4% prediction accuracy with 98.80%
precision, 98.90% sensitivity, and a 98.8% F1-score. In [28], the authors also utilized the
Evasive-PDFMal2022 dataset and applied an enhanced structural feature set to investigate the
efficacy of the enhanced set. Seven machine learning classifiers were evaluated on the dataset
using the enhanced features, and improved classification accuracy was noticed with 5 out of
7 of the classifiers compared to the baseline scenario without the enhanced features.
In [29], a system for detecting evasive PDF malware was proposed based on Stacking
ensemble learning. The detection system is based on a set of 28 static features which were
divided into ‘general’ and ‘structural’ features. Their system was evaluated on the Contagio
dataset, yielding an accuracy of 99.89% and F1-score of 99.86%. They also evaluated the
system on their newly generated Evasive-PDFMal2022 dataset [27], for which they achieved
98.69% accuracy and a 98.77% F1-score.
From our review of related work, it is evident that several of the proposed learning-
based detectors utilized features extracted only from JavaScript obtained from the PDF files,
e.g., [9,10,19]. While such systems may be able to detect PDF files incorporating content
injection attacks that involve embedded JavaScript, they may not be effective against other
types of content embedding attacks, e.g., those involving embedded PDF, Word, EXE, or other
types of content. Other works, such as [22,23,26,29], utilized structural features in their work,
but did not evaluate their approach against any type of adversarial attacks. Different from
the existing works, this paper aims to improve the robustness of malicious PDF document
detection by enhancing structural features with novel anomaly-based features and utilizing
the enhanced feature set to train ensemble learning classifiers. Furthermore, we present
experiments to demonstrate the resilience of our proposed approach to reverse mimicry
injection attacks, enabled by the new anomaly-based features.
4. Methodology
This section presents our proposed approach to automated ensemble learning-based
malicious PDF detection, which is based on an enhanced feature set consisting of 35 features
(29 structural features and 6 anomaly-based features). These features are extracted from the
labeled files that have been set aside as the training set. The instances consisting of the 35 extracted
features are fed into ensemble learning classification algorithms to learn the distinguishing
characteristics of benign and malicious PDF files, thus enabling the prediction and classification
of unlabeled PDF files as benign or malicious, as shown in Figure 2. The methods used in
building the proposed PDF classification system are discussed in the following sub-sections.
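As a rough illustration of how keyword-based structural features of this kind can be obtained, the following Python sketch counts occurrences of a few PDF keywords in the raw bytes of a file. The keyword subset (taken from keywords named later in the SHAP discussion), the function names `count_keyword` and `extract_features`, and the toy document are assumptions made for illustration only; this is not the extraction pipeline used in the paper, and it does not handle obfuscated names or compressed object streams.

```python
import re

# Assumed subset of structural keywords; the paper's full 29-feature list
# is not reproduced here.
KEYWORDS = ["obj", "endobj", "stream", "endstream", "/JS", "/JavaScript",
            "/OpenAction", "/AA", "/AcroForm", "/XFA", "/EmbeddedFile"]


def count_keyword(data: bytes, keyword: str) -> int:
    """Count occurrences of a keyword that is not part of a longer word."""
    kw = re.escape(keyword.encode("ascii"))
    # Names start with "/" (already a delimiter); bare words such as "obj"
    # need a look-behind so that "endobj" is not also counted as "obj".
    prefix = b"" if keyword.startswith("/") else rb"(?<![A-Za-z])"
    return len(re.findall(prefix + kw + rb"(?![A-Za-z])", data))


def extract_features(data: bytes) -> dict:
    """Map each keyword to its occurrence count in the raw PDF bytes."""
    return {kw: count_keyword(data, kw) for kw in KEYWORDS}


# Hypothetical minimal document used only to exercise the function.
sample = (b"%PDF-1.7\n"
          b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
          b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
          b"3 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n")
features = extract_features(sample)
print(features)
```

Each file then becomes one fixed-length vector of counts, which is the form of input that the ensemble classifiers described below expect.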
4.1. Datasets
Evasive-PDFMal2022 dataset: The first dataset used for the study in this paper is
a recently generated evasive PDF dataset (Evasive-PDFMal2022) [27] which was released
by Issakhani et al. [29]. This dataset has been generated as an improved version of the
well-known Contagio PDF dataset which has been utilized extensively in previous works.
According to [29], the Contagio dataset has several drawbacks which include (a) a high
proportion of duplicate samples with very high similarity, which was estimated as 44% of the
entire dataset and (b) lack of sufficient diversity of samples within each class of the dataset.
Thus, the new dataset aims to address the flaws found with the Contagio dataset and provide
a more realistic and representative dataset of the PDF distribution. It consists of 10,025 PDF
file samples with no duplicate entries (4468 benign and 5557 malicious).
PDF reverse mimicry dataset: This was the second dataset utilized in our study. It is
used to evaluate the robustness of our proposed approach to content injection attacks designed
to disguise malicious content by embedding it within benign PDF files. This is known
as the reverse mimicry attack, and it is a form of evasive adversarial attack that can be
performed on a large scale using automated tools. The reverse mimicry dataset [30] consists
of 1500 benign PDF files with embedded malicious components and is available online from
the Pattern Recognition and Applications lab (PRAlab), University of Cagliari, Italy. The
dataset consists of 500 PDF files containing embedded JavaScript, 500 PDF files containing
embedded PDF, and 500 PDF files containing embedded EXE payload. Further details on
how these reverse mimicry files were created can be found in [17]. Note that the detection of
malicious PDF files created by altering a benign file through such reverse mimicry attacks is a
challenging task. This is because the injected file will still retain the characteristics of a benign
PDF file, thus making it hard for learning algorithms to discriminate effectively.
Table 1. Initial feature set containing 29 structural features (NK: non-keyword based, K: keyword-based).
embedded files NK Indicates that an embedded file is present (not from keyword)
for some of the standard features. Mal_traits_all therefore provides an indicator that has
resilience against the occurrence of such errors.
4.4.4. AdaBoost
AdaBoost is based on Boosting, which incrementally builds an ensemble by training each
new model instance to emphasize the training instances that were misclassified in previous
iterations. Boosting [34] iteratively builds a succession of models with each one being trained
on a dataset with previously misclassified instances given more weight. All of the models are
then weighted according to their success and the outputs are combined by voting or averaging.
With AdaBoost, the training set does not need to be large to achieve good results, since the
same training set is used iteratively.
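The reweighting idea described above can be sketched in a few lines of Python. This is a minimal illustration of discrete AdaBoost, assuming decision stumps as the weak learners and labels in {-1, +1}; the toy one-dimensional data are not PDF features, and the function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                preds = [pol if row[j] >= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] >= thr else -pol


def adaboost(X, y, rounds=3):
    """Train a list of (alpha, stump) pairs on the same training set."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform instance weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # model weight
        ensemble.append((alpha, stump))
        # Misclassified instances gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single threshold can separate.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, row) for row in X])
```

On this toy set a single stump misclassifies three of the ten points, whereas the weighted vote of three stumps fits all of them, which is the effect of repeatedly refocusing on the hard instances.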
4.4.5. Stacking
This is also called Stacked Generalization [35]. It combines multiple base learners by
introducing the concept of a meta-learner and can be used to combine models of different
types, unlike boosting or bagging-based ensemble learners. The training set is split into two
non-overlapping sets and the first part is used to train the base learners while testing them on
the second part. Using the prediction/classification outcomes from the test set as inputs, and
correct labels as outputs, the meta-learner is trained to derive a final classification outcome.
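The two-stage procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the base learners (two threshold rules and a 1-nearest-neighbour rule), the lookup-table meta-learner, the two-feature toy data, and all function names are hypothetical and are not the configuration used in the paper.

```python
from collections import Counter, defaultdict


def fit_threshold(X, y, j):
    """Base learner: threshold feature j at the midpoint of the class means."""
    pos = [r[j] for r, t in zip(X, y) if t == 1]
    neg = [r[j] for r, t in zip(X, y) if t == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    thr = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda row: int(sign * (row[j] - thr) > 0)


def fit_1nn(X, y):
    """Base learner of a different type: 1-nearest neighbour."""
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, r)) for r in X]
        return y[dists.index(min(dists))]
    return predict


def fit_meta(base_preds, y):
    """Meta-learner: majority label observed for each tuple of base outputs."""
    table = defaultdict(Counter)
    for preds, t in zip(base_preds, y):
        table[tuple(preds)][t] += 1
    default = Counter(y).most_common(1)[0][0]
    lookup = {k: c.most_common(1)[0][0] for k, c in table.items()}
    return lambda preds: lookup.get(tuple(preds), default)


def fit_stacking(X, y):
    half = len(X) // 2
    X1, y1, X2, y2 = X[:half], y[:half], X[half:], y[half:]
    # Stage 1: train the base learners on the first part only.
    bases = [fit_threshold(X1, y1, 0), fit_threshold(X1, y1, 1), fit_1nn(X1, y1)]
    # Stage 2: their predictions on the held-out part train the meta-learner.
    level1 = [[b(row) for b in bases] for row in X2]
    meta = fit_meta(level1, y2)
    return lambda row: meta([b(row) for b in bases])


# Toy rows: [keyword count, metadata size]; 1 = malicious, 0 = benign.
X = [[3, 10], [0, 200], [2, 20], [0, 150], [4, 5], [0, 300], [1, 15], [0, 180]]
y = [1, 0, 1, 0, 1, 0, 1, 0]
stacked = fit_stacking(X, y)
print(stacked([5, 8]), stacked([0, 250]), stacked([1, 12]))
```

For the last query the first threshold rule disagrees with the other two base learners, and the meta-learner resolves the disagreement using what it observed on the held-out part.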
Table 4. Ensemble classifiers results with the new features (10-fold CV).
5.3. Investigating the Effect of the New Features against the Reverse Mimicry Attacks
As mentioned earlier, the reverse mimicry dataset consists of three content injection attacks
each with 500 samples. They include (a) embedded executable, (b) embedded JavaScript, and
(c) embedded PDF. The experiments were conducted by training the ensemble models with
all of the Evasive-PDFMal2022 samples and then using each set of 500 reverse
mimicry samples as a separate testing set. The first model training was done with only the 29 baseline
structural features and then the models were evaluated on the attack samples. The same
process was repeated with the full set of 35 features including the new anomaly-based features.
The results of these experiments are shown in Tables 5 and 6. The numbers depicted in brackets
in the table heading denote the number of samples used in the evaluation (a few of the initial
samples failed during the experiments).
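Because every sample in a reverse mimicry test set is malicious, the reported figure for each attack type is simply the share of samples that the trained model flags. A minimal sketch of that calculation is given below; the `toy_classifier` rule and the four feature dictionaries are hypothetical stand-ins for a trained ensemble and real extracted features.

```python
def detection_rate(classify, attack_samples):
    """Return (number detected, percentage detected) for all-malicious samples."""
    detected = sum(1 for sample in attack_samples if classify(sample) == 1)
    return detected, round(100.0 * detected / len(attack_samples), 1)


# Hypothetical stand-in for a trained ensemble: flags a sample when an
# anomaly indicator or a JavaScript keyword is present.
def toy_classifier(sample):
    return int(sample["mal_traits_all"] > 0 or sample["JS"] > 0)


attack_set = [
    {"mal_traits_all": 1, "JS": 0},
    {"mal_traits_all": 0, "JS": 2},
    {"mal_traits_all": 0, "JS": 0},  # evades the toy rule
    {"mal_traits_all": 1, "JS": 1},
]
detected, rate = detection_rate(toy_classifier, attack_set)
print(detected, rate)  # 3 75.0
```

By the same arithmetic, 477 detections would correspond to 95.4% if all 500 samples of an attack type were evaluated; the actual denominators are those given in brackets in the table headings.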
Table 5. Reverse mimicry attack dataset—ensemble classifiers results without the new features.
Table 6. Reverse mimicry attack dataset—ensemble classifiers results with the new features.
From Table 5 (without the new features), it can be seen that the best result for embedded
EXE was obtained with AdaBoost, with 71.2%, i.e., 355 samples detected. For the embedded JavaScript, the
best was the Stacking model, which detected 293 samples (58.8%). In the embedded PDF set,
the highest detection rate was only 12.2% (61 samples). This shows that the embedded PDF was the
most challenging attack to detect. One possible reason for this could be the lack of structural
features (keywords) that directly indicate when a PDF file is embedded within another PDF file. In the
feature set there are two keywords directly related to JavaScript, which may make it easier to
detect embedded JavaScript attacks. Another possible reason could be the way the embedded
PDF attack was crafted. The embedded PDF can be used to nest other features which will not
appear within the parent benign PDF, thus tricking the classifier into predicting the sample
as benign.
From Table 6 (with the new anomaly-based features), there is significant improvement in
the detection of the mimicry attacks. Random Committee and Stacking detected 449 (90.1%) and
466 (93.6%) samples of the embedded EXE attack, respectively. For the embedded JavaScript,
Stacking also detected 477 (95.4%) samples, while for the embedded PDF, the highest
was AdaBoost, with 68.9% (344 samples). Again, this highlights how challenging it is to detect
the PDF embedding attack, for the reasons mentioned earlier. However, there is improvement
compared to the results in the previous table; this can be attributed to the new anomaly-based
features introduced into the feature set. The significant improvement in the performance of
the Stacking learning model highlights the impact of the anomaly-based features introduced
into the mix. The new features mal_trait2, mal_trait3, and mal_traits_all are most likely to be
responsible for enhancing the ability of the ensemble learners to detect more embedded EXE
samples. The new feature mal_trait2 is likely to have had the most impact in improving the
models’ ability to detect embedded PDF samples. Figures 3–5 visually depict the percentages
of detected samples with and without the new features for each of the three types of reverse
mimicry attacks investigated.
Figure 3. Performance of the ensemble learners on the embedded EXE reverse mimicry samples
(with and without the new features).
Figure 4. Performance of the ensemble learners on the embedded JS reverse mimicry samples (with
and without the new features).
Figure 5. Performance of the ensemble learners on the embedded PDF reverse mimicry samples
(with and without the new features).
In Table 8, the results show dramatic improvement when the new anomaly-based features
were utilized in the training and testing sets (after augmenting the training set with adversarial
samples). Figures 6–8 visually depict the percentages of detected samples with and without
the new features, and with data augmentation for each of the three types of reverse mimicry
attacks investigated. It can be seen that Random Committee detected 98% of the embedded
PDF attacks compared to only 43.2% without the new features. These results show that data
augmentation as a means to improve detection of adversarial samples is most effective
when it is paired with the right feature set. A possible explanation for the significant improvement in
embedded PDF detection due to the new features is as follows: it is highly
likely that the combination of the anomaly-based features with other features produced new
patterns that were learned by the ensemble models, and these patterns were present in the
samples that the training set was augmented with. In a nutshell, we can conclude that the new
anomaly-based features significantly enhanced the robustness of the ensemble learning models
against reverse mimicry attacks via content injection.
Table 7. Reverse mimicry attack dataset—ensemble classifiers results with training set augmentation but
without the new features.
Table 8. Reverse mimicry attack dataset—ensemble classifiers results with training set augmentation and
the new features included.
Figure 6. Performance of the ensemble learners on the embedded EXE reverse mimicry samples
(with and without the new features), using training set augmented with attack samples.
Figure 7. Performance of the ensemble learners on the embedded JS reverse mimicry samples
(with and without the new features), using training set augmented with attack samples.
Figure 8. Performance of the ensemble learners on the embedded PDF reverse mimicry samples
(with and without the new features), using training set augmented with attack samples.
5.5. Explaining and Interpreting the Ensemble Model Using SHapley Additive exPlanations
In this section, we explain the ensemble model for evasive malicious PDF detection
using SHapley Additive exPlanations (SHAP). SHAP was introduced by Lundberg and Lee
in 2017 [36] as a model-agnostic method of explaining machine learning models based on
Shapley values taken from game theory. SHAP quantifies the impact of each feature as its
average marginal contribution to the model’s output, i.e., the change in the prediction when the
feature is included versus excluded, averaged over subsets of the remaining features. Thus,
it provides an understanding of how much each feature contributes to the prediction. Figure 9
shows a SHAP summary chart from which we can visualize the importance of the features
and their impact on predictions. This plot was generated by building an ensemble model
with 80% of the dataset and testing on the remaining 20%. The features are sorted in descending
order of SHAP value magnitudes over all testing samples. The SHAP values are also used to
show the distribution of the impacts each feature has on the prediction. The colour represents
the feature value, with red indicating high and blue indicating low.
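To make the notion of a Shapley value concrete, the sketch below computes exact Shapley values for a three-feature toy scoring function by enumerating every subset of the other features. The scoring function, the all-zero baseline, and the function names are assumptions chosen for illustration; practical SHAP implementations rely on approximations or model-specific algorithms rather than this exhaustive enumeration, which grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a value function defined on feature subsets."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Probability that exactly this many features precede feature i
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to this subset
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def model(x):
    # Toy scoring function with an interaction between features 0 and 2
    return 2 * x[0] + x[1] + 3 * x[0] * x[2]


instance = [1, 1, 1]   # the sample being explained
baseline = [0, 0, 0]   # reference values used for "absent" features


def value(coalition):
    """Model output when only the features in the coalition take their real values."""
    mixed = [instance[i] if i in coalition else baseline[i]
             for i in range(len(instance))]
    return model(mixed)


phi = shapley_values(value, 3)
print(phi)
```

The three values sum to the difference between the model output for the instance and for the baseline, and the interaction term is split equally between features 0 and 2.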
From Figure 9, we can see that metadata size, JavaScript, and mal_traits_all were the
top three features with the most impact, according to the SHAP values. The plot also shows that
when the metadata size is low (blue), the model predicts positively, i.e., as a malicious PDF, in
most cases. However, when the metadata size is large (red), it pushes the prediction
towards benign. We can also see that in most cases when
JavaScript or JS is present (red), the model predicts malicious PDF, while it predicts benign PDF
if it is absent (blue). When mal_traits is present (red), malicious PDF is predicted,
and when it is not present (blue), in many instances this leads to a prediction of benign PDF.
The same is true for mal_trait2. The plot also shows that for many test samples, when text,
images, and number of streams are low (blue endstream, stream) or the number of objects is low
(blue obj and endobj), the PDF is likely to be predicted as malicious. For text, high values
(red) indicate benign PDF in many cases, which makes sense because those will be genuine
documents as opposed to crafted PDFs that have been manipulated for nefarious purposes.
The plot also shows that the model predicts malicious PDF for many instances where XFA,
OpenAction, and EmbeddedFile were present (red). Note that these plots relate only to the
particular test set that was used and will differ for another test set with a
different distribution of the features.
In Figure 10, the SHAP summary chart depicts the impacts of the top 10 features on an
ensemble model’s prediction on the embedded EXE reverse mimicry test set. It shows that the
presence of Acroform (which is indicative of potential manual input into the document) has a
negative impact on the prediction (i.e., benign is predicted), while its absence has the opposite
effect. The same negative impact is seen when metadata size is large (red) or there is a large
number of objects or pages in the PDF document. The presence of mal_traits (i.e., any of the
new anomaly features) leads to a positive prediction, and so does the presence of embedded files.
The SHAP summary chart in Figure 11 depicts the impacts of the top 10 features on an
ensemble model’s prediction on the embedded PDF reverse mimicry test set. It shows that a
high number of streams (endstream and stream being red) indicates malicious PDF, while in
some cases a low number of streams also indicates malicious PDF. This could mean that a
combination with other features influences the prediction, or some of these instances could be
incorrectly classified. The figure also shows that positive predictions (i.e., malicious PDF) are
made when metadata size is low, when title characters are absent, when PDF size is large (which can
be an indicator for embedded PDF), and when OpenAction (which could be used to manipulate
the embedded PDF) is present (red).
In Figure 12, the impacts of the top 10 features on an ensemble model’s prediction on the
embedded JavaScript reverse mimicry test set are shown. The model’s positive (malicious PDF)
prediction can be explained by low metadata size, a smaller number of objects, fewer
title characters, fewer streams, and the absence of Acroform. The presence of Acroform, the JS
keyword, and the AA feature seem to be indicators of negative (benign PDF) prediction amongst
the samples of embedded JavaScript from this reverse mimicry dataset used to analyze the
model with SHAP. In a nutshell, these SHAP summary charts demonstrate the explainability
of the models, which is crucial for increasing trust in our proposed approach while giving
us insight into the models’ decision-making.
Figure 9. Each feature’s impact on the model’s predictions as determined by SHAP, for the ensemble
model’s prediction on test samples that do not contain reverse mimicry content injection.
Figure 10. Each feature’s impact on the model’s predictions as determined by SHAP, for the ensemble
model’s prediction on test samples that consist of embedded EXE within the PDF files.
Figure 11. Each feature’s impact on the model’s predictions as determined by SHAP, for the ensemble
model’s prediction on test samples that consist of embedded PDF within the PDF files.
Figure 12. Each feature’s impact on the model’s predictions as determined by SHAP, for the ensemble
model’s prediction on test samples that consist of embedded JavaScript within the PDF files.
System | Embedded EXE | Embedded JavaScript | Embedded PDF
PJScan | 1% | 87.7% | 3%
Stacking with our enhanced feature set | 93.6% (466) | 95.4% (477) | 38.9% (194)
Random Committee with our enhanced feature set and training set augmentation | 100% (448) | 100% (450) | 98% (440)
Figure 13. Performance of the ensemble learners on the reverse mimicry samples, with training set
augmented with attack samples.
to extract a hybrid of complementary features that the system can use to make it resilient to
such failures. Another limitation of our proposed approach is that it could be susceptible to
obfuscation, whereby some of the features could be masked. A possible countermeasure for
this is to perform content analysis rather than relying solely on extraction of such features
from structural keywords. The content analysis-based features could also provide a hybrid
composite features approach when combined with the structural and anomaly-based features.
Author Contributions: Conceptualization, S.Y.Y.; Methodology, S.Y.Y. and A.B.; Software, S.Y.Y. and
A.B.; Validation, A.B.; Formal analysis, A.B.; Resources, A.B.; Writing—original draft, S.Y.Y. and
A.B.; Writing—review & editing, A.B. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The original datasets used in this research work were taken from the
public domain. The pre-processed datasets are available on request from the author.
Acknowledgments: This work is supported in part by the 2022 Cybersecurity research grant number
PCC-Grant-202228, from the Cybersecurity Center at Prince Mohammad Bin Fahd University, Al-Khobar,
Saudi Arabia.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Singh, P.; Tapaswi, S.; Gupta, S. Malware detection in PDF and office documents: A survey. Inf. Secur. Glob. Perspect. 2020, 29, 134–153.
2. Goud, N. Cyber Attack with Ransomware Hidden Inside PDF Documents. Available online:
https://www.cybersecurity-insiders.com/cyber-attack-with-ransomware-hidden-inside-pdf-documents/ (accessed on 31 October 2022).
3. Nath, H.V.; Mehtre, B. Ensemble learning for detection of malicious content embedded in pdf documents. In Proceedings of the
2015 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), Kozhikode, India,
19–21 February 2015; pp. 1–5.
4. Mimoso, M. MiniDuke Espionage Malware Hits Governments in Europe Using Adobe Exploits. Available online:
https://threatpost.com/miniduke-espionage-malware-hits-governments-europe-using-adobe-exploits-022713/77569/ (accessed on 31 October 2022).
5. Stevens, D. PDF Tools. Available online: https://blog.didierstevens.com/programs/pdf-tools/ (accessed on 25 September 2022).
6. Stevens, D. Peepdf—PDF Analysis Tool. Available online: https://eternal-todo.com/tools/peepdf-pdf-analysis-tool (accessed on
25 September 2022).
35. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
36. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference
on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.