Malicious PDF Files Detection 2017

The paper presents a machine learning approach for detecting malicious PDF files by extracting features from both the PDF structure and embedded JavaScript code. It highlights the inadequacies of existing detection methods and proposes a new method that combines structural and JavaScript feature vectors to improve detection accuracy. Experimental results demonstrate the effectiveness of the proposed method compared to previous techniques, achieving significantly higher accuracy in identifying malicious PDFs.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/319238475

Malicious PDF Files Detection Using Structural and Javascript Based Features

Conference Paper · May 2017


DOI: 10.1007/978-981-10-6544-6_14


Malicious PDF Files Detection Using Structural and
JavaScript Based Features

Sonal Dabral1, Amit Agarwal2, Manish Mahajan3, Sachin Kumar4

1,3 Computer Science & Engineering, Graphic Era University, Dehradun
2 Computer Science & Engineering, Indian Institute of Technology, Roorkee
4 Centre for Transportation Systems, Indian Institute of Technology Roorkee
{sonaldabral26, amitagrawal1909, manish.mhajn, sachinagnihotri16}@gmail.com

Abstract. Malicious PDF files have recently been considered one of the most
dangerous threats to system security. The flexible, code-bearing nature of the
PDF format enables an attacker to execute malicious code on a user's computer
system. Many solutions have been developed by security agents to protect
users' systems, but they remain inadequate. In this paper, we propose a
method for malicious PDF file detection based on machine learning. The
proposed method extracts features from the PDF file structure and from
embedded JavaScript code, relying on an advanced parsing mechanism. Instead
of looking for a specific attack inside the content of the PDF, which is a
quite complex procedure, we extract features that are often used in attacks.
Moreover, we present experimental evidence for the choice of learning
algorithm, which provides remarkably high accuracy compared to other existing
methods.
Keywords: Machine learning, PDF, JavaScript, Malware.

1 Introduction

Portable Document Format (PDF) is an electronic document format released in
1993 by Adobe Systems Inc. that allows the publishing and exchange of documents
[1]. Nowadays, PDF is very popular because it is the preferred means of
exchanging documents between organizations and between people such as students
and professionals. Owing to its popularity, flexible structure and versatile
functionality, it has become a popular malware distribution vector for user
exploitation, in attacks ranging from the server side to the client side. The
interest of miscreants has recently shifted from server-side to client-side
attacks, because these give the attacker a good opportunity to exploit client
applications (e.g. PDF readers) that are not up-to-date. The goal is to take
advantage of users' lack of security knowledge by fooling them into opening a
malicious PDF document with applications found on most users' computers [2].

One of the most popular client applications for reading and exchanging documents is Adobe Reader. Attackers may
exploit specific vulnerabilities of the reader application. In addition to exploiting the PDF reader's vulnerabilities,
attackers also take advantage of the many advanced features of PDF, such as /Launch, which can automatically
run an embedded script to manage OS-specific events, or /GoTo and /URI, which can automatically open remote
resources on the internet and thereby create risk [3]. Attackers often use JavaScript code to divert the usual execution
flow to malicious code, which can be done through buffer overflow, heap spraying and Return Oriented Programming
(ROP) [4]. To bypass detection, attackers mainly use advanced encryption techniques so that they can easily hide the
malicious code or embedded files in the PDF [1].
Recent academic work on malicious PDF file detection falls into two categories: dynamic and static. One line of
work detects malicious JavaScript code within PDF files using both dynamic and static methods [5][6][7]. Another
line comprises structure-based approaches to malicious PDF detection using static analysis [4][9]. The advantage of
the structural approach over JavaScript analysis is that it can detect non-JavaScript attacks and is not affected by
code obfuscation, because it does not focus on analyzing the content itself. However, further research showed that
attackers can evade such detectors through deliberate attacks [10]. Therefore, work has focused again on malicious
JavaScript code detection [11].
This paper proposes a machine learning method for malicious PDF file detection in which we combine a PDF
structure feature vector with a JavaScript feature vector, extracted from the PDF file structure and from the
JavaScript embedded in the PDF file, respectively. The set of PDF structure features includes general characteristics
of the PDF structure as well as dynamic characteristics expressed through keywords such as /JavaScript,
/OpenAction and /URI; the JavaScript features are obtained from the JavaScript code in the PDF file. Because recent
research shows that the vast majority of PDF-related vulnerabilities rely on JavaScript, we also analyze the
JavaScript code inside the PDF file. Instead of looking for a specific attack inside the JavaScript code, however, we
extract features of the code that can be used to conduct an attack. The extraction is carried out efficiently with the
PDF analysis tool Origami, which overcomes the parsing-related weaknesses of prior work and provides significant
features to the classifier for effective, enhanced detection of malicious PDF files. We tested different ensemble
machine learning techniques to choose the classifier for our experiments. A good choice of ensemble classifier gives
a significant improvement in malicious PDF file detection.

1.1 The PDF File Structure

A PDF file is a hierarchical structure of objects that are logically connected to each other. The structure of a PDF
file determines how objects are stored in the file and how they are accessed and updated [1]. The PDF file structure
is made up of four parts, shown in Figure 1.
• Header: gives the version number of the PDF specification used by the file.
• Body: the largest part of the file; it holds all the PDF objects and contains the data or information that is
shown to the user.
• Cross-reference table (CRT): gives the position of every indirect object; each object is represented by one
entry in the table.
• Trailer: gives the location of the CRT and information about the root object.
Fig. 1. An example of PDF file structure
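
The four parts listed above can be located in a small file with a few byte-level searches. The following Python sketch is only an illustration under stated assumptions: the sample document built by build_sample_pdf() is hypothetical, the function names are ours, and the code handles only a single-revision file with a classic cross-reference table (no cross-reference streams, no incremental updates). It is not the Origami-based parser used in this paper.

```python
import re


def build_sample_pdf() -> bytes:
    """Assemble a tiny PDF-like byte string (hypothetical sample for illustration)."""
    header = b"%PDF-1.7\n"
    obj1 = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    obj2 = b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    off1 = len(header)                 # byte offset of object 1
    off2 = off1 + len(obj1)            # byte offset of object 2
    xref_offset = off2 + len(obj2)     # byte offset of the cross-reference table
    xref = (
        "xref\n0 3\n"
        "0000000000 65535 f \n"
        f"{off1:010d} 00000 n \n"
        f"{off2:010d} 00000 n \n"
    ).encode("ascii")
    trailer = (
        "trailer\n<< /Size 3 /Root 1 0 R >>\n"
        f"startxref\n{xref_offset}\n%%EOF\n"
    ).encode("ascii")
    return header + obj1 + obj2 + xref + trailer


def describe_layout(data: bytes) -> dict:
    """Locate header, body, cross-reference table and trailer in a simple PDF."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    trailer_pos = data.rfind(b"trailer")
    if header is None or startxref is None or trailer_pos == -1:
        raise ValueError("not a PDF with a classic cross-reference table")
    xref_offset = int(startxref.group(1))
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[trailer_pos:])
    return {
        "version": header.group(1).decode("ascii"),
        "body": (header.end(), xref_offset),   # byte range that holds the objects
        "xref_offset": xref_offset,            # where the trailer says the CRT starts
        "xref_ok": data.startswith(b"xref", xref_offset),
        "trailer_offset": trailer_pos,
        "root_object": int(root.group(1)) if root else None,
    }


layout = describe_layout(build_sample_pdf())
print(layout)
```

Reading the file from the end (startxref, then the trailer, then the root object) mirrors how the trailer and CRT are described above: the trailer points to the CRT, and the CRT points to the objects in the body.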

2 Related Work

The increased prevalence of malicious documents has generated interest in techniques for malware analysis of
such documents over the years. Previous research focused on two methods for malicious PDF detection: static and
dynamic. Li et al. and Shafiq et al. [12, 13] presented methods for detecting malcode embedded in Word documents
through static analysis using n-grams, and introduced a novel dynamic run-time test, but the approach remains
limited by the size of the malcode. This work was not designed for PDF files; it focused on other file formats such
as doc and exe. Detection can be evaded with modern obfuscation methods such as AES encryption [1], and
vulnerabilities can be exploited with methods such as heap spraying and Return Oriented Programming (ROP) [4].
These exploitation methods are carried out using JavaScript code embedded in the PDF file. Researchers have
therefore mainly targeted JavaScript code in PDF files.
Laskov and Šrndić [6] developed PJScan, a tool based on static analysis that detects malicious PDF documents
through lexical analysis of JavaScript code. They used a machine learning approach, a One-Class Support Vector
Machine, to automatically generate models from the available data for classifying test data. However, this approach
showed a lower detection rate and could not analyze obfuscated code that behaves maliciously at execution time.
To overcome this limitation, Snow et al. [14] proposed ShellOS, which uses dynamic analysis to detect code
injection attacks at runtime. It relies on hardware virtualization, which provides fast and precise analysis of code
and also makes it possible to detect obfuscated code.
Moreover, Tzermias et al. [7] demonstrated that antivirus systems are of limited effectiveness at detecting
malicious PDF documents. To build a more reliable detection system, they combined static and dynamic analysis
and introduced MDScan, a standalone malicious PDF file scanner that focuses specifically on vulnerabilities. A
similar approach was adopted by Schmitt et al. [15], who presented PDF Scrutinizer, a tool for detecting current
malicious PDF files that shows a low false-positive rate. It focuses mainly on JavaScript-based attacks.
Dynamic analysis of JavaScript code may be computationally expensive and complex. To reduce cost and
increase speed, research focused again on static analysis.
Maiorca et al. [4] introduced PDF Malware Slayer (PDFMS), a static tool that analyzes the structure of PDF files
through keywords and their occurrences. They evaluated Naive Bayes, SVM, J48 and Random Forests classifiers on
their test set; Random Forests provided the highest accuracy. However, the approach has some structural
weaknesses.
Analyzing the structure of the PDF, instead of looking for specific content, provided a higher detection rate.
However, later work by Maiorca et al. [10] showed that such detectors may be bypassed because of the complexity
of the parsing mechanism. Owing to these structural weaknesses, work focused again on the analysis of malicious
JavaScript code. Corona et al. [11] presented Lux0R, "Lux 0n discriminant References", a new approach to
malicious JavaScript code detection that characterizes JavaScript code by its API references. Liu et al. [16]
introduced a context-aware approach for detecting malicious JavaScript in PDF based on static document
instrumentation and runtime behavior monitoring.

3 Materials and Methods

In this section, we explain a machine learning method based on static analysis in which we combine a PDF
structure feature vector with a JavaScript feature vector, extracted from the PDF file structure and from the
JavaScript code embedded within the PDF file, respectively. Our system architecture is shown in Figure 2.

3.1 Dataset Used

We collected a dataset of both malicious and benign PDF files from real, up-to-date samples: around 4807
malicious files and 3745 benign files. The malicious PDF samples were collected from Contagiodump [9], a popular
repository that contains information about trending vulnerabilities and attacks in PDF files. The benign PDF
samples were collected through the Yahoo search engine API. Collecting data from source websites gives no
assurance that every file is benign, and the presence of malicious files in the benign dataset would distort the results
of the designed experiments. To reduce this risk, the whole benign dataset was scanned with antivirus software.

3.2 Features Extraction

To extract features, we developed a parser that builds on the Origami tool. This tool performs a deep scan of PDF
files to extract features that are most often used by attackers to hide malicious properties. We adopted this tool
because it provides more reliable feature extraction than others, such as PdfID [17], which simply analyzes the PDF
file without considering its logical properties and may therefore give attackers a good opportunity to perform easy
manipulations.

[Figure 2 diagram. Components shown: Malicious PDF Files and Benign PDF Files; Parser (PDF Origami); Feature
extractor (Structure Analysis, EJSA*); Feature Vector; Classifier with ensemble methods (AdaBoostM1, Bagging,
Stacking).]
* EJSA: Embedded JavaScript Analysis

Fig. 2. Architecture of our system.

For the extraction of features, we analyze each PDF file in the following two ways:


1) Structure Analysis

In this phase, the parser analyzes the structure of the PDF file and searches for features that are significant for
labeling the file as malicious. This yields a set of features and their occurrence counts.
Based on previous research [18, 19], the following features can be suspicious and are the ones most often used by
attackers.
• JavaScript: JavaScript code can be embedded directly into an object within the PDF. Most malicious PDFs
use JavaScript to exploit JavaScript vulnerabilities or to create heap sprays. The /JS and /JavaScript
keywords indicate the use of JavaScript in the PDF.
• Actions: A number of features, such as /GoTo, /GoToR and /GoToE, specify an action to be performed, for
example activating a hypertext link.
• Triggers: Attackers can use a number of different triggers to execute the harmful content within a document.
An action is a common triggering mechanism; it is set through the OpenAction key in the root object of the
PDF file. The object that OpenAction points to may be part of the attack.
• Launch: A document can launch an application to open or print a file, in order to manage OS-specific events.
This feature may be misused by an attacker to steal an organization's confidential data whenever someone
there opens the suspicious PDF file.
• Form Action: PDF readers allow the /SubmitForm action from client to server. To take advantage of
weaknesses in the victim's browser, this action sends a request to a malicious site, which is then shown
automatically in the victim's browser and can cause it to malfunction.
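
As a rough illustration of how the keywords above turn into a numeric feature vector, the sketch below counts their occurrences in the raw bytes of a file. Everything in it is an assumption made for the example: the KEYWORDS selection, the count_keywords() name and the sample bytes are ours. Counting raw bytes does not decode compressed streams or escaped names and ignores the logical structure of the document, so it is much closer to a plain keyword scan than to the Origami-based parser described in this section.

```python
import re
from collections import Counter

# Keywords named in the list above; the exact selection is illustrative.
KEYWORDS = ["/JS", "/JavaScript", "/GoTo", "/GoToR", "/GoToE",
            "/OpenAction", "/Launch", "/SubmitForm"]


def count_keywords(data: bytes) -> Counter:
    """Count each keyword as a whole PDF name in the raw file bytes."""
    counts = Counter()
    for kw in KEYWORDS:
        # The lookahead stops "/GoTo" from also matching inside "/GoToR".
        pattern = re.escape(kw.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[kw] = len(re.findall(pattern, data))
    return counts


# Hypothetical document fragment, written by hand for this example.
sample = (
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"3 0 obj\n<< /S /GoToR /F (other.pdf) >>\nendobj\n"
)
vector = count_keywords(sample)
print([vector[kw] for kw in KEYWORDS])
```

The list printed at the end is the structure feature vector for the sample, with one count per keyword in a fixed order.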

2) JavaScript Code Analysis

Our parser extracts the objects that contain JavaScript from the body of the PDF file. It then extracts the embedded
JavaScript code and searches it for features that are often involved in carrying out an attack. Based on previous
studies [6, 19, 20], we use the following set of features in our system.
• eval_length: the length of the longest string passed to an eval() function call. Malicious scripts use eval() to
interpret code dynamically.
• max_string: the length of the longest string. The strings that malware writers use for shellcode are very long
compared to the strings used in legitimate JavaScript.
• stringcount: the number of strings defined in the script. To obfuscate a script, malware writers break strings
into many small strings.
• replace: the number of uses of the JavaScript replace() function, which is often used to obfuscate JavaScript
code in malicious scripts.
• substring: the number of uses of the JavaScript substring() function, which is mostly used to obfuscate
JavaScript code.
• eval: the number of uses of the JavaScript eval() function, which malicious scripts call to interpret JavaScript
code dynamically.
• fromCharCode: converts Unicode values to characters. It is mostly used to obfuscate code.
• setTimeout(): can be used in place of eval() to run arbitrary JavaScript code after a given timeout.
• document.write and document.createElement: indicate the use of dynamic code execution.
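
The features in this list can be approximated with regular expressions over the extracted script text. The sketch below is a simplified stand-in, not the extractor used in our system: the js_features() name, the regular expressions and the sample script are assumptions for illustration, eval_length only considers string literals passed directly to eval(), and calls are counted wherever they appear, including inside string literals.

```python
import re

# Double- and single-quoted JavaScript string literals (escapes allowed).
DQ = r'"(?:\\.|[^"\\])*"'
SQ = r"'(?:\\.|[^'\\])*'"
STRING_LITERAL = re.compile(DQ + "|" + SQ)
EVAL_ARG = re.compile(r"\beval\s*\(\s*(" + DQ + "|" + SQ + ")")


def js_features(code: str) -> dict:
    """Return a dictionary of simple lexical features for one script."""
    strings = [m.group(0)[1:-1] for m in STRING_LITERAL.finditer(code)]

    def calls(name: str) -> int:
        # Number of call sites of the form name( ... ).
        return len(re.findall(r"\b" + re.escape(name) + r"\s*\(", code))

    return {
        "eval_length": max((len(m.group(1)) - 2 for m in EVAL_ARG.finditer(code)),
                           default=0),
        "max_string": max((len(s) for s in strings), default=0),
        "stringcount": len(strings),
        "replace": calls("replace"),
        "substring": calls("substring"),
        "eval": calls("eval"),
        "fromCharCode": calls("fromCharCode"),
        "setTimeout": calls("setTimeout"),
        "document.write": calls("document.write"),
        "document.createElement": calls("document.createElement"),
    }


# Hypothetical script, written by hand for this example.
sample = (
    'var s = "AAAA" + "BB";\n'
    "var t = s.replace(/A/g, 'C').substring(0, 3);\n"
    'eval("app.alert(1)");\n'
    "var u = String.fromCharCode(104, 105);\n"
)
features = js_features(sample)
print(features)
```

A real extractor would tokenize the script instead of pattern-matching it, which matters precisely for the obfuscated code these features are meant to flag.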

3.3 Classification

To classify PDF files, the extracted features are passed to a classifier, which can be built with any learning
algorithm. Previous researchers have combined the predictions of multiple learners to produce better results than
any individual learning algorithm could [8]. Accordingly, we tested ensemble methods such as Adaptive Boosting
(AdaBoostM1), Bagging and Stacking [8]. These algorithms combine weak classification tree models, each with a
particular weight, to create a stronger and more precise classifier. As the weak model we use a simple decision tree
(J48; a supervised learning approach, Quinlan, 1996), because an ensemble of trees is more robust than a single
tree. In addition, we give exhaustive experimental evidence to determine which ensemble method best improves
accuracy on our dataset.
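
To make the ensemble idea concrete, the following sketch implements bagging (bootstrap aggregating) in plain Python. It is a minimal illustration under several assumptions: one-level decision stumps are swapped in for the J48 trees used in our experiments, the votes are unweighted, and the eight training rows (columns: a /JS count, an /OpenAction count, an eval count) are invented toy data, not samples from our dataset. The function names are ours.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Pick the (feature, threshold, direction) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):  # label predicted when the feature value exceeds t
                errors = sum(
                    (above if r[f] > t else 1 - above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda r: above if r[f] > t else 1 - above


def fit_bagging(rows, labels, n_models=15, seed=0):
    """Train each stump on a bootstrap sample and combine them by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return lambda r: Counter(m(r) for m in models).most_common(1)[0][0]


# Toy data (hypothetical): 1 = malicious, 0 = benign.
ROWS = [[0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 0],
        [2, 1, 3], [1, 1, 1], [3, 1, 2], [1, 1, 4]]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

predict = fit_bagging(ROWS, LABELS)
print(predict([3, 1, 5]), predict([0, 0, 0]))
```

Boosting differs in that each new model is trained with more weight on the examples the previous models misclassified, and stacking trains a separate meta-classifier on the base models' outputs instead of taking a vote.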
4 Results and Discussion

In this section, we report two experiments. The first demonstrates the feature extraction process. The second
provides experimental evidence as to which classification method best improves detection accuracy. To do this, the
PDF structure features alone were first run through the different classifiers. Then we measured how accuracy
improved when the JavaScript features were combined with the structure features. Furthermore, we compare the
performance of the proposed method with previously developed tools for malicious PDF detection.

4.1 Experiment 1: Features Extraction

The goal of this experiment is to extract the feature vector from each PDF file. The Origami tool performs a deep
scan of PDF files to extract features that are often used by miscreants. After scanning every PDF file in the
malicious and benign datasets one by one, we obtained the results shown in Figure 3.

Fig. 3. Structure based features extraction result

After completing the structure feature vector extraction, we found that a large number of the malicious PDF files
used JavaScript to perform malicious actions. In our dataset, around 92.3% of the malicious samples contained
JavaScript. We therefore extracted JavaScript features with the Origami tool. The results are shown in Figure 4.

Fig. 4. JavaScript based features extraction result


4.2 Experiment 2: Detection accuracy

Our tests were conducted with AdaBoostM1 (used as a boosting ensemble), Bagging (used as a bagging ensemble)
and Stacking with two learning algorithms (J48 and IBk, with Logistic Regression as the meta-classifier), using
10-fold cross-validation repeated 10 times. We report our results as confusion matrices (the numbers of benign and
malicious files classified correctly and incorrectly).
First, the structure feature vector dataset was run through the different classifiers. This gives the following results:

Table 1. Results with structure features

                      AdaBoostM1    Bagging     Stacking
True Positives        4498          4471        4493
False positives       309           336         314
True negatives        2990          2936        2976
False negatives       755           809         767
TP Rate               0.876         0.866       0.873
FP Rate               0.141         0.152       0.144
ROC Area              0.945         0.934       0.940
Detection Accuracy    87.5584%      86.6113%    87.3363%

Next, we tested how well the complete feature vector dataset (structure features and JavaScript features)
performed at the classification task. The results are shown in Table 2.

Table 2. Results with complete features (structure features and JavaScript based features)

                      AdaBoostM1    Bagging     Stacking
True Positives        4753          4742        4744
False positives       54            65          63
True negatives        3666          3603        3670
False negatives       79            142         75
TP Rate               0.984         0.976       0.984
FP Rate               0.017         0.0287      0.017
ROC Area              0.998         0.993       0.995
Detection Accuracy    98.4448%      97.5795%    98.3863%

As we can see, combining the structure feature vector with the JavaScript feature vector gives better detection
accuracy than the structure features alone.
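
The summary rows of Tables 1 and 2 can be recomputed from the four counts. The sketch below does this for the AdaBoostM1 columns. It rests on one assumption about how to read the rows: in those columns the first two counts sum to 4807 (the number of malicious files) and the last two to 3745 (the number of benign files), so the code treats them as the correctly and incorrectly classified files of each actual class, and it averages the per-class rates weighted by class size. The summarize() name is ours.

```python
def summarize(tp, fp, tn, fn):
    """Accuracy plus class-size-weighted TP and FP rates.

    Assumption: tp and fp are the correctly and incorrectly classified
    malicious files, tn and fn the correctly and incorrectly classified
    benign files, so tp + fp and tn + fn are the two class sizes.
    """
    n_mal, n_ben = tp + fp, tn + fn
    total = n_mal + n_ben
    accuracy = (tp + tn) / total
    # Per-class TP rates (tp / n_mal and tn / n_ben), weighted by class size.
    tp_rate = (n_mal * (tp / n_mal) + n_ben * (tn / n_ben)) / total
    # Per-class FP rates: benign files flagged as malicious over all benign
    # files, and malicious files passed as benign over all malicious files.
    fp_rate = (n_mal * (fn / n_ben) + n_ben * (fp / n_mal)) / total
    return {"accuracy": accuracy, "tp_rate": tp_rate, "fp_rate": fp_rate}


structure_only = summarize(4498, 309, 2990, 755)   # AdaBoostM1, Table 1
combined = summarize(4753, 54, 3666, 79)           # AdaBoostM1, Table 2

for name, result in (("structure only", structure_only), ("combined", combined)):
    print(name, {k: round(v, 4) for k, v in result.items()})
```

Under this reading the weighted TP rate equals the detection accuracy, which is why the two rows agree after rounding in both tables.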
To put the proposed method in context, we compare it with previously developed tools for malicious PDF
detection: Wepawet, PDFMS, PJScan, MDScan and PDF Scrutinizer. The results are shown in Table 3. For each
method, we show the true positive rate (TPR) and the false positive rate (FPR). In terms of TPR, our system
outperforms Wepawet, PJScan, MDScan and PDF Scrutinizer.

Table 3. Comparison between proposed method with other tools.

System              TPR       FPR
Proposed Method     0.984     0.017
Wepawet             0.8892    0.032
PJScan              0.7194    0.011
MDScan              0.8934    0
PDFMS               0.9955    0.0251
PDF Scrutinizer     0.9       0

PJScan, MDScan and PDF Scrutinizer show the smallest FPR, but their detection rates are considerably lower than
those of the proposed method and PDFMS. PDFMS shows the highest TPR but a higher FPR than the proposed
method. The proposed method also performs better than Wepawet in terms of both TPR and FPR. Overall, these
results indicate that the proposed method offers a better trade-off between TPR and FPR than these tools.
5 Conclusions

In the past few years, malicious PDF files have become one of the most critical threats and a very effective attack
vector for malware writers. In this paper, we have proposed a method that uses machine learning techniques for
malicious PDF file detection. Instead of relying only on the structural properties of the PDF file, we also used
JavaScript-based features to improve detection accuracy. In addition, we showed experimental evidence as to which
learning algorithm best improves detection accuracy. Finally, we compared our method with other academic tools;
its high detection accuracy shows that it compares favorably with them.

References

[1] Adobe, "PDF reference, Adobe portable document format version 1.7" (2006).
[2] Symantec, "Malware security report: protecting your business, customers, and the bottom line", Symantec
(2010).
[3] Filiol, E., Blonce, A. and Frayssignes, L. Portable document format (PDF) security analysis and malware
threats. J. Comput. Virol., pp. 75-86 (2007).
[4] Maiorca, D., Giacinto, G. and Corona, I. A pattern recognition system for malicious PDF files detection.
In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 510-524)
(2012).
[5] Esparza, J. M. Obfuscation and (non-)detection of malicious PDF files. In S21Sec e-crime (2011).
[6] Laskov, P. and Šrndić, N. Static detection of malicious JavaScript-bearing PDF documents. In Proceedings of
the 27th Annual Computer Security Applications Conference, pp. 373-382, December (2011).
[7] Tzermias, Z., Sykiotakis, G., Polychronakis, M. and Markatos, E.P. Combining static and dynamic analysis for
the detection of malicious documents. In Proceedings of the Fourth European Workshop on System
Security (p. 4) (2011).
[8] Tiwari, A. and Prakash, A. Improving Classification of J48 Algorithm Using Bagging, Boosting and Blending
Ensemble Methods on SONAR Dataset Using Weka. International Journal of Engineering and Technical
Research, 2, pp. 207-209 (2014).
[9] Mila, "Contagio Malware dump". [Online]. Available:
http://contagiodump.blogspot.in/2010/08/Malicious-documents-archive-for.html. [Accessed 10 October 2014].
[10] Maiorca, D., Corona, I. and Giacinto, G. Looking at the bag is not enough to find the bomb: an evasion of
structural methods for malicious PDF files detection. In Proceedings of the 8th ACM SIGSAC Symposium on
Information, Computer and Communications Security (pp. 119-130) (2013).
[11] Corona, I., Maiorca, D., Ariu, D. and Giacinto, G. Lux0R: Detection of malicious PDF-embedded JavaScript
code through discriminant analysis of API references. In Proceedings of the 2014 Workshop on Artificial
Intelligent and Security Workshop (pp. 47-57). ACM, November (2014).
[12] Li, W.-J., Stolfo, S., Stavrou, A., Androulaki, E. and Keromytis, A.D. A study of malcode-bearing
documents. In Proc. of the 4th Int. Conf. on Detection of Intrusions and Malware, and Vulnerability
Assessment (2007).
[13] Shafiq, M.Z., Khayam, S.A. and Farooq, M. Embedded malware detection using Markov n-grams.
In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment
(pp. 88-107). Springer Berlin Heidelberg, July (2008).
[14] Snow, K.Z., Krishnan, S., Monrose, F. and Provos, N. SHELLOS: Enabling Fast Detection and
Forensic Analysis of Code Injection Attacks. In USENIX Security Symposium (pp. 183-200), August (2011).
[15] Schmitt, F., Gassen, J. and Gerhards-Padilla, E. PDF SCRUTINIZER: Detecting JavaScript-based attacks in
PDF documents. In Privacy, Security and Trust (PST), Tenth Annual International Conference on
(pp. 104-111). IEEE, July (2012).
[16] Liu, D., Wang, H. and Stavrou, A. Detecting malicious JavaScript in PDF through document instrumentation.
In Dependable Systems and Networks (DSN), 44th IFIP International Conference on (pp. 100-111). IEEE
(2014).
[17] Stevens, D., "PDF Tool". [Online]. Available: http://blog.didierstevens.com/programs/pdf-tools/.
[18] Stevens, D., "Malicious PDF analysis ebook". [Online]. Available:
http://didierstevens.com/files/data/malicious-pdf-analysis-ebook.zip, Sept 2010. [Accessed 22 September 2015].
[19] Kittilsen, J., "Detecting malicious PDF documents", Master Thesis, Gjovik, Norway, pp. 1-112, December
(2011).
[20] Cova, M., Kruegel, C. and Vigna, G. Detection and Analysis of Drive-by-Download Attacks and Malicious
JavaScript Code. In Proceedings of International Conference on World Wide Web, pp. 281-290, July (2010).
