Forensic Analysis of Residual Information in Adobe PDF Files
Sangjin Lee
Center for Information Security Technologies
Korea University
Seoul, Korea
Abstract

In recent years, electronic files have come to contain personal records and records of business activities, so they can serve as important evidence in a digital forensic investigation. In general, investigations of document files rely on the data that can be viewed with the file's own application program. However, PDF files, which are now in wide use, can unintentionally retain certain data, including the content as it was before a modification. Because such residual information may reveal how a file was written, it is useful from a forensic viewpoint.

This paper explains why residual information is stored inside a PDF file and describes a way to extract it. In addition, we demonstrate that the attributes of PDF files can be used to hide data.

Keywords: PDF, Residual Information, Data hiding, Information leakage, Digital evidence

1. Introduction

People create electronic documents with various application programs, such as Microsoft Office and Adobe Acrobat. Although document files are in widespread use, few people recognize that ‘hidden’ data exist in them. We call these data ‘hidden’ because they cannot be identified using the file's own application program.

J. Park et al. introduced ‘hidden’ data in MS PowerPoint files. Even after the contents of an MS PowerPoint file have been deleted or edited, they can still exist inside the file as residual information. This is because of the file-saving algorithm used in this application[2]. Similar to MS PowerPoint files, Adobe PDF (Portable Document Format) files can also contain ‘hidden’ data. Therefore, it is necessary to investigate such ‘hidden’ data from a digital forensic viewpoint.

Adobe PDF files are widely used for various purposes, such as writing personal documents and distributing official documents in enterprises. In particular, enterprises have created confidential documents with applications like Microsoft Word or PowerPoint and then distributed them after converting them into
Adobe PDF files. In the past, PDF files were used only for distribution. However, the content of PDF files can now be edited using various PDF editors.

The following is a hypothetical case related to Adobe PDF files. Company “Y” was recruiting a new business partner for new technologies, and company “A” and company “B” were both on the shortlist. Company “B” bought off a core member of company “A” in order to modify the final proposal and become the business partner of company “Y”. The modified file was submitted, and company “B” was selected as the partner. Some three years later, suspicious elements were detected in the competitive selection process, and an investigation was undertaken. After such a long time, the only evidence was the Adobe PDF file created for the final proposal.

In this case, forensic examiners were able to find evidence in the Adobe PDF file. This is because contents that have been deleted or edited can remain inside the file as residual information due to the file-update mechanism.

This study consists of seven sections. Section 2 introduces existing studies on Adobe PDF files. Section 3 describes the internal structure of Adobe PDF files. Section 4 explains why PDF files can include residual information. Section 5 proposes a way to extract the residual information. Section 6 describes a data hiding method that uses the PDF update mechanism. Finally, Section 7 presents the conclusion of this study.

2. Related studies

[...] specific ASCII, Unicode text for digital forensic investigation[5].

Also, Matthew Ryan Davis developed a tool, pdfresurrect, which shows the revision number and time of a PDF file. This tool allows investigators to confirm the revision history. However, since it cannot verify what data was modified, its usefulness for digital forensic purposes is limited[1].

Furthermore, this study examines the unused area in PDF documents from the viewpoint of digital forensics. This paper introduces a tool for analyzing PDF files. It was developed with the special function of extracting text data (both ASCII and Unicode) at each revision, which makes it easy to compare a modified version with the original version. Therefore, the result of this paper is useful for investigating Adobe PDF files. This paper also explains that the unused area in PDF files can be used for data hiding.

3. Internal structure of PDF files

As shown in Figure 1, a PDF file consists of a Header, a Body, a Cross-reference table (hereinafter called xref), and a Trailer[3].

[Figure 1: Header, Body, Cross-reference table (xref), ...]
Figure 7. Size and contents of original file, modified file, and recovered file

1) Recovering the file to the original version

In Figure 6, the first page of the PDF file after the modification uses the 108th object (byte offset 119465) in the ‘added block’. Also, the second page uses the 1st object (byte offset 114777) in the [...]

[Figure fragment: labels “Not Referenced Object” and “Original Block”; file beginning with %PDF-1.x; objects 106, 127, 126, 107, 108 (Page1), ..., 125, 1 (Page2); xref section “106 22” with entries 0000000016 00000 n, 0000001206 00000 n, 0000001484 00000 n, ...; trailer; xref section “0 106” with first entry 0000000000 65535 f]
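
As a rough illustration of recovering an earlier version, the sketch below uses a simpler technique than the byte-offset method of this study: it cuts the file after each `%%EOF` marker. This assumes a non-linearized file in which every save appended a new block ending in `%%EOF`. The name `split_revisions` is hypothetical.

```python
def split_revisions(data: bytes) -> list[bytes]:
    """Return one candidate file per %%EOF marker, oldest first.

    Each candidate is the prefix of `data` up to and including that marker
    and the line ending that follows it.
    """
    marker = b"%%EOF"
    revisions = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        # Keep the end-of-line bytes that follow the marker, if any.
        while end < len(data) and data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        revisions.append(data[:end])
        pos = data.find(marker, end)
    return revisions
```

The last element equals the whole file when the file ends with `%%EOF`. Earlier elements are candidates for previous versions and should be checked by opening them in a viewer.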
In the case of text data, there are two types: ASCII and Unicode. The text data can be extracted by decompressing (or decoding) the data of a target object. Even after the decompressed (or decoded) data is acquired, it is difficult to extract the exact text, because a PDF file saves the text together with information such as position and shape. An additional parsing step is therefore necessary.

Also, the images in the content can be extracted directly because images are stored in their own image formats.

3) ExPRI (Extractor for PDF Residual Information)

To investigate the forensic attributes of Adobe PDF files, ExPRI (Extractor for PDF Residual Information) has been developed. It is difficult for an investigator to change byte offsets manually using the methodology in Figure 8.

ExPRI is able to recover previous versions of a file and to extract the text at each saving point. The left side of Figure 9 shows the result of extracting the contents of the modified PDF file, and the right side shows the result of extracting the contents of the recovered PDF file. The modified and recovered files in Figure 7 demonstrate that ExPRI extracts text correctly.

[...] inside the file, it is not used. This area can be used to hide data. Even though ‘hidden data’ is stored in a PDF file, the data cannot be identified using a PDF file viewer. Since the data is compressed (or encoded), it is not possible to verify such data using a simple method like Strings. Therefore, this area can be used as a simple way to hide large amounts of data.

[Figure 10: original file (size: 111KB) and modified file (size: 193KB). Both begin with %PDF-1.x and contain objects 106, 127, 126, 107, 108 (Page1), ..., 125, 1 (Page2), ..., 6 (Page3), ..., 23; the modified file additionally contains objects 177, 1 (Page2), 22, 23, 108 (Page1), .... Labels: Hidden data, Valid data.]

Figure 10. Hidden and valid data in the modified file
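
Both the text-extraction step and the remark about Strings depend on stream decoding. The sketch below illustrates the idea and is not ExPRI itself. It assumes Flate-compressed (zlib) streams and simple `(...) Tj` text operators without escaped parentheses; real files also use other filters, the `TJ` operator, and Unicode strings. The name `extract_text` is hypothetical.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string shown with the Tj operator, e.g. "(Hello) Tj".
TEXT_RE = re.compile(rb"\((.*?)\)\s*Tj")

def extract_text(pdf_bytes: bytes) -> list[str]:
    """Decompress every Flate stream and collect the strings shown with Tj."""
    texts = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            content = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (for example an image); skip it
        for literal in TEXT_RE.findall(content):
            texts.append(literal.decode("latin-1"))
    return texts

if __name__ == "__main__":
    page = b"BT /F1 12 Tf 72 720 Td (secret draft) Tj 0 -14 Td (secret draft two) Tj ET"
    pdf = (b"%PDF-1.4\n4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
           + zlib.compress(page) + b"\nendstream\nendobj\n%%EOF\n")
    print(b"secret draft" in pdf)  # a Strings-style byte search of the raw file
    print(extract_text(pdf))
```

The position operators (`Td`, `Tf`) surrounding each string are the extra information that makes further parsing necessary to rebuild the exact text.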
Figure 11. Data hiding technique 2

Compared with the method proposed by Shangping Zhong et al. [4], the two methods presented above have the advantage of being able to hide large amounts of data.

References

[1] Matthew Ryan Davis, “Faith in the Format: Unintentional Data Hiding in PDFs,” 757Labs.com, 2008.
[2] Jungheum Park, Sangjin Lee, “Forensic investigation of Microsoft PowerPoint files,” Digital Investigation, Vol. 6, pp. 16-24, 2009.
[3] Adobe Systems, “Document management - Portable document format - Part 1: PDF 1.7, First Edition,” 2008.
[4] Shangping Zhong, Xueqi Cheng, Tierui Chen, “Data Hiding in a Kind of PDF Texts for Secret Communication,” International Journal of Network Security, Vol. 4, No. 1, pp. 17-26, 2007.
[5] Didier Stevens, “Solving a Little PDF Puzzle,” URL: http://blog.didierstevens.com/2008/05/07/solving-a-little-pdf-puzzle/, 2008.

7. Conclusion

In digital forensic investigations, there are cases in which document files are analyzed as evidence. Because research on electronic document forensics has so far been insufficient, analyses have focused on the contents that can be easily verified by using specific applications. However, this approach is insufficient for analyzing PDF files because there are ‘hidden data’ inside them. It is therefore expected that the method introduced in this study will be useful for investigating Adobe PDF files. This study will also help to improve the admissibility of electronic documentary evidence.

In this study, we analyzed the structure and attributes of Adobe PDF files in a digital forensic