
Forensic Analysis of Residual Information in Adobe PDF Files

This document summarizes a study on analyzing residual information in Adobe PDF files for forensic purposes. PDF files can contain residual information from past edits and versions due to the file updating mechanism. The study presents a tool to extract residual text data, including ASCII and Unicode text, from different versions of PDF files. This allows comparison of modified and original versions. The study also notes PDF files can be used to hide data in unused areas and that residual information may provide evidence in digital investigations.


Forensic Analysis of Residual Information in Adobe PDF Files

Hyunji Chung, Jungheum Park, Sangjin Lee
Center for Information Security Technologies, Korea University, Seoul, Korea

Abstract

As electronic files increasingly record personal and business activities, they can serve as important evidence in digital forensic investigations. Investigations of document files generally rely on the data that can be viewed with the file's own application program. However, a PDF file, a format in wide use today, can unintentionally retain additional data, including content as it was before modification. Because such residual information may reveal how a file was written, it is useful from a forensic viewpoint.

This paper explains why residual information is stored inside a PDF file and describes a way to extract it. In addition, we demonstrate that the attributes of PDF files can be used to hide data.

Keywords: PDF, Residual Information, Data hiding, Information leakage, Digital evidence

1. Introduction

People create electronic documents using various application programs, such as Microsoft Office and Adobe Acrobat. Although document files are in widespread use, few people recognize that ‘hidden’ data exist in them. We call these data ‘hidden’ because they cannot be identified using the file's own application program.

J. Park et al. introduced ‘hidden’ data in MS PowerPoint files. Even after the contents of an MS PowerPoint file have been deleted or edited, they can still exist inside the file as residual information, because of the file-saving algorithm used by the application[2]. Like MS PowerPoint files, Adobe PDF (Portable Document Format) files can also contain ‘hidden’ data. It is therefore necessary to investigate such ‘hidden’ data from a digital forensic viewpoint.

Adobe PDF files are widely used for purposes such as writing personal documents and distributing official documents in enterprises. In enterprises in particular, confidential documents are often created with applications like Microsoft Word or PowerPoint and distributed after conversion to Adobe PDF files. In the past, PDF was only a format for distribution. Today, however, the content of PDF files can be edited with various PDF editors.

The following is a hypothetical case related to Adobe PDF files. Company “Y” sought a new business partner for new technologies, and company “A” and company “B” were both on the shortlist. Company “B” bought off a core member of company “A” in order to modify the final proposal and become the business partner of company “Y”. The modified file was submitted, and company “B” was selected as the partner. Some three years later, suspicious elements were detected in the selection process, and an investigation was undertaken. After such a long time, the only evidence was the Adobe PDF file created for the final proposal.

In this case, forensic examiners were able to find evidence in the Adobe PDF file, because contents that have been deleted or edited can remain inside the file as residual information due to the file-update mechanism.

This study consists of seven sections. Section 2 introduces existing studies on Adobe PDF files. Section 3 describes the internal structure of Adobe PDF files. Section 4 explains why PDF files can include residual information. Section 5 proposes a way to extract the residual information. Section 6 describes a data hiding method that uses the PDF update mechanism. Finally, Section 7 concludes the study.

2. Related studies

In 2008, Didier Stevens and Matthew Ryan Davis showed that updated PDF files can include a revision history because of the update mechanism of Adobe PDF. That is, they explained that the revision history of a PDF file can be recovered when the file is modified and saved using ‘Save’ instead of ‘Save As’[1][5].

Didier developed tools such as pdf-parser.py and pdfid.py for the analysis of PDF files. These tools help examiners identify the internal structure of PDF files. However, it is difficult to identify specific ASCII or Unicode text with them for digital forensic investigation[5].

Matthew developed a tool, pdfresurrect, which shows the revision number and time of a PDF file and lets investigators confirm the revision history. However, since it cannot show what data was modified, its use for digital forensic purposes is limited[1].

This study also examines the unused area in PDF documents from the viewpoint of digital forensics. It introduces a tool for analyzing PDF files, developed specifically to extract text data (both ASCII and Unicode) at each revision, which makes it easy to compare a modified version with the original version. The results of this paper are therefore useful for investigating Adobe PDF files. This paper also explains that the unused area in PDF files can be used for data hiding.

3. Internal structure of PDF files

As shown in Figure 1, a PDF file consists of a Header, Body, Cross-reference table (hereinafter called xref), and Trailer[3].

    Header
    Body
    Cross-reference table (xref)
    Trailer

Figure 1. Basic Structure of PDF files

The first line of a PDF file is the header, which contains the version number. The trailer is at the end of a PDF file and holds the byte offset of the last xref[3].
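As a minimal sketch of how the header and trailer can be read, the Python fragment below takes the version from the `%PDF-` line and the number that follows the last `startxref` keyword. The function name and the tiny sample file are illustrative assumptions, not part of the tool described in this paper; the `startxref` and `%%EOF` keywords are those defined in the PDF specification[3].

```python
import re


def read_header_and_startxref(data: bytes):
    """Return (version, byte offset of the last xref) from raw PDF bytes."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    if header is None:
        raise ValueError("not a PDF: missing %PDF- header")
    # Each trailer ends with: startxref <offset> %%EOF. The last one wins.
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not offsets:
        raise ValueError("no startxref/%%EOF pair found")
    return header.group(1).decode("ascii"), int(offsets[-1])


# Hand-made one-object file (hypothetical sample); its xref starts at byte 45.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n45\n%%EOF\n"
)

print(read_header_and_startxref(sample))  # ('1.4', 45)
```

Reading the offset from the end of the file first, and only then jumping to the xref, mirrors the order in which a viewer locates objects.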
    xref
    106 22
    0000000016 00000 n    (106 0 obj)
    0000001206 00000 n    (107 0 obj)
    0000001484 00000 n    (108 0 obj)
    ...
    0000000750 00000 n    (127 0 obj)

Figure 2. Example of cross-reference table (xref)

The xref contains information about all objects in a PDF file. An example is shown in Figure 2. A file can have several xrefs, but Figure 2 shows only one of them. The line “106 22” means that the section covers twenty-two objects starting from object 106. In each following line, the first ten digits give the position of an object as a byte offset; the first such line describes object 106. The next five digits are the generation number, which is ‘00000’ when the object is first generated. The last field is ‘n’ or ‘f’, where ‘n’ marks an object that is in use and ‘f’ marks an object that is not used (i.e., free)[3].

A PDF file contains many objects, which are logically organized as a tree. In Figure 3, /Root, /Pages, /Kids, /Page and /Contents are the types of objects.

    /Root
      107 0 R -> /Pages
        21 0 R -> /Kids
          /Page 1 0 R   -> /Contents 3 0 R
          /Page 6 0 R   -> /Contents 7 0 R
          /Page 108 0 R -> /Contents 116 0 R

Figure 3. Internal structure of a PDF file

Although there are other types in addition to those presented in Figure 3, this study refers to the main types only. In Figure 3, the root of the internal structure tree is the /Root type. /Root contains an indirect reference, “107 0 R”. The xref in Figure 2 includes the byte offset of “107 0 obj”: the /Pages type exists at byte offset 1206.

/Pages in turn has an indirect reference, “21 0 R”. “21 0 obj” is of the /Kids type, and its position can be found through the proper xref. /Kids consists of one or more indirect /Page references. Because each /Page corresponds to one page of the file, /Kids shows how many pages the PDF file has. Figure 3 is an example of a PDF file that has three pages. In Figure 3, one of the /Page types has an indirect reference, “1 0 R”. “1 0 obj” is of the /Contents type, and its position can be accessed using the xref. /Contents also has an indirect reference, “3 0 R”. “3 0 obj” is the body of a page, and its data is located through the xref that stores its position as a byte offset. The body data can be in a compressed (or encoded) format, so it must be decompressed (or decoded) to obtain the plain text.

4. Residual information in PDF files

1) Update mechanism of PDF files

Many users now build their electronic documents with application programs like Microsoft Office 2007 and store them as PDF files through a PDF conversion process. Adobe Acrobat 8.0 was used in this experiment. Table 1 shows the experiment procedure.

Table 1. Experiment procedure

    Step1  Create three-page contents (texts and images) using Microsoft Office Word 2007 and store them as a PDF file using the function of “Save as Adobe PDF”.
    Step2  Open the PDF file using Adobe Acrobat.
    Step3  Delete the existing content in the first and second pages of the original PDF file and replace it with different texts and images.
    Step4  Resave the file using the function of “Save” after completing the modification.
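A quick way to check whether a file produced by a procedure like the one in Table 1 was appended to, rather than rewritten, is to count its end-of-file markers: in the PDF format each incremental save normally appends a new trailer that ends in `%%EOF`. The sketch below is only an illustration under that assumption; the function name, the sample bytes, and the offsets 120 and 480 are hypothetical placeholders. Note that a linearized file can contain two markers without ever having been edited, so the count is a hint rather than proof.

```python
import re


def eof_marker_offsets(data: bytes):
    """Return the byte offset of every %%EOF marker in the file."""
    return [m.start() for m in re.finditer(rb"%%EOF", data)]


# Hypothetical stand-ins for a file before and after an incremental save.
original = b"%PDF-1.4\n(original block)\nstartxref\n120\n%%EOF\n"
updated = original + b"(added block)\nstartxref\n480\n%%EOF\n"

print(len(eof_marker_offsets(original)))  # 1
print(len(eof_marker_offsets(updated)))   # 2
```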
    %PDF-1.x
    Original block:
      106 0 obj
      127 0 obj
      126 0 obj
      107 0 obj
      108 0 obj   (Page1)
      ...
      125 0 obj
      1 0 obj     (Page2)
      ...
      6 0 obj     (Page3)
      ...
      23 0 obj
    Added block:
      177 0 obj
      1 0 obj     (Page2)
      22 0 obj
      23 0 obj
      108 0 obj   (Page1)
      ...

Figure 4. Structure after completing the experiment

Figure 4 shows the internal format after finishing the experiment. Repeating Step 3 and Step 4 attaches an ‘added block’ to the ‘original block’. “108 0 obj” contains the contents of the first page, “1 0 obj” those of the second page, and “6 0 obj” those of the third page.

Both “108 0 obj” and “1 0 obj” exist in the ‘added block’, but “6 0 obj” does not, because the third page is not modified. As shown in Figure 5, the modified file is larger than the original one.

    Original PDF file    File size: 111KB
    Modified PDF file    File size: 193KB

Figure 5. Comparison of original file size and modified file size

Residual information is generated when the file is updated, as Figures 4 and 5 show. This update mechanism improves the efficiency of saving a PDF file: it takes less time than a full save of the file. We call the information ‘residual’ because it cannot be identified by a PDF application. However, if a user saves the file using the “Save As” function, the application reconstructs the entire structure.

2) Residual information in PDF files

Figure 6 explains the concept of residual information. In Figure 6, the left side is the xref of the modified file, and the right side is the internal structure. “108 0 obj” and “1 0 obj” exist in both the original and added blocks. The “108 0 obj” in the ‘added block’ is appended after modifying the file. Therefore, the “108 0 obj” in the ‘original block’ exists in the file as residual information.

    xref
    106 22
    0000000016 00000 n
    0000001206 00000 n
    0000001484 00000 n
    ...
    trailer
    ...
    xref
    0 106
    0000000000 65535 f
    0000053665 00000 n
    ...
    0000093361 00000 n
    ...
    0000000000 65535 f
    trailer
    xref
    1 1
    0000114777 00000 n
    108 1
    0000119465 00000 n
    trailer

    (Right side: the block structure of Figure 4, with each object marked as a referenced or a not referenced object.)

Figure 6. State of reference in modified file

As the first and second pages are modified, the objects related to the modification are included in both the ‘original block’ and the ‘added block’. After the modification, the first page does not use object 108 in the ‘original block’ but uses object 108 in the ‘added block’. Objects 108 and 1 in the ‘original block’ are not used by the PDF file viewer.
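The state of reference in Figure 6 can be reproduced with a short Python sketch that reads every xref section in file order and reports the objects whose offset was later replaced. It assumes plain-text xref tables (not cross-reference streams) and treats the last entry for an object number as the one in use, which fits a simple chain of updates like this one. The function names are hypothetical, and the sample keeps only a few entries from Figure 6 (the first subsection is shortened from “106 22” to “106 3”).

```python
import re
from collections import defaultdict


def parse_xref_sections(data: bytes):
    """Return one {object number: byte offset} dict per xref section, in file order."""
    sections = []
    # (?<!start) skips the 'startxref' keyword, which also contains 'xref'.
    for m in re.finditer(rb"(?<!start)xref\s(.*?)trailer", data, re.DOTALL):
        table, obj_num = {}, 0
        for line in m.group(1).splitlines():
            parts = line.split()
            if len(parts) == 2:        # subsection header: first object, count
                obj_num = int(parts[0])
            elif len(parts) == 3:      # entry: offset, generation, n or f
                if parts[2] == b"n":   # keep only objects that are in use
                    table[obj_num] = int(parts[0])
                obj_num += 1
        sections.append(table)
    return sections


def superseded_objects(sections):
    """Return {object number: [older offsets]} for objects redefined later."""
    history = defaultdict(list)
    for table in sections:
        for num, offset in table.items():
            history[num].append(offset)
    return {num: offs[:-1] for num, offs in history.items() if len(offs) > 1}


# Trimmed xref data using the offsets shown in Figure 6.
sample = (
    b"xref\n106 3\n0000000016 00000 n \n0000001206 00000 n \n"
    b"0000001484 00000 n \ntrailer\n"
    b"xref\n0 2\n0000000000 65535 f \n0000053665 00000 n \ntrailer\n"
    b"xref\n1 1\n0000114777 00000 n \n108 1\n0000119465 00000 n \ntrailer\n"
)

print(superseded_objects(parse_xref_sections(sample)))
# {108: [1484], 1: [53665]}
```

The two offsets reported, 1484 and 53665, are the positions of the unreferenced “108 0 obj” and “1 0 obj” in the ‘original block’.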
    Original file     size: 111KB
    Modified file     size: 193KB
    Recovered file    size: 193KB

Figure 7. Size and contents of original file, modified file, and recovered file

5. Extraction of residual information

1) Recovering the file to original version

In Figure 6, the first page of the PDF file after the modification uses object 108 (byte offset 119465) in the ‘added block’. Likewise, the second page uses object 1 (byte offset 114777) in the ‘added block’.

As shown in Figure 8, the contents of the original file can be accessed by changing byte offset 114777 to byte offset 53665 and replacing byte offset 119465 with byte offset 1484. The file is then recovered to its original version, and the contents of the original file can be viewed using a PDF file viewer. Because xref offsets are fixed-width ten-digit fields, the substitution does not change the file size.

Figure 7 shows the contents identified by a PDF file viewer. The recovered file has the same size as the modified file but shows the same contents as the original file.

    xref
    106 22
    0000000016 00000 n
    0000001206 00000 n
    0000001484 00000 n
    ...
    trailer
    xref
    0 106
    0000000000 65535 f
    0000053665 00000 n
    ...
    0000093361 00000 n
    ...
    0000000000 65535 f
    trailer
    xref
    1 1
    0000053665 00000 n
    108 1
    0000001484 00000 n
    trailer

    (Right side: the same block structure as in Figure 6, with each object marked as a referenced or a not referenced object.)

Figure 8. Method of recovering the file

2) Extraction of texts and images

Another method of extracting residual information is to extract text and images directly. Since the rules for storing text and images in PDF files are given in [3], the contents can be extracted directly.

Figure 9. ExPRI (Extractor for PDF Residual Information)
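As a rough sketch of direct extraction (not the ExPRI implementation), the Python fragment below inflates every stream it can decode with zlib and collects the literal strings shown with the `Tj` text operator. It assumes Flate-compressed content streams and ASCII text; escaped parentheses, `TJ` arrays, font encodings, and Unicode text are ignored. The function names and the sample object are hypothetical.

```python
import re
import zlib


def inflate_streams(data: bytes):
    """Return the inflated bytes of every stream that zlib can decode."""
    results = []
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        try:
            # decompressobj() tolerates the end-of-line bytes before 'endstream'.
            results.append(zlib.decompressobj().decompress(m.group(1)))
        except zlib.error:
            continue  # not Flate-encoded, or damaged: skipped in this sketch
    return results


def literal_strings(content: bytes):
    """Collect the (...) operands of Tj operators in a content stream."""
    found = re.findall(rb"\(([^()]*)\)\s*Tj", content)
    return [s.decode("latin-1") for s in found]


# Hypothetical page body, compressed the way a content stream would be.
content = b"BT /F1 12 Tf 72 720 Td (Original price: 100) Tj ET"
pdf_fragment = (
    b"3 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)

texts = [t for s in inflate_streams(pdf_fragment) for t in literal_strings(s)]
print(texts)  # ['Original price: 100']
```

Because the scan walks the raw bytes instead of following the xref, it also reaches streams that no xref entry points to any longer.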

There are two types of text data, ASCII and Unicode. The text data can be extracted by decompressing (or decoding) the data of a target object. Even when the decompressed (or decoded) data is acquired, however, it is difficult to extract the exact text: because the PDF file saves the text together with information such as position and shape, an additional parsing step is necessary.

The images in the content can be extracted directly, because images are stored in their own image formats.

3) ExPRI (Extractor for PDF Residual Information)

To investigate the forensic attributes of Adobe PDF files, ExPRI (Extractor for PDF Residual Information) has been developed, because it is difficult for an investigator to change byte offsets manually using the method in Figure 8.

ExPRI can recover previous files and extract the text at each storing point. The left side of Figure 9 shows the result of extracting the contents of the modified PDF file, and the right side shows the result of extracting the contents of the recovered PDF file. The modified and recovered files in Figure 7 demonstrate that ExPRI extracts text correctly.

6. Data hiding method using file-update mechanism

1) Data hiding technique 1

The section actually displayed is the ‘valid data’ part of the modified file presented in Figure 10. That is, although the ‘hidden data’ area exists inside the file, it is not used. This area can be used to hide data. Even though ‘hidden data’ is stored in a PDF file, the data cannot be identified using a PDF file viewer. Since the data is compressed (or encoded), it cannot be found with a simple method like Strings either. It can therefore serve as a simple way to hide large amounts of data.

    Original file (size: 111KB): %PDF-1.x followed by the original block
      (106, 127, 126, 107, 108 [Page1], ..., 125, 1 [Page2], ..., 6 [Page3], ..., 23).
    Modified file (size: 193KB): the same original block, in which the area
      labeled ‘Hidden data’ lies, followed by the added block
      (177, 1 [Page2], 22, 23, 108 [Page1], ...), labeled ‘Valid data’.

Figure 10. Hidden and valid data in the modified file

2) Data hiding technique 2

Because the ‘hidden data’ area presented in Figure 10 is not used, the area can be used for hiding data. Figure 11 illustrates an example of this approach. When data is hidden using this method, it cannot be found by extracting the residual information. Combined with encryption algorithms, it is also a very strong way to hide data. Thus, it is very difficult to investigate such hidden data.

    (Figure labels: Data Hiding; Intentionally Hidden Data)

Figure 11. Data hiding technique 2

Compared with the method proposed by Shangping Zhong et al. [4], the two methods presented above have the advantage of being able to hide large amounts of data.

7. Conclusion

In digital forensic investigation, there are cases in which document files are analyzed as evidence. Because research on electronic document forensics has so far been insufficient, analyses have focused on the contents that can be easily verified using specific applications. However, this approach is insufficient for analyzing PDF files because there are ‘hidden data’ inside them. The method introduced in this study is therefore expected to be useful for investigating Adobe PDF files. This study will also help to improve the admissibility of electronic documentary evidence.

In this study, we analyzed the structure and attributes of Adobe PDF files from a digital forensic viewpoint. Previous work on a file can be traced using the residual information that remains in the file as a result of the update mechanism of an Adobe PDF editing program. In addition, because the area of residual information is not identified by a PDF file viewer, it can be used as a way of hiding data.

As future work, the tool currently under development will be completed. It can be used as a forensic analysis module for PDF files because it includes previous-file recovery, text and image extraction, and metadata extraction functions for each storing point. In addition, the method of intentionally hiding data in the unused area where residual information is stored will be studied in depth.

References

[1] Matthew Ryan Davis, “Faith in the Format: Unintentional Data Hiding in PDFs,” 757Labs.com, 2008.
[2] Jungheum Park, Sangjin Lee, “Forensic investigation of Microsoft PowerPoint files,” Digital Investigation, Vol 6, p16~24, 2009.
[3] Adobe Systems, “Document management - Portable document format Part 1: PDF 1.7, First Edition,” 2008.
[4] Shangping Zhong, Xueqi Cheng, Tierui Chen, “Data Hiding in a Kind of PDF Texts for Secret Communication,” International Journal of Network Security, Vol 4, No 1, P17-26, 2007.
[5] Didier Stevens, Solving a Little PDF Puzzle, URL: http://blog.didierstevens.com/2008/05/07/solving-a-little-pdf-puzzle/, 2008.
