Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

Embedded Files: Risks, Challenges and Options

Tim Allison, Ph.D.
Data Scientist/Relevance Engineer
Artificial Intelligence, Analytics and Innovative Development Organization (1740)
ITSD
The research was carried out at the NASA (National Aeronautics
and Space Administration) Jet Propulsion Laboratory, California
Institute of Technology under a contract with the Defense
Advanced Research Projects Agency (DARPA) SafeDocs
program. © 2022 California Institute of Technology. Government
sponsorship acknowledged.
About me

• Data scientist (files and search), NASA’s Jet Propulsion Laboratory, California Institute of Technology
• Chair/V.P. Apache Tika
• Committer Apache PDFBox, POI, Lucene/Solr, OpenNLP
• Member Apache Software Foundation
Overview

• Intended audience – technical, with larger implications
• This is a work in progress, please help!

9/22/22 3 jpl.nasa.gov
Takeaways

• Think: every file may have embedded files
• Develop budgets, risk assessments and workflows accordingly

I don’t have all the answers!

Why should you care?

https://twitter.com/WeirdMedieval/status/1532319439684874240
Why should you care?

• Sensitive data, accidental data disclosure, not just in the attachments but in the metadata about the attachments hosted in the parent document
• Accessibility – how do we make these searchable, discoverable and available to users
• Every other digital preservation concern…literally every other digital preservation concern

Diligence Spectrum

• Where are you? Where do you need to be?
• What do the digipres vendors support? What do they need to support?
• Cost/benefit, complexity and budgets

“Hidden data” not covered in this talk

• Encrypted files (without password)
• Hidden sheets/columns/data in spreadsheets
• Track changes/edits and incremental updates (including deletes!)
• Text out of viewing area in PDFs
• Text in notes components of PPT(x)s

“Hidden Data” not covered in this talk

• “ActualText” (in PDF)
• Alternative content
• Font too small/too big or same color as background
• Corrupt/missing Unicode mappings/fonts (in PDF)
• Steganography
• Files embedded in cavities, whether within or outside of the file-specific “parsable” parts

Intentionally malicious files – not covered in this talk

• Malware – Denial of Service, remote code execution, etc.
• Crafted parser differentials

Challenging files also not covered in this talk

• Polyglot and schizophrenic files – files that may be parsed as more than one file type (e.g. a PDF that is also a zip file)
• Quines – zip or gz or other package format that when unpackaged is byte for byte exactly the same file as the original

Refs: Ange Albertini
https://blog.trailofbits.com/2019/11/01/two-new-tools-that-tame-the-treachery-of-files/
Categories of Embedded Files (1 of 2)

• Attachments – something a human or process added to the file as supplementary information that is intended to stand on its own/be easily exported
• Images – intended to be rendered as part of the file
• Thumbnail images
• Macros/code – executable code that is intended to help the functionality (macros in MSOffice and/or javascript in PDF/HTML, etc.)

Categories of Embedded Files (2 of 2)

• Metadata files – XMP, anything else?
• Standalone files that help with the rendering/user experience of a file, e.g. font files, International Color Consortium (ICC) profiles, subtitle streams
• Files that normally don’t exist outside of files – EMF, WMF, MSGraph

Why grep is not sufficient
[Figure: “Hello World” as stored in a PDF, shown uncompressed and as the compressed object stored in the file. In the uncompressed stream the text uses a font-specific encoding: ! -> H, " -> e, # -> l, $ -> o]

Source: “Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction”
https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
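
A minimal Python sketch of the two effects on this slide: a compressed stream and a font-specific glyph encoding each hide “Hello” from a byte-level search. The glyph codes follow the slide; the stream contents and all function names are made up for illustration and are not real PDF parsing.

```python
import zlib

# Glyph codes modeled on the slide: the font maps arbitrary codes
# to letters, so the stream never spells out "Hello".
GLYPH_CODES = {"H": "!", "e": '"', "l": "#", "o": "$"}


def recode(text):
    """Rewrite text using the illustrative font-specific codes."""
    return "".join(GLYPH_CODES.get(ch, ch) for ch in text)


def naive_grep(needle, blob):
    """Byte-level search, which is all grep can do on a binary file."""
    return needle in blob


# Toy content streams (assumption: simplified, not a complete PDF).
plain_stream = b"BT (Hello World) Tj ET\n" * 4
recoded_stream = ("BT (" + recode("Hello") + ") Tj ET\n").encode("ascii")
compressed_stream = zlib.compress(plain_stream)

print(naive_grep(b"Hello", plain_stream))       # found in the raw stream
print(naive_grep(b"Hello", recoded_stream))     # hidden by the glyph codes
print(naive_grep(b"Hello", compressed_stream))  # hidden by compression
```

A real text extractor has to undo both layers (decompress the stream, then map glyph codes back through the font) before any search is meaningful.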

Notes on Some Specific Formats

Hodge podge

• Email can have alternate content (text/html)
• Zips can have free text comments!
• Apple resource forks can have all sorts of things: See Tyler Thorsted’s #iPres2022 talk if you haven’t!
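
As a small illustration of the zip comment point, Python's standard `zipfile` module exposes the archive-level comment as `ZipFile.comment`. The helper names and the sample comment text below are hypothetical.

```python
import io
import zipfile


def build_demo_zip():
    """Create an in-memory zip with an archive-level comment."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("readme.txt", "hello")
        zf.comment = b"internal draft, do not release"
    return buf.getvalue()


def read_zip_comment(data):
    """Return the archive-level comment, which unzipping never writes to disk."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.comment.decode("utf-8", errors="replace")


print(read_zip_comment(build_demo_zip()))
```

Individual entries can carry their own comments as well (`ZipInfo.comment`), so a thorough check would look at both levels.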

HTML
Even HTML can have embedded files!
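
One way HTML can carry a file inline is a base64 `data:` URI. Whether that is the mechanism the original slide illustrated is an assumption. The sketch below pulls such payloads out with a regular expression; it ignores non-base64 data URIs and is not a substitute for a real HTML parser.

```python
import base64
import re

# Matches only base64-encoded data URIs such as data:image/png;base64,....
DATA_URI = re.compile(r"data:([\w.+-]+/[\w.+-]+);base64,([A-Za-z0-9+/=]+)")


def extract_data_uris(html):
    """Return (mime type, decoded bytes) for each base64 data URI found."""
    found = []
    for match in DATA_URI.finditer(html):
        mime, payload = match.groups()
        found.append((mime, base64.b64decode(payload)))
    return found


# Hypothetical page with one inline "image".
payload = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")
page = '<html><body><img src="data:image/png;base64,' + payload + '"></body></html>'
print(extract_data_uris(page))
```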

MSOffice: OLE2 (doc/ppt/xls)

• Directory + file based format like zip
• Embedded files need to be parsed out of streams; they may not be stored as separate files even within the zip-like structure
• Old .doc files may contain a “save history” – full file paths for where the file has been saved
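
A first triage step is simply telling an OLE2 container from a zip by its leading bytes. The sketch below checks the well-known 8-byte OLE2 signature and the usual zip local-file-header signature; the function name is hypothetical, and this only identifies the container, it does not parse any streams.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                       # zip local file header


def sniff_container(head):
    """Classify a file by its first bytes as 'ole2', 'zip' or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


print(sniff_container(OLE2_MAGIC + b"\x00" * 8))
print(sniff_container(b"PK\x03\x04rest-of-file"))
```

Getting the embedded files themselves still requires walking the OLE2 directory and decoding the relevant streams with a dedicated parser.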

MSOffice: OOXML (.docx/pptx/xlsx)

• Zip files
• Embedded files may be stored as standalone within the zip structure
• Extra “bonus” embedded files not part of the docx/pptx/xlsx may be stored
• Files may be wrapped in an OLE2 stream
• Full file paths for the source locations for embedded files may be stored in xml within the zip file
• Excel may store the full “last saved” path
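
Because OOXML files are zips, a quick first look is to list the zip entries. The sketch below assumes the conventional part locations `word/embeddings/`, `ppt/embeddings/` and `xl/embeddings/`; producers are not obliged to use them, so an empty result proves nothing. The function name and the demo file are made up.

```python
import io
import zipfile

# Conventional locations for embedded objects (assumption: common
# practice, not a guarantee).
EMBED_DIRS = ("word/embeddings/", "ppt/embeddings/", "xl/embeddings/")


def list_embedded_parts(data):
    """Return zip entry names that sit under a conventional embeddings folder."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    return sorted(name for name in names if name.startswith(EMBED_DIRS))


def build_demo_docx():
    """Build a toy zip that mimics the layout of a .docx (not a valid one)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/media/image1.png", b"fake image")
        zf.writestr("word/embeddings/oleObject1.bin", b"fake OLE2 wrapper")
    return buf.getvalue()


print(list_embedded_parts(build_demo_docx()))
```

An entry such as `oleObject1.bin` is typically itself an OLE2 container, so it needs a second parsing pass to reach the real attachment.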

PDF

• Incremental Updates
• Attachments

PDF Incremental Updates

https://developers.foxit.com/developer-hub/document/incremental-updates/
Simply truncate to earlier %%EOFs to get earlier file(s)!
See also tool: pdfresurrect
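
The truncation idea can be sketched in a few lines of Python: find every `%%EOF` marker and cut the file just after it. This is a simplification under stated assumptions. It treats any `%%EOF` byte sequence as a revision boundary, which may overcount (for example in linearized files or when the marker appears inside a stream). The function name is hypothetical.

```python
def earlier_revisions(pdf_bytes):
    """Return one candidate file per %%EOF marker, each cut right after the marker."""
    marker = b"%%EOF"
    revisions = []
    start = 0
    while True:
        idx = pdf_bytes.find(marker, start)
        if idx == -1:
            break
        end = idx + len(marker)
        revisions.append(pdf_bytes[:end])
        start = end
    return revisions


# Toy bytes standing in for a PDF with one incremental update.
fake_pdf = b"%PDF-1.4\noriginal body\n%%EOF\nupdate body\n%%EOF\n"
for number, revision in enumerate(earlier_revisions(fake_pdf)):
    print(number, len(revision))
```

Each candidate still has to be opened with a real PDF parser to confirm it is a usable earlier version.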

Fun file (starting around p 51):
https://www.usitc.gov/publications/337/pub1859.pdf
PDFResurrect – Incremental Updates, 1 million sample from Common Crawl CC-MAIN-2021-31

Updates   Percentage
0         77%
1         21.75%*
2         1.02%
3         0.30%
4         0.14%
5         0.07%
6         0.04%
7         0.03%
8         0.02%
9         0.01%

* Many files created by MSWord with a single incremental update are not significantly different!
Max in 1 million sample: 3,441 incremental updates
PDF Attachments

• Overheard: “We only have to worry about attachments in Portfolio PDFs”

No!!! Nearly all* PDFs may contain attachments!


Nearly all*

• Files that conform to PDF/A-1 are not allowed to contain attachments.

For all practical purposes, I don’t understand why an ingest pipeline wouldn’t look for attachments whether or not the PDF alleges that it is PDF/A-1 or whether or not the PDF actually passes a conformance check for PDF/A-1.

More simply: assume everything has attachments until proven otherwise.

All PDFs may contain attachments!

• The dataset in the following consists of 8 million PDFs from one month of Common Crawl CC-MAIN-2021-31
• ~50k had at least one attachment
• Only 670 files were “Portfolio PDFs”

Apache Tika – Attached Files, Embedded Depth = 1
Mime Count
text/plain; charset=ISO-8859-1 49,288
application/pdf 12,090
text/plain; charset=windows-1252 5,045
audio/mpeg 4,840
application/x-shockwave-flash 4,740
application/xml 4,727
text/html; charset=UTF-8 3,390
image/png 2,099
image/gif 1,656
image/svg+xml 1,476

Apache Tika – Attached Files, Embedded Depth > 0
Mime Count
text/plain; charset=ISO-8859-1 49,397
image/wmf 14,419
application/pdf 12,564
application/vnd.ms-equation 12,387
image/png 7,126
text/plain; charset=windows-1252 5,127
application/xml 4,959
audio/mpeg 4,886
application/x-shockwave-flash 4,753
text/html; charset=UTF-8 3,391

Apache Tika – Maximum Number of Embedded Files
Embedded File Counts   PDF Count
1                      42,054
2                      1,416
3                      884
4                      563
6                      423
5                      303
8                      193
7                      176
9                      171
16                     145

Only 670 PDFs are “Portfolio PDFs”
One file has 3,852 embedded files!

Apache Tika – Embedded File Depths

Embedded Depth Count
0 7,931,327
1 98,268
2 37,547
3 3,137
4 177

PDFs may include full file paths for various reasons

TIFF
ExifTool on ~5300 TIFFs

• ~2400 have a binary data field (~800 of these are thumbnails)
• They may contain OCR’d text
• They may contain full file paths

Source: https://corpora.tika.apache.org/base/share/tiffs-out.txt.gz
Files that mostly only exist as
embedded files

WMF and EMF – Top 10 containers
• Windows Metafile format (WMF)
• Enhanced Windows Metafile (EMF)

Habitat within Apache Tika’s 1 million file regression test sample

Container file   EMF/WMF Counts
ppt              57,869
doc              36,464
docx             8,246
image/emf        6,798
xls              5,007
pptx             2,970
rtf              2,163
xlsx             1,913
xls (macro)      988
pptx (variant)   381

WMF – Windows Metafile

“WMF specifies structures for defining a graphical image. A WMF metafile contains drawing commands, property definitions, and graphics objects in a series of WMF records.”
WMF 17.0 Specification

Like PDF, WMF may contain extractable text!

EMF – Enhanced Windows Metafile

• “Enhanced metafile format (EMF) is a file format that is used to store portable representations of graphical images. EMF metafiles contain sequential records that are parsed and processed to render the stored image on any output device.”

EMF 17.0 Specification

EMF, EMF+, EMFSpool
Like PDF, these file types may contain extractable text AND attached files!

AND “EMF metafiles define a mechanism for the encapsulation of arbitrary vendor-defined data. The EMR_COMMENT record (section 2.3.3.1) can contain arbitrary private data that is unknown to EMF.”
EMF 17.0 Specification
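
To make the EMR_COMMENT point concrete, here is a rough Python sketch that walks EMF records and yields the raw comment payloads. It assumes the basic record layout as I understand it from the specification: each record starts with a 4-byte little-endian type and a 4-byte size, and a comment record (type 0x46) follows with a 4-byte data size and then the private data. The function name is hypothetical and the code does not interpret EMF+ or EMFSpool comment identifiers; check the details against the specification before relying on it.

```python
import struct

EMR_EOF = 0x0E      # end-of-file record type
EMR_COMMENT = 0x46  # comment record type


def iter_comment_payloads(emf):
    """Yield the private data bytes of every comment record in an EMF byte string."""
    offset = 0
    while offset + 8 <= len(emf):
        rec_type, rec_size = struct.unpack_from("<II", emf, offset)
        if rec_size < 8 or offset + rec_size > len(emf):
            break  # malformed or truncated record: stop rather than guess
        if rec_type == EMR_COMMENT and rec_size >= 12:
            (data_size,) = struct.unpack_from("<I", emf, offset + 8)
            start = offset + 12
            end = min(start + data_size, offset + rec_size)
            yield emf[start:end]
        if rec_type == EMR_EOF:
            break
        offset += rec_size
```

A payload that begins with a recognizable signature (for example `%PDF`) is a candidate attached file and can be handed to the appropriate parser.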

EMF attachments in Apache Tika’s 1 million file regression sample

Attachments in EMFs   Counts
image/wmf             86,810
application/pdf       662

XMP – eXtensible Metadata Platform

• Habitat: PDF, JPEGs, Photoshop, PNG, TIFF, video, and more
• Embedded files
• SVG
• HTML
• Thumbnails (e.g. JPEGs)

XMP – some potential issues

• History – which software package modified the file, when, and the nature of the modification
• User information – creator, modified by, file path links to embedded files or external resources/references
• OriginalDocumentId, documentId, InstanceId
• Other metadata: title, keywords, subject
• Embedded binary images!
• Custom metadata!

XFA – XML Forms Architecture

• Habitat: PDF
• XML representation of forms, questions and
answers

• NOTE: If the text extractor/parser is only processing the PDF content, it will miss content from these XMLs

Why file-format specific tools (alone) are not sufficient

• Maximum embedded depth in the 8 million PDF corpus from Common Crawl is 4
• Maximum embedded depth in the ~1 million Tika regression corpus is 7
• For full workflow, file-format specific tools must be able to call each other arbitrarily
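
The “call each other arbitrarily” requirement amounts to a recursive dispatcher: detect the type, unpack with the matching handler, and feed every child back into detection. The toy Python sketch below covers only zip and gzip and uses leading-byte checks for detection; the handler table and function names are illustrative, not how any particular tool is built.

```python
import gzip
import io
import zipfile


def unpack_zip(data):
    """Return (name, bytes) for each file entry in a zip."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [(n, zf.read(n)) for n in zf.namelist() if not n.endswith("/")]


def unpack_gzip(data):
    """Return the single decompressed member of a gzip stream."""
    return [("<gunzipped>", gzip.decompress(data))]


# Detection by leading bytes; a real pipeline needs far better detection.
HANDLERS = [(b"PK\x03\x04", unpack_zip), (b"\x1f\x8b", unpack_gzip)]


def walk(name, data, depth=0, max_depth=10):
    """Yield (depth, path) for a file and everything nested inside it."""
    yield depth, name
    if depth >= max_depth:
        return  # guard against zip bombs and quines
    for magic, handler in HANDLERS:
        if data.startswith(magic):
            for child_name, child_data in handler(data):
                yield from walk(name + "/" + child_name, child_data, depth + 1, max_depth)
            break
```

The `max_depth` guard matters in practice: without a limit, a quine or a deliberately nested archive keeps the recursion going indefinitely.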

Conclusion

• Treat every file as if it has attachments until proven otherwise
• Develop budgets, risk assessments and workflows accordingly

Extras

• The Portfolio PDF and the PPT with attachment are in the iPres2022 bakeoff corpus: https://drive.google.com/drive/folders/1ACktqBv_YooW9DLInBM5ad_I0yHJRoLU
• Step by step commandlines with example data for this talk: https://cwiki.apache.org/confluence/display/TIKA/Open+Preservation+Foundation+Talk+--+21+September+2022

Some tools
• Apache Tika
• ExifTool
• pdfresurrect
• Poppler: pdfinfo, pdfimages, pdfdetach
• Didier Stevens (forensics): oledump.py, pdfid.py, pdftool.py
• Philippe Lagadec (forensics): oletools

• Great blog from @bitsgalore on opensource PDF tools: https://www.bitsgalore.org/2021/09/06/pdf-processing-and-analysis-with-open-source-tools

Commandlines
• JSON embedded file output
• java -jar tika-app-2.y.z.jar -J -t "digitally_signed_3D_Portfolio(1).pdf"
• JSON embedded file output batch
• java -jar tika-app-2.y.z.jar -J -t -i <input_dir> -o <output_dir>
• Dump first level attachments
• java -jar tika-app-2.y.z.jar -z "digitally_signed_3D_Portfolio(1).pdf"

NOTE: -z only extracts the first level. If you want full recursive extraction, please open an issue: https://issues.apache.org/jira/projects/TIKA
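
Once the JSON output exists, tallies like the mime-by-depth tables earlier in this talk are a short script away. The sketch below assumes the output is a JSON list of metadata objects and that the keys are `Content-Type` and `X-TIKA:embedded_depth`; that is what I would expect from Tika 2.x, but treat the key names as assumptions and check them against a real output file. The function name is hypothetical.

```python
import json
from collections import Counter


def tally_mimes_by_depth(tika_json):
    """Count (embedded depth, mime type) pairs in Tika-style JSON metadata."""
    counts = Counter()
    for record in json.loads(tika_json):
        # Assumed key names; a missing depth is treated as the container (0).
        depth = int(record.get("X-TIKA:embedded_depth", 0))
        mime = record.get("Content-Type", "unknown")
        if isinstance(mime, list):
            mime = mime[0]
        counts[(depth, mime)] += 1
    return counts


# Hand-written sample in the assumed shape, not real Tika output.
sample = json.dumps([
    {"Content-Type": "application/pdf", "X-TIKA:embedded_depth": "0"},
    {"Content-Type": "text/plain; charset=ISO-8859-1", "X-TIKA:embedded_depth": "1"},
    {"Content-Type": "image/png", "X-TIKA:embedded_depth": "2"},
    {"Content-Type": "image/png", "X-TIKA:embedded_depth": "2"},
])
for (depth, mime), count in sorted(tally_mimes_by_depth(sample).items()):
    print(depth, mime, count)
```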
XMP

• Extract the literal XMP from file.pdf into xmp.xmp
• exiftool -a -o xmp.xmp file.pdf
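
For a rough cross-check without any tooling, XMP packets can often be located by scanning for the `<?xpacket begin=` … `<?xpacket end=` wrapper. The Python sketch below does only that, and only for packets stored as plain, uncompressed single-byte-compatible text; anything inside a compressed stream will be missed. The function name and sample bytes are made up.

```python
import re

# Packet wrapper: <?xpacket begin=...?> ... <?xpacket end=...?>
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>.*?<\?xpacket end=.*?\?>", re.DOTALL
)


def find_xmp_packets(data):
    """Return every byte range that looks like a complete XMP packet."""
    return [match.group(0) for match in XPACKET.finditer(data)]


# Toy bytes standing in for a file with one embedded packet.
packet = (
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
    b'<?xpacket end="w"?>'
)
blob = b"leading bytes" + packet + b"trailing bytes"
print(len(find_xmp_packets(blob)))
```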

Commandlines

• Dump contents of OLE2 to local directory
• java -cp tika-app-2.Y.Z.jar org.apache.poi.poifs.dev.POIFSDump testWORD_1img.doc
• List contents of OLE2 files
• java -cp ~/Intellij/tika-main/tika-app/target/tika-app-2.Y.Z.jar org.apache.poi.poifs.dev.POIFSViewer 261779.ppt

See also Didier Stevens’ oledump.py
