Week 3 4 5 Embeddedfiles
9/22/22 3 jpl.nasa.gov
Takeaways
Why should you care?
https://twitter.com/WeirdMedieval/status/1532319439684874240
9/22/22 © 2022 California Institute of Technology. Government sponsorship acknowledged. 5 jpl.nasa.gov
Why should you care?
“Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction”
https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
• Zip files
• Embedded files may be stored as standalone files within the
zip structure
• Extra “bonus” embedded files that are not part of the
docx/pptx/xlsx may be stored
• Files may be wrapped in an OLE2 stream
• Full file paths for the source locations of embedded files
may be stored in XML within the zip file
• Excel may store the full “last saved” path:
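A minimal sketch of how the standalone embedded files in the zip structure could be listed, using only Python's standard library. The directory names in `EMBEDDING_DIRS` and the function name `list_embedded_entries` are assumptions made for illustration; real packages may keep embedded content elsewhere, and entries wrapped in an OLE2 stream would still need to be unwrapped separately.

```python
import zipfile

# Directories where docx/xlsx/pptx packages conventionally keep embedded
# objects. This list is an assumption and may be incomplete.
EMBEDDING_DIRS = ("word/embeddings/", "xl/embeddings/", "ppt/embeddings/")


def list_embedded_entries(source):
    """Return names of zip entries stored under an embeddings directory.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return [
            name
            for name in package.namelist()
            # Skip directory entries, which end with a slash.
            if name.startswith(EMBEDDING_DIRS) and not name.endswith("/")
        ]
```

Calling `list_embedded_entries("report.docx")` (a hypothetical file name) would return entry names only; comparing them against the full `namelist()` is one way to spot unexpected extras.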
• Incremental Updates
• Attachments
https://developers.foxit.com/developer-hub/document/incremental-updates/
Simply truncate the file at an earlier %%EOF to recover an earlier version of the file!
See also tool: pdfresurrect
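A rough sketch of the truncation idea in Python, assuming the whole PDF fits in memory. The function name `earlier_revisions` is hypothetical, and this is a naive byte search: bytes that happen to match the marker inside a stream would produce false cut points, so treat the results as candidates to inspect rather than guaranteed valid PDFs.

```python
EOF_MARKER = b"%%EOF"


def earlier_revisions(pdf_bytes):
    """Return the byte prefixes ending at each %%EOF except the last one."""
    cut_points = []
    start = 0
    while True:
        index = pdf_bytes.find(EOF_MARKER, start)
        if index == -1:
            break
        end = index + len(EOF_MARKER)
        cut_points.append(end)
        start = end
    # The final marker closes the current version, so it is not "earlier".
    return [pdf_bytes[:end] for end in cut_points[:-1]]
```

Each returned prefix can be written to its own file and opened in a PDF reader to see what the document looked like before later incremental updates were appended.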
Source: https://corpora.tika.apache.org/base/share/tiffs-out.txt.gz
Files that mostly exist only as
embedded files
• Habitat: PDF
• XML representation of forms, questions and
answers
NOTE: -z extracts only the first level of embedded files. If you want fully
recursive extraction, please open an issue: https://issues.apache.org/jira/projects/TIKA
XMP