Testing PDFs With Python

This document describes how to test PDFs using Python and the PyPDF2 library. It provides code to check internal and external links in a PDF for broken links. It also provides a method to check if all fonts used in the PDF are embedded by walking the PDF object structure. The code prints summaries of the links and fonts including any broken links or unembedded fonts. It also notes that additional testing of PDF metadata, encryption, and page counts can be done using PyPDF2 methods.

Uploaded by

Samir Benakli

Testing PDFs with Python

Mon 26 January 2015

Overview
When you generate PDFs, you need a way to test their integrity—not only must they be valid,
but they should behave correctly and display consistently, even on different platforms. This
article describes how you can use the PyPDF2 library to test your PDF files for broken links
(both internal and external), and how to find fonts that are not embedded in the PDF.

Note that some PDF readers are ‘smart’ and will create a live hyperlink from a string of text
that looks like a URL, even though the text is not coded as a URL in the PDF file. The
technique described in this article does not address this issue—it only tests the actual URLs
present in the PDF file.

Code from this article is available as a gist on GitHub. Get a copy and play along.

Testing Internal and External Links

To test the links, we’ll create a function to check the URLs found inside the PDF file. This
function uses the requests library, which you can install with pip.

There may be some FTP links, which are not directly supported by the requests library. You
can either use the urllib package as in the following code, or you can use the requests-ftp
package, available on PyPI.

from PyPDF2 import PdfFileReader

from pprint import pprint
import requests
import sys
import urllib.request

The check_ftp function checks a given URL for a response. If the request fails, or if the
response is empty, it returns False along with the reason; otherwise it returns True.

def check_ftp(url):
    try:
        response = urllib.request.urlopen(url)
    except IOError as e:
        # urllib raises URLError, a subclass of IOError, when the URL cannot be opened
        result, reason = False, e
    else:
        if response.read():
            result, reason = True, 'okay'
        else:
            result, reason = False, 'Empty Page'
    return result, reason
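
The same try/except/else pattern can be tried without a network connection. The following is a minimal sketch, not part of the original gist: it assumes that file:// URLs are an acceptable stand-in for FTP URLs (urllib opens both), and the check_readable name and the temporary file names are hypothetical.

```python
import tempfile
import urllib.request
from pathlib import Path


def check_readable(url):
    """Return (ok, reason) for any URL that urllib can open."""
    try:
        with urllib.request.urlopen(url) as response:
            body = response.read()
    except IOError as e:  # URLError is a subclass of OSError/IOError
        return False, e
    if body:
        return True, 'okay'
    return False, 'Empty Page'


# Build two throwaway files: one with content, one empty.
tmpdir = tempfile.mkdtemp()
full = Path(tmpdir, 'full.txt')
empty = Path(tmpdir, 'empty.txt')
full.write_text('hello')
empty.write_text('')

full_result = check_readable(full.as_uri())
empty_result = check_readable(empty.as_uri())
missing_result = check_readable(Path(tmpdir, 'missing.txt').as_uri())
```

The three results cover the three branches: a readable resource, an empty one, and one that cannot be opened at all.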

The check_url function is also simple: if the URL starts with ftp, it delegates to the
check_ftp function. Otherwise, it attempts to get the URL with a timeout, using typical
header values. The function then returns the result (the HTTP status code, or False on
failure) along with the reason it succeeded or failed.

def check_url(url, auth=None):

    headers = {'User-Agent': 'Mozilla/5.0', 'Accept': '*/*'}
    if url.startswith('ftp://'):
        result, reason = check_ftp(url)
    else:
        try:
            response = requests.get(url, timeout=6, auth=auth,
                                    headers=headers)
        except (requests.ConnectionError,
                requests.HTTPError,
                requests.Timeout) as e:
            result, reason = False, e
        else:
            if response.text:
                result, reason = response.status_code, response.reason
            else:
                result, reason = False, 'Empty Page'

    return result, reason
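
The delegation step can also be written as a lookup on the URL scheme. This is only a sketch of one possible variation, not the article's method: dispatch_check, fake_ftp, and fake_http are hypothetical names, and the stub checkers stand in for functions that would do real network I/O.

```python
from urllib.parse import urlsplit


def dispatch_check(url, checkers, default):
    """Pick a checker by URL scheme and return its (result, reason)."""
    scheme = urlsplit(url).scheme.lower()
    return checkers.get(scheme, default)(url)


# Stand-in checkers so the sketch runs offline; real ones would do I/O.
def fake_ftp(url):
    return True, 'okay (ftp stub)'


def fake_http(url):
    return 200, 'OK (http stub)'


checkers = {'ftp': fake_ftp}
ftp_outcome = dispatch_check('FTP://example.org/file.txt', checkers, fake_http)
http_outcome = dispatch_check('https://example.org/', checkers, fake_http)
```

Unlike a plain startswith test, parsing the scheme treats FTP:// and ftp:// the same way, which may or may not matter for the PDFs you test.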

Now that we have this utility, we can check the PDF file. We will create four lists:

• links: The internal PDF links in the file; for example, a reference to a section or figure.
• badlinks: Of the internal links in the file, the links that target a missing
  destination (broken links).
• urls: The links from the PDF to an external location; for example, a hyperlink to a
  web site.
• badurls: Of the external links in the file, the URLs that target a missing
  destination (broken URLs).

Now for the PyPDF2 goodies. The following check_pdf function loops over the pages in the
PDF file object. For each page, it walks through the annotation dictionaries listed under
the page's /Annots key. If an annotation has an action dictionary (/A) with a /D key (the
destination), it is an internal link, so the function adds the destination to the links list.

If the action has a /URI key, it is an external link. The function checks external links
with the check_url function and updates the urls and badurls lists.

After checking each page, the function gets all the anchors in the PDF with the
getNamedDestinations method and compares them to the list of internal links it just built.
If there is a link with no matching anchor, that link belongs in the badlinks list.

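def check_pdf(pdf):
    links = list()
    urls = list()
    badurls = list()
    for page in pdf.pages:
        for annot in page.getObject().get('/Annots', []):
            action = annot.getObject().get('/A', {})
            if '/D' in action:  # internal link
                links.append(action['/D'])
            elif '/URI' in action:  # external link
                urls.append(action['/URI'])
                if not check_url(action['/URI'])[0]:
                    badurls.append(action['/URI'])
    anchors = pdf.getNamedDestinations()
    badlinks = [x for x in links if x not in anchors]
    return links, badlinks, urls, badurls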
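
To see the classification logic in isolation, the sketch below runs it on plain Python dictionaries shaped like the annotation structure described above. The sample pages, the anchors mapping, and the collect_links name are all made up for illustration; a real PDF would supply PyPDF2 objects instead, and no URLs are fetched here.

```python
def collect_links(pages):
    """Split link annotations into internal destinations and external URIs."""
    links, urls = [], []
    for page in pages:
        for annot in page.get('/Annots', []):
            action = annot.get('/A', {})
            if '/D' in action:
                links.append(action['/D'])
            elif '/URI' in action:
                urls.append(action['/URI'])
    return links, urls


# Hypothetical data shaped like the dictionaries described above.
pages = [
    {'/Annots': [{'/A': {'/D': 'section-1'}},
                 {'/A': {'/URI': 'http://example.com/'}}]},
    {'/Annots': [{'/A': {'/D': 'figure-9'}}]},
    {},  # a page with no annotations
]
anchors = {'section-1': None}  # named destinations that really exist

links, urls = collect_links(pages)
badlinks = [x for x in links if x not in anchors]
```

Here figure-9 ends up in badlinks because no named destination matches it.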

Finally, make the code into a callable script that takes a single argument, the path to the PDF
file. Then print the results of the check_pdf function on stdout.

if __name__ == '__main__':
    fname = sys.argv[1]
    print('Checking %s' % fname)
    pdf = PdfFileReader(fname)
    links, badlinks, urls, badurls = check_pdf(pdf)
    print('urls: ', urls)
    print()
    print('bad links: ', badlinks)
    print()
    print('bad urls: ', badurls)

Test for Embedded Fonts

Test to make sure that the fonts used in the PDF file are embedded. If a font is not embedded,
your PDF file may display differently on different machines, even if it is a font that is
putatively “standard”, like Times Roman or Helvetica. To ensure that your PDF displays as
intended on any machine, all fonts must be embedded.

In the following code, the walk function is a recursive function that takes a dictionary-like
object (obj) and two sets (fnt and emb). It walks the given dictionary object: for every key in
the given dictionary, the function calls itself on the corresponding value (if that value is a
nested dictionary).

If the dictionary has a key called BaseFont, the value corresponding to that key is the name of
a font used in the PDF; add that font name to the fnt set of fonts used.

If the dictionary has a key called FontName, the dictionary is a descriptor for that font, so
check for another key in the same font descriptor dictionary that begins with FontFile (the
key could be FontFile, FontFile2, or FontFile3). If that key exists, the font is embedded;
add that font name to the set of fonts embedded.

If the two sets are not identical, there are unembedded fonts in the PDF.

fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])

def walk(obj, fnt, emb):

    if '/BaseFont' in obj:
        # A font dictionary: record the name of the font in use
        fnt.add(obj['/BaseFont'])
    elif '/FontName' in obj and fontkeys.intersection(set(obj)):
        # A font descriptor with a FontFile* key: the font is embedded
        emb.add(obj['/FontName'])

    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)

    return fnt, emb
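
The recursion is easier to follow on a small hand-built dictionary. The sketch below is an illustrative variant rather than the article's code: the resources data is invented, and the seen parameter is an added assumption that stops the walk if two dictionaries refer to each other.

```python
FONT_KEYS = {'/FontFile', '/FontFile2', '/FontFile3'}


def walk(obj, fnt, emb, seen=None):
    """Collect used and embedded font names from nested dictionaries."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # guard against reference cycles (an added assumption)
        return fnt, emb
    seen.add(id(obj))

    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    elif '/FontName' in obj and FONT_KEYS.intersection(obj):
        emb.add(obj['/FontName'])

    for key in obj:
        if hasattr(obj[key], 'keys'):
            walk(obj[key], fnt, emb, seen)
    return fnt, emb


# Hypothetical /Resources dictionary: F1 is embedded, F2 is not.
resources = {
    '/Font': {
        '/F1': {'/BaseFont': '/ABCDEF+Lato',
                '/FontDescriptor': {'/FontName': '/ABCDEF+Lato',
                                    '/FontFile2': b'...'}},
        '/F2': {'/BaseFont': '/Helvetica'},
    }
}
resources['/Font']['/F2']['/Parent'] = resources  # a cycle, to exercise the guard

fonts, embedded = walk(resources, set(), set())
unembedded = fonts - embedded
```

With this data, both fonts are reported as used, only the Lato subset counts as embedded, and the set difference leaves Helvetica as the unembedded font.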

Finally, make the code into a callable script that takes a single argument, the path to the
PDF file.

Start with two empty sets, fonts and embedded. Open the file with PyPDF2. The library
gives us access to the internal structure of the PDF. We loop over each page in the PDF,
passing the page’s Resources dictionary to the walk function, described above. Add the
corresponding results to the two sets and calculate the unembedded fonts by differencing
the sets.

Print the fonts used in the PDF file and, if there are unembedded fonts, print their names as
well. Of course, you can do anything you want with this information, such as save it to a test
database or print a report.

if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()

    for page in pdf.pages:

        obj = page.getObject()
        f, e = walk(obj['/Resources'], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)

    unembedded = fonts - embedded

    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)
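
If you would rather store the results than print them, one option is to turn the two sets into a small report. This is a sketch under assumptions: the font_report name, the report keys, and the sample font names are hypothetical, and JSON is just one convenient format.

```python
import json


def font_report(fname, fonts, embedded):
    """Summarize a font check as a JSON-serializable dictionary."""
    unembedded = sorted(fonts - embedded)
    return {
        'file': fname,
        'fonts': sorted(fonts),
        'unembedded': unembedded,
        'passed': not unembedded,
    }


# Hypothetical results, as the sets might look after walking a PDF.
report = font_report('example.pdf',
                     {'/Helvetica', '/ABCDEF+Lato'},
                     {'/ABCDEF+Lato'})
report_json = json.dumps(report, indent=2, sort_keys=True)
```

The passed flag gives a test runner a single value to assert on, while the lists keep the detail needed to track down the offending font.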

Using PyPDF2 Methods

Obviously, the more you can specify about the PDFs you produce, the more you can test. For
example, you may know that your PDF should have specific metadata, should be encrypted,
contain a certain number of pages, and so on.

You can test for those conditions with the built-in tools that the PdfFileReader class in
PyPDF2 provides. If you have a PdfFileReader instance, you can use the following properties
for testing:

documentInfo
    returns the document metadata, such as author, creator, producer, subject, and title.
isEncrypted
    returns a boolean value specifying whether the document is encrypted.
numPages
    returns the number of pages in the document.
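
These properties lend themselves to a simple expectations check. The sketch below assumes only that the reader object exposes the three properties listed above; check_expectations is a hypothetical helper, and a SimpleNamespace stands in for a real PdfFileReader so the example runs without a PDF file.

```python
from types import SimpleNamespace


def check_expectations(reader, pages=None, encrypted=None, title=None):
    """Return a list of human-readable mismatches (empty means all good)."""
    problems = []
    if pages is not None and reader.numPages != pages:
        problems.append('expected %d pages, found %d' % (pages, reader.numPages))
    if encrypted is not None and reader.isEncrypted != encrypted:
        problems.append('expected isEncrypted=%r' % encrypted)
    if title is not None and reader.documentInfo.title != title:
        problems.append('expected title %r, found %r'
                        % (title, reader.documentInfo.title))
    return problems


# Stand-in for a reader so the sketch runs without a PDF file.
fake_reader = SimpleNamespace(
    numPages=5,
    isEncrypted=False,
    documentInfo=SimpleNamespace(title='Quarterly Report'),
)

ok = check_expectations(fake_reader, pages=5, encrypted=False,
                        title='Quarterly Report')
bad = check_expectations(fake_reader, pages=6)
```

With a real file you would pass the PdfFileReader instance instead of the stand-in. A PDF without metadata may need an extra guard before reading the title.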

Summary
If you produce PDF documents, you need to test them. The more you can specify about your
PDFs, the more you can test. This article describes how you can test that the links (internal
and external) are valid and that the fonts used in the document are embedded.
