Testing PDFs With Python
Overview
When you generate PDFs, you need a way to test their integrity—not only must they be valid,
but they should behave correctly and display consistently, even on different platforms. This
article describes how you can use the PyPDF2 library to test your PDF files for broken links
(both internal and external), and how to find fonts that are not embedded in the PDF.
Note that some PDF readers are ‘smart’ and will create a live hyperlink from a string of text
that looks like a URL, even though the text is not coded as a URL in the PDF file. The
technique described in this article does not address this issue—it only tests the actual URLs
present in the PDF file.
Code from this article is available as a gist on GitHub. Get a copy and play along.
Some links may use FTP, which the requests library does not support directly. You can
either use the urllib package, as in the following code, or use the requests-ftp package,
available on PyPI.
The check_ftp function checks a given url for a response. If the request fails, or if the
response is empty, it returns False along with the reason; otherwise it returns True along
with the string 'okay'.
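from urllib.request import urlopen  # Python 3; on Python 2, use: from urllib import urlopen

def check_ftp(url):
    """Return (True, 'okay') if the url yields data, else (False, reason)."""
    try:
        response = urlopen(url)
    except IOError as e:
        result, reason = False, e
    else:
        if response.read():
            result, reason = True, 'okay'
        else:
            result, reason = False, 'Empty Page'
    return result, reason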
The check_url function is also simple: if the url starts with ftp, it delegates to the
check_ftp function. Otherwise, it attempts to get the url with some timeout value using
typical header values. The function then returns the response along with the reason it
succeeded or failed.
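A minimal sketch of such a check_url function might look like the following. It assumes the
requests library is installed and that check_ftp is the function shown earlier. The HEADERS
value, the 10-second default timeout, and the get and ftp_checker parameters are illustrative
assumptions; the last two exist only so the function can be exercised without network access.

```python
# Assumed header values; the article does not say which ones it sends.
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; pdf-link-check)'}


def check_url(url, timeout=10, get=None, ftp_checker=None):
    """Return (result, reason) for one external link.

    get and ftp_checker are hypothetical hooks for testing; by default the
    function uses requests.get and the check_ftp function shown earlier.
    """
    if url.startswith('ftp'):
        if ftp_checker is None:
            ftp_checker = check_ftp  # defined earlier in the article
        return ftp_checker(url)
    if get is None:
        import requests  # third-party; imported lazily so the sketch loads without it
        get = requests.get
    try:
        response = get(url, timeout=timeout, headers=HEADERS)
    except Exception as e:  # connection errors, timeouts, malformed urls
        return False, e
    # response.ok is False for 4xx and 5xx status codes.
    return response.ok, response.reason
```

A False result, together with its reason, can then be recorded for the report.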
Now that we have this utility, we can check the PDF file. We will create four lists:
links
The internal PDF links in the file; for example, a reference to a section or figure.
badlinks
Of the internal links in the file, these are links that target a missing destination
(broken link).
urls
The links from the PDF to an external location; for example, a hyperlink to a web site.
badurls
Of the external links in the file, these are the urls that target a missing destination
(broken url).
Now for the PyPDF2 goodies. The following check_pdf function loops over the pages in the
PDF file object. For each page, it walks through the annotations listed under the /Annots
key. If an annotation has an action (/A) with a key of /D (destination), that is an internal
link, so update the links list with the destination.
If the annotation has an action with a key of /URI, it is an external link. Check the external
links with the check_url function and update the urls and badurls lists.
After checking each page, get all the anchors in the PDF from the namedDestinations
property; compare that list of all known anchors to the list of internal links we just
created. If there is a link with no matching anchor, that link belongs in the badlinks list.
def check_pdf(pdf):
    links = list()
    urls = list()
    badurls = list()
    # The per-page annotation loop described above belongs here; it fills
    # links, urls, and badurls.
    anchors = pdf.namedDestinations.keys()
    badlinks = [x for x in links if x not in anchors]
    return links, badlinks, urls, badurls
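The per-page loop described above could be sketched roughly as follows. This is an
illustration, not the article's own code: the collect_links name, its check parameter, and
the choice to store (url, reason) pairs in badurls are assumptions. It relies only on
dictionary-style access, so it works on plain dictionaries as well as on the page objects
of the PyPDF2 1.x API used in this article.

```python
def collect_links(pages, check):
    """Sort the link annotations on each page into internal and external links.

    pages is any iterable of dictionary-like page objects; check is a function
    such as check_url that returns (result, reason) for a url.
    """
    links, urls, badurls = [], [], []
    for page in pages:
        annots = page['/Annots'] if '/Annots' in page else []
        for annot in annots:
            # PyPDF2 array items may be indirect references; getObject()
            # resolves them. Plain dictionaries have no such method.
            if hasattr(annot, 'getObject'):
                annot = annot.getObject()
            action = annot['/A'] if '/A' in annot else {}
            if '/D' in action:
                links.append(action['/D'])       # internal link
            elif '/URI' in action:
                url = action['/URI']             # external link
                urls.append(url)
                result, reason = check(url)
                if not result:
                    badurls.append((url, reason))
    return links, urls, badurls
```

With a PdfFileReader instance named pdf, the call would be collect_links(pdf.pages, check_url).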
Finally, make the code into a callable script that takes a single argument, the path to the PDF
file. Then print the results of the check_pdf function on stdout.
import sys
from PyPDF2 import PdfFileReader

if __name__ == '__main__':
    fname = sys.argv[1]
    print('Checking %s' % fname)
    pdf = PdfFileReader(fname)
    links, badlinks, urls, badurls = check_pdf(pdf)
    print('urls: %s' % urls)
    print('')
    print('bad links: %s' % badlinks)
    print('')
    print('bad urls: %s' % badurls)
In the following code, the walk function is a recursive function that takes a dictionary-like
object (obj) and two sets (fnt and emb). It walks the given dictionary object: for every key in
the given dictionary, the function calls itself on the corresponding value (if that value is a
nested dictionary).
If the dictionary has a key called BaseFont, the value corresponding to that key is the name of
a font used in the PDF; add that font name to the fnt set of fonts used.
If the dictionary has a key called FontName, the dictionary is a descriptor for that font, so
check for another key in the same font descriptor dictionary that begins with FontFile (the
key could be FontFile, FontFile2, or FontFile3). If that key exists, the font is embedded;
add that font name to the set of fonts embedded.
If the two sets are not identical, there are unembedded fonts in the PDF.
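def walk(obj, fnt, emb):
    # The /BaseFont and /FontName checks described above come first; then
    # recurse into every nested dictionary-like value.
    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)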
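Putting the whole description together, a complete walk function might look like the sketch
below. It is a reconstruction from the prose above, not the article's own listing. It assumes
that keys carry the leading slash that PyPDF2 uses for PDF names (/BaseFont, /FontName,
/FontFile), and that the structure being walked contains no reference cycles. The sample
resources dictionary and its font names are made up for illustration.

```python
def walk(obj, fnt, emb):
    """Collect the font names used (fnt) and the font names embedded (emb)."""
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    if '/FontName' in obj:
        # A /FontFile, /FontFile2, or /FontFile3 key marks an embedded font.
        if any(key.startswith('/FontFile') for key in obj):
            emb.add(obj['/FontName'])
    for k in obj:
        if hasattr(obj[k], 'keys'):
            walk(obj[k], fnt, emb)
    return fnt, emb


# Made-up Resources dictionary: one standard font, one embedded font.
resources = {
    '/Font': {
        '/F1': {'/BaseFont': '/Helvetica'},
        '/F2': {
            '/BaseFont': '/ABCDEF+Times',
            '/FontDescriptor': {'/FontName': '/ABCDEF+Times', '/FontFile2': {}},
        },
    },
}
fonts, embedded = walk(resources, set(), set())
unembedded = fonts - embedded
```

Here unembedded holds the names that appear in fonts but not in embedded.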
Finally, make the code into a callable script that takes a single argument, the path to the
PDF file.
Start with two empty sets, fonts and embedded. Open the file with PyPDF2. The library
gives us access to the internal structure of the PDF. We loop over each page in the PDF,
passing the page’s Resources dictionary to the walk function, described above. Add the
corresponding results to the two sets and calculate the unembedded fonts by differencing
the sets.
Print the fonts used in the PDF file and, if there are unembedded fonts, print their names as
well. Of course, you can do anything you want with the information here, such as save it to a
test database, print a report, and so on.
if __name__ == '__main__':
    fname = sys.argv[1]
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()
    # The page loop, set difference, and printing described above follow here.
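The set difference and the printing step could be sketched as a small helper like the one
below. The font_report name and the wording of the output lines are assumptions made for
illustration, and the font names in the example are made up.

```python
def font_report(fonts, embedded):
    """Return (unembedded, lines): the set difference and printable lines."""
    unembedded = fonts - embedded
    lines = ['Fonts used: %s' % ', '.join(sorted(fonts))]
    if unembedded:
        lines.append('Unembedded fonts: %s' % ', '.join(sorted(unembedded)))
    return unembedded, lines


# Example with made-up font names.
unembedded, lines = font_report({'/Helvetica', '/ABCDEF+Times'}, {'/ABCDEF+Times'})
for line in lines:
    print(line)
```

In the script, fonts and embedded would be the two sets filled by calling walk on each page's
Resources dictionary.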
You can test for other conditions, such as document metadata, encryption, and page count,
with the built-in tools that the PdfFileReader class in PyPDF2 provides. If you have a
PdfFileReader instance, you can use the following properties for testing:
documentInfo
Returns the document metadata such as author, creator, producer, subject, and title.
isEncrypted
Returns a boolean value specifying whether the document is encrypted.
numPages
Returns the number of pages in the document.
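As a rough illustration of how these three properties might be combined into a test, consider
the sketch below. The check_properties name, its parameters, and the choice of /Title and
/Author as required metadata fields are assumptions, not part of PyPDF2. The function accepts
any object that exposes the three properties, such as a PdfFileReader instance.

```python
def check_properties(pdf, expected_pages=None, required_info=('/Title', '/Author')):
    """Return a list of problem descriptions; an empty list means no problems."""
    if pdf.isEncrypted:
        # Metadata and page count may not be readable until the file is decrypted.
        return ['document is encrypted']
    problems = []
    info = pdf.documentInfo or {}  # documentInfo may be None if there is no metadata
    for key in required_info:
        if not info.get(key):
            problems.append('missing metadata field %s' % key)
    if expected_pages is not None and pdf.numPages != expected_pages:
        problems.append('expected %d pages, found %d' % (expected_pages, pdf.numPages))
    return problems
```

A test could then assert that check_properties(pdf, expected_pages=12) returns an empty list,
where 12 stands in for whatever page count your document is specified to have.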
Summary
If you produce PDF documents, you need to test them. The more you can specify about your
PDFs, the more you can test. This article describes how you can test that the links (internal
and external) are valid and that the fonts used in the document are embedded.