22 Project 3 PDF Scraping in Python REGEX
In this project we use regex to extract a list of items from a PDF file.
Solution
A complete explanation of the Python code is outside the scope of this course
(hint: learn the Python module pdfquery). It should still be easy enough to
follow: we capture product_name from the PDF file with the bounding-box
selector LTTextLineHorizontal:in_bbox("40, 48, 181, 633"), then iterate over
the products, search each one with a regex, and print only those whose
manufacturer is Tandem.
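To make the bounding-box selector more concrete, here is a minimal pure-Python sketch of the containment test it performs. It assumes the four numbers are x0, y0, x1, y1 in PDF points and that an element matches only when it fits entirely inside the box, which is how the pdfquery documentation describes in_bbox. The helper name fits_in_bbox and the sample element coordinates are hypothetical and are not part of pdfquery.

```python
def fits_in_bbox(element_box, bbox):
    """Return True if element_box lies entirely inside bbox.

    Both arguments are (x0, y0, x1, y1) tuples in PDF points.
    """
    ex0, ey0, ex1, ey1 = element_box
    bx0, by0, bx1, by1 = bbox
    return ex0 >= bx0 and ey0 >= by0 and ex1 <= bx1 and ey1 <= by1


# The box used in the lesson for the product-name column.
PRODUCT_COLUMN = (40, 48, 181, 633)

# Hypothetical text-line boxes: one inside the column, one to its right.
inside = fits_in_bbox((45.0, 600.0, 170.0, 612.0), PRODUCT_COLUMN)
outside = fits_in_bbox((200.0, 600.0, 320.0, 612.0), PRODUCT_COLUMN)
print(inside, outside)
```

In practice you would find the right numbers by inspecting the coordinates of the text lines in your own PDF; the values above only apply to the lesson's data.pdf layout.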
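import re
import pdfquery
from lxml import etree

PDF_FILE = 'data.pdf'

# Load the PDF and parse its layout into a queryable tree.
pdf = pdfquery.PDFQuery(PDF_FILE)
pdf.load()

product_info = []
page_count = len(pdf._pages)

for pg in range(page_count):
    # Collect the text lines inside the product-name column of this page.
    data = pdf.extract([
        ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)),
        ('with_formatter', None),
        ('product_name', 'LTTextLineHorizontal:in_bbox("40, 48, 181, 633")'),
    ])
    # The step that turns data['product_name'] into dicts with 'Manufacturer'
    # and 'Model' keys and appends them to product_info is not shown here.

pdf.file.close()

# Print only the products whose manufacturer matches "Tandem" (case-insensitive).
for p in product_info:
    s = p['Manufacturer']
    m = re.search(r"Tandem", s, re.I)
    if m:
        print('Manufacturer: {}[Model {}]\n'.format(p['Manufacturer'], p['Model']))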
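The regex filtering step can be tried on its own, without a PDF. The sketch below assumes each product is a dict with 'Manufacturer' and 'Model' keys, as in the final loop of the listing. The sample records, the helper name filter_by_manufacturer, and the model names are made up for illustration and do not come from data.pdf.

```python
import re

# Hypothetical sample records; in the lesson these would come from the PDF.
product_info = [
    {'Manufacturer': 'Tandem Diabetes Care', 'Model': 'Sample A'},
    {'Manufacturer': 'Example Pumps Inc.', 'Model': 'Sample B'},
    {'Manufacturer': 'tandem diabetes care', 'Model': 'Sample C'},
]

# re.IGNORECASE (same flag as re.I) makes the match case-insensitive.
TANDEM = re.compile(r"Tandem", re.IGNORECASE)


def filter_by_manufacturer(products, pattern):
    """Keep the products whose 'Manufacturer' field matches the pattern."""
    return [p for p in products if pattern.search(p['Manufacturer'])]


matches = filter_by_manufacturer(product_info, TANDEM)
for p in matches:
    print('Manufacturer: {}[Model {}]'.format(p['Manufacturer'], p['Model']))
```

Because re.search looks for the pattern anywhere in the string, "Tandem" would also match inside a longer word. If that is too loose for your data, the pattern r"\bTandem\b" restricts the match to the whole word.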
We have preloaded the data onto educative.io’s server, so you should be able
to run the code right away and see the output.
From this result we can see that there are two models, T:flex and T:slim,
supplied by the manufacturer called ‘Tandem Diabetes Care’. The problem
solution has been adapted and simplified from the Reddit user
insainodwayno.