22 Project 3 PDF Scraping in Python REGEX

This document discusses extracting product information from a PDF file using Python and regular expressions. [It describes how] to list all equipment models developed by manufacturers containing "Tandem" from a diabetes.org PDF. It provides the input file, explains using the pdfquery module to search the PDF text using bounding boxes and regular expressions, and prints any matching models. The code extracts manufacturer and model data from the PDF and outputs the two Tandem Diabetes Care models found.

Uploaded by

ArvindSharma
Copyright
© All Rights Reserved

Project 3: PDF scraping in Python + REGEX

In this project we use regex to extract a list of items from a PDF file.

WE'LL COVER THE FOLLOWING

• PDF scraping example
• Input file
• Solution

PDF scraping example: #


In this project we will use a PDF file (see the screenshot below) from the
diabetes.org website. Our goal is to list all the equipment models made by
manufacturers whose names contain the word "tandem" (case-insensitive).

Find all the product models by the manufacturer called `Tandem`
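
The case-insensitive match on the manufacturer name can be sketched with Python's re module on its own, before any PDF handling is involved. This is a minimal illustration: apart from "Tandem Diabetes Care", the names in the list are made up and are not taken from the PDF.

```python
import re

# Sample manufacturer names; all but the first are hypothetical.
manufacturers = [
    "Tandem Diabetes Care",
    "TANDEM devices",
    "Acme Medical",
    "tandem labs",
]

# re.IGNORECASE (short form: re.I) lets "tandem" match in any letter case.
pattern = re.compile(r"tandem", re.IGNORECASE)

# search() finds the word anywhere in the name, not only at the start.
matches = [name for name in manufacturers if pattern.search(name)]
print(matches)
```

Using search() rather than match() matters here, because match() only succeeds when the pattern occurs at the beginning of the string.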


Input file #

You can download the input file from here: data.pdf.

Solution #
A complete explanation of the Python code is beyond the scope of this course
(hint: learn the Python module pdfquery). It should be easy enough to
understand how we capture product_name from the PDF file using the
bounding-box selector LTTextLineHorizontal:in_bbox("40, 48, 181, 633"),
then iterate over the products, search each manufacturer name with a regex,
and print only the Tandem manufacturers.

import re

import pdfquery
from lxml import etree

PDF_FILE = 'data.pdf'

pdf = pdfquery.PDFQuery(PDF_FILE)
pdf.load()

product_info = []
page_count = len(pdf._pages)
for pg in range(page_count):
    data = pdf.extract([
        ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)),
        ('with_formatter', None),
        ('product_name', 'LTTextLineHorizontal:in_bbox("40, 48, 181, 633")'),
    ])

    for ix, pn in enumerate(sorted([d for d in data['product_name'] if d.text.strip()], key=l
        if ix % 2 == 0:
            product_info.append({'Manufacturer': pn.text.strip(), 'page': pg, 'y_start': floa
            if ix > 0:
                product_info[-2]['y_end'] = float(pn.get('y0'))+10.0
        else:
            product_info[-1]['Model'] = pn.text.strip()

pdf.file.close()

for p in product_info:
    s = p['Manufacturer']
    m = re.search(r"Tandem",s,re.I)
    if m:
        print('Manufacturer: {}[Model {}]\n'.format(p['Manufacturer'],p['Model']))

We have preloaded the data onto educative.io’s server, so you should be able
to run the code right away and get the following output:

Manufacturer: Tandem Diabetes Care[Model T:flex]


Manufacturer: Tandem Diabetes Care[Model T:slim]

From this result we can see that there are two models, T:flex and T:slim,
supplied by the manufacturer called ‘Tandem Diabetes Care’. The
solution has been adapted and simplified from one by the Reddit user
insainodwayno.
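
The pairing step of the solution can be tried without a PDF. The sketch below assumes that the text lines in the selected column alternate between a manufacturer and its model, and that sorting by descending vertical position gives top-to-bottom reading order (PDF y coordinates grow upward). The sort key, the coordinates, and the non-Tandem rows are assumptions made for illustration, and pair_lines is a hypothetical helper name, not part of pdfquery.

```python
import re

# Hypothetical (text, y0) pairs standing in for the text lines that
# pdfquery would return; y0 is the bottom edge of a line in PDF points.
lines = [
    ("T:slim", 500.0),
    ("Tandem Diabetes Care", 612.0),
    ("Acme Medical", 560.0),
    ("Tandem Diabetes Care", 512.0),
    ("T:flex", 600.0),
    ("Model X", 548.0),
]


def pair_lines(lines):
    """Pair alternating manufacturer/model lines in top-to-bottom order."""
    # A larger y0 means higher on the page, so sort in descending order.
    ordered = sorted(lines, key=lambda item: item[1], reverse=True)
    products = []
    for ix, (text, _y0) in enumerate(ordered):
        if ix % 2 == 0:
            # Even positions start a new product with the manufacturer name.
            products.append({"Manufacturer": text})
        else:
            # Odd positions complete the most recent product with its model.
            products[-1]["Model"] = text
    return products


products = pair_lines(lines)

# Keep only manufacturers whose name contains "tandem" in any letter case.
tandem = [p for p in products if re.search(r"tandem", p["Manufacturer"], re.I)]
for p in tandem:
    print("Manufacturer: {}[Model {}]".format(p["Manufacturer"], p["Model"]))
```

The approach depends on the strict manufacturer/model alternation: a single wrapped or missing line in the column would shift every later pair, so the result is worth checking against the PDF by eye.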
