22 Project 3 PDF Scraping in Python REGEX

This document discusses extracting product information from a PDF file using Python and regular expressions. It describes how to list all equipment models developed by manufacturers whose names contain "Tandem", using a PDF from diabetes.org. It provides the input file, explains how the pdfquery module selects PDF text with a bounding box, filters the result with a regular expression, and prints any matching models. The code extracts manufacturer and model data from the PDF and outputs the two Tandem Diabetes Care models found.

Uploaded by

ArvindSharma


Project 3: PDF scraping in Python + REGEX

In this project we use regex to extract a list of items from a PDF file.

WE'LL COVER THE FOLLOWING

• PDF scraping example:

• Input file
• Solution

PDF scraping example:

In this project we will use a PDF file (see the screenshot below) from the
diabetes.org website. Our goal is to list all the equipment models developed by
manufacturers whose names contain the word "tandem" (case insensitive).

Find all the product models by the manufacturer called `Tandem`
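
The case-insensitive match described above can be sketched on its own, before any PDF handling is involved. This is only an illustration: apart from "Tandem Diabetes Care", the manufacturer names in the list are hypothetical and are not taken from the PDF.

```python
import re

# Hypothetical manufacturer names; only "Tandem Diabetes Care" appears in the lesson.
manufacturers = [
    "Tandem Diabetes Care",
    "Acme Pumps",
    "TANDEM Labs",
    "Example Medical",
]

# re.I makes the pattern match regardless of letter case.
pattern = re.compile(r"tandem", re.I)

# search() finds the word anywhere in the name, not only at the start.
matches = [name for name in manufacturers if pattern.search(name)]
print(matches)
```

Using search rather than match matters here, because match would only succeed when the name begins with the word.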

Input file

You can download the input file from here: data.pdf.

Solution

A complete explanation of the Python code is beyond the scope of this course
(hint: learn the Python module pdfquery). It should be easy enough to follow
how we capture product_name from the PDF file using the bounding-box
selector LTTextLineHorizontal:in_bbox("40, 48, 181, 633"), then iterate over
the products, search each manufacturer name with a regex, and print only
the Tandem manufacturers.
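
To make the bounding-box idea concrete, here is a minimal pure-Python sketch of the containment test. It assumes, based on my reading of the pdfquery documentation, that in_bbox keeps only elements lying entirely inside the box given as "x0, y0, x1, y1". The function and the sample text-line boxes are hypothetical and are not pdfquery's own implementation.

```python
def in_bbox(element, box):
    """Return True if element (x0, y0, x1, y1) lies entirely inside box."""
    ex0, ey0, ex1, ey1 = element
    bx0, by0, bx1, by1 = box
    return ex0 >= bx0 and ey0 >= by0 and ex1 <= bx1 and ey1 <= by1


# The box used by the selector in the solution.
column = (40, 48, 181, 633)

# Hypothetical text-line boxes, for illustration only.
lines = {
    "inside column": (45, 600, 150, 612),
    "right of column": (200, 600, 320, 612),
    "straddles edge": (170, 500, 210, 512),
}

selected = [name for name, bbox in lines.items() if in_bbox(bbox, column)]
print(selected)
```

Under this assumption, a line that merely overlaps the box edge is not selected, which is how a narrow box can isolate a single column of a table.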

import re

import pdfquery
from lxml import etree

PDF_FILE = 'data.pdf'

# Load the PDF and parse its layout.
pdf = pdfquery.PDFQuery(PDF_FILE)
pdf.load()

product_info = []
page_count = len(pdf._pages)
for pg in range(page_count):
    # Collect the text lines inside the product-name column on this page.
    data = pdf.extract([
        ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)),
        ('with_formatter', None),
        ('product_name', 'LTTextLineHorizontal:in_bbox("40, 48, 181, 633")'),
    ])

    # NOTE: the next line is cut off in the source after "key=l".
    for ix, pn in enumerate(sorted([d for d in data['product_name'] if d.text.strip()], key=l
        # Even-indexed lines are manufacturers; odd-indexed lines are models.
        if ix % 2 == 0:
            # NOTE: the next line is cut off in the source after "floa".
            product_info.append({'Manufacturer': pn.text.strip(), 'page': pg, 'y_start': floa
            if ix > 0:
                product_info[-2]['y_end'] = float(pn.get('y0'))+10.0
        else:
            product_info[-1]['Model'] = pn.text.strip()

pdf.file.close()

# Print only the products whose manufacturer name contains "Tandem" (any case).
for p in product_info:
    s = p['Manufacturer']
    m = re.search(r"Tandem", s, re.I)
    if m:
        print('Manufacturer: {}[Model {}]\n'.format(p['Manufacturer'], p['Model']))

We have preloaded the data onto educative.io’s server, so you should be able
to run the code right away and get the following output:

Manufacturer: Tandem Diabetes Care[Model T:flex]

Manufacturer: Tandem Diabetes Care[Model T:slim]

From this result we can see that there are two models, T:flex and T:slim,
supplied by the manufacturer called ‘Tandem Diabetes Care’. The
solution has been adapted and simplified from one by the reddit user
insainodwayno.
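
The pairing step in the solution (even-indexed lines become manufacturers, odd-indexed lines become models) can be tried without a PDF. The sketch below is a simplified stand-in, not the original code: pair_lines is a hypothetical helper, the sample lines and coordinates are made up, and sorting from the top of the page downwards is an assumption, since the sort key is cut off in the listing.

```python
import re


def pair_lines(lines):
    """Pair alternating text lines into manufacturer/model records.

    `lines` is a list of (text, y0) tuples; a larger y0 is assumed to be
    higher on the page.
    """
    # Assumption: read the column top to bottom (descending y0).
    ordered = sorted(lines, key=lambda item: item[1], reverse=True)
    records = []
    for ix, (text, _y0) in enumerate(ordered):
        if ix % 2 == 0:
            # Even index: start a new record with the manufacturer name.
            records.append({"Manufacturer": text.strip()})
        else:
            # Odd index: attach the model to the most recent record.
            records[-1]["Model"] = text.strip()
    return records


# Hypothetical lines in arbitrary order, as (text, y0) pairs.
sample = [
    ("T:slim", 480.0),
    ("Tandem Diabetes Care", 500.0),
    ("Model X", 580.0),
    ("Acme Pumps", 600.0),
]

records = pair_lines(sample)
tandem = [r for r in records if re.search(r"Tandem", r["Manufacturer"], re.I)]
for r in tandem:
    print("Manufacturer: {}[Model {}]".format(r["Manufacturer"], r["Model"]))
```

This pairing only works if every manufacturer line is followed by exactly one model line; a blank or wrapped line in the column would shift all later pairs, which is why the solution filters out empty text first.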
