Python Script For PDF - Reading

The document outlines a Python class, PDFReader, designed to extract and organize text from PDF files into chapters and topics. It includes methods for extracting text, storing chapters and topics, sorting them, and answering questions based on keyword searches. The example usage demonstrates how to instantiate the class, process a PDF, and query for specific information.
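To illustrate the chapter and topic organization described above without needing a PDF file, here is a minimal sketch that applies the script's two regular expressions to an in-memory string. The sample text and the helper name `split_chapters` are hypothetical and exist only for this illustration. Real PDF output will likely need different patterns.

```python
import re
from collections import defaultdict

# Hypothetical sample text standing in for text extracted from a PDF.
SAMPLE_TEXT = (
    "Preface\n"
    "Chapter 1: Safety\n"
    "1. Lifting devices\n"
    "2. Inspection intervals\n"
    "Chapter 2: Maintenance\n"
    "1. Lubrication\n"
    "2. Calibration"
)

# Same patterns as in the script; adjust them to the structure of your PDF.
CHAPTER_PATTERN = r'Chapter \d+:'
TOPIC_PATTERN = r'\d+\.\s+(.*?)(?=\n\d+\.\s+|$)'


def split_chapters(text):
    """Return {chapter label: [topic titles]} (hypothetical helper)."""
    topics = defaultdict(list)
    parts = re.split(CHAPTER_PATTERN, text)
    # parts[0] is whatever precedes the first chapter heading, so skip it.
    for i, part in enumerate(parts[1:], start=1):
        for title in re.findall(TOPIC_PATTERN, part):
            topics[f'Chapter {i}'].append(title.strip())
    return dict(topics)


chapter_topics = split_chapters(SAMPLE_TEXT)
for chapter, titles in chapter_topics.items():
    print(chapter, titles)
```

Note that the chapter labels come from the position of each split segment, not from the number printed in the heading, so a PDF whose chapters are numbered out of order would be relabeled.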

Uploaded by

Anonymous YBAHVQ


import PyPDF2
import re
from collections import defaultdict


class PDFReader:
    def __init__(self, file_path):
        self.file_path = file_path
        self.chapters = defaultdict(list)
        self.topics = defaultdict(list)

    def extract_text(self):
        with open(self.file_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ''
            for page in reader.pages:
                text += page.extract_text() + '\n'
        return text

    def store_chapters(self, text):
        # Example regex patterns, adjust according to your PDF structure
        chapter_pattern = r'Chapter \d+:'
        # Adjust according to topics structure
        topic_pattern = r'\d+\.\s+(.*?)(?=\n\d+\.\s+|$)'

        chapters = re.split(chapter_pattern, text)

        for i, chapter in enumerate(chapters):
            if i == 0:  # Skip the introduction or non-chapter content
                continue
            self.chapters[f'Chapter {i}'] = chapter.strip()
            topics = re.findall(topic_pattern, chapter)
            for topic in topics:
                self.topics[f'Chapter {i}'].append(topic.strip())

    def sort_data(self):
        # Sort topics within each chapter
        for chapter in self.topics:
            self.topics[chapter].sort()

    def answer_question(self, question):
        # Simple keyword search in stored data
        response = []
        for chapter, topics in self.topics.items():
            for topic in topics:
                if re.search(re.escape(question), topic, re.IGNORECASE):
                    response.append(f'Found in {chapter}: {topic}')
        return response if response else ["No relevant information found."]

    def process_pdf(self):
        text = self.extract_text()
        self.store_chapters(text)
        self.sort_data()


# Example usage
if __name__ == "__main__":
    pdf_reader = PDFReader('C:/Videos/363007BUSC01282_CDFE02_117.pdf')
    pdf_reader.process_pdf()

    # Example question
    question = 'lifting devices need to be re-certified'
    answers = pdf_reader.answer_question(question)
    for answer in answers:
        print(answer)
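
The `answer_question` method escapes the whole question and searches for it as one literal phrase, so a topic matches only if it contains the full question text. As an alternative, the sketch below splits the question into individual keywords and ranks topics by how many keywords they share. This is not the script's method: the function name `keyword_search`, the stop-word list, and the sample data are all assumptions made for illustration.

```python
import re

# Hypothetical stop-word list; a real application would likely need a larger one.
STOP_WORDS = {'a', 'an', 'the', 'to', 'be', 'is', 'are', 'of', 'in', 'need'}


def tokenize(text):
    """Lowercase the text and return its alphanumeric words as a set."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))


def keyword_search(question, topics_by_chapter):
    """Rank topics by the number of question keywords they contain."""
    keywords = tokenize(question) - STOP_WORDS
    hits = []
    for chapter, topics in topics_by_chapter.items():
        for topic in topics:
            score = len(keywords & tokenize(topic))
            if score:
                hits.append((score, chapter, topic))
    # Highest score first; the stable sort keeps ties in their original order.
    hits.sort(key=lambda hit: -hit[0])
    if not hits:
        return ["No relevant information found."]
    return [f'Found in {chapter}: {topic}' for _, chapter, topic in hits]


# Hypothetical data in the same shape as PDFReader.topics.
sample_topics = {
    'Chapter 1': ['Lifting devices', 'Inspection intervals'],
    'Chapter 2': ['Re-certified equipment log'],
}
results = keyword_search('lifting devices need to be re-certified', sample_topics)
for line in results:
    print(line)
```

With the original phrase search, the same question would match neither sample topic, because neither contains the full phrase.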
