Final_Code_for_Markup

The document outlines a Python script that uses PyMuPDF to search PDF files for specific texts and highlight every match. It reads the search terms from the first column of a CSV file, processes each PDF named in the first column of an Excel spreadsheet, and saves the highlighted copies to an output folder. PyPDF2 is installed and imported as well, but the code shown does not call it.

Uploaded by Suresh Kadam
Copyright © All Rights Reserved

pip install PyMuPDF

pip install PyPDF2

pip install pandas openpyxl

###FINAL
import fitz # PyMuPDF
from PyPDF2 import PdfReader, PdfWriter  # imported, but not used by the code below
import csv
import pandas as pd

def read_search_texts_from_csv(csv_path):
    # Return the first column of every non-empty row as a search term
    with open(csv_path, 'r', newline='') as csvfile:
        reader = csv.reader(csvfile)
        return [row[0] for row in reader if row]
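
The reader above takes the first cell of each row exactly as written. As a hedged sketch, assuming the CSV may contain stray whitespace, blank lines, or repeated terms, the list could be normalized before searching. `clean_search_terms` is a hypothetical helper and is not part of the original script.

```python
import csv
import io


def clean_search_terms(lines):
    """Return stripped, de-duplicated first-column terms in first-seen order.

    `lines` is any iterable of CSV text lines (an open file works too).
    """
    seen = set()
    terms = []
    for row in csv.reader(lines):
        if not row:
            continue  # completely blank line
        term = row[0].strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


# Illustrative input: one blank line, one padded cell, one repeated term
sample = io.StringIO("inflation\n\n  GDP growth ,ignored\ninflation\n")
print(clean_search_terms(sample))
```

Removing duplicates matters here because a repeated term would otherwise be searched again and place a second highlight annotation on top of the first.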

def my_search_and_highlight(pdf_path, search_texts, output_path):
    # Open the PDF with PyMuPDF
    doc = fitz.open(pdf_path)

    for page in doc:
        # Search for each text on the page
        for search_text in search_texts:
            text_instances = page.search_for(search_text)

            # Highlight the text on the page
            for text_rect in text_instances:
                highlight = page.add_highlight_annot(text_rect)
                highlight.update()

    # Save the modified PDF, then release the file handle
    doc.save(output_path, garbage=4, deflate=True, clean=True)
    doc.close()
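
The function above highlights matches but does not report how many it found. A minimal sketch of one way to add that, assuming only the two page methods the script already uses (`search_for` and `add_highlight_annot`); the names `highlight_terms` and `highlight_document` are hypothetical:

```python
from collections import Counter


def highlight_terms(page, search_texts):
    """Highlight every match on one page and return a Counter of hits per term.

    `page` is expected to behave like a PyMuPDF page: it must provide
    search_for(text) and add_highlight_annot(rect), as used in the script.
    """
    hits = Counter()
    for term in search_texts:
        for rect in page.search_for(term):
            annot = page.add_highlight_annot(rect)
            annot.update()
            hits[term] += 1
    return hits


def highlight_document(doc, search_texts):
    """Sum the per-page counts over an iterable of pages, such as an open document."""
    total = Counter()
    for page in doc:
        total.update(highlight_terms(page, search_texts))
    return total
```

With PyMuPDF this could be used as `doc = fitz.open(pdf_path)`, then `counts = highlight_document(doc, search_texts)`, followed by the same `doc.save(...)` call as in the script. A term that never matched is simply absent from the result, so `counts[term]` reads as 0 for it.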

if __name__ == "__main__":
    # Replace these variables with your actual file paths and search texts
    input_pdf_loc = "/Users/sureshkadam/Documents/MY DATA/PYHTON-WINDOWS/PYTHON/HTMLPARSER/Htmlparser/ETPDF/"
    output_pdf_loc = "/Users/sureshkadam/Documents/MY DATA/PHD/DATA/MARKUP_PDF/"
    csv_path = "search_texts.csv"

    # Excel sheet whose first column lists the PDF file names to process
    excel_file_path = "/Users/sureshkadam/Documents/MY DATA/PHD/DATA/MARKUP_PDF/filestoupdate.xlsx"

    df = pd.read_excel(excel_file_path)

    # Read search texts from CSV
    search_texts = read_search_texts_from_csv(csv_path)

    filenames = df.iloc[:, 0]
    for i, values in enumerate(filenames, start=1):
        input_pdf_path = input_pdf_loc + values
        output_pdf_path = output_pdf_loc + values
        my_search_and_highlight(input_pdf_path, search_texts, output_pdf_path)
        print(str(i) + " - " + values)
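
The loop builds each path by string concatenation, which works only because both folder strings end in "/". It also assumes every spreadsheet cell is a string, whereas pandas reads an empty cell as NaN (a float), and adding that to a string raises a TypeError. As a hedged sketch, the paths could instead be built with `pathlib`, skipping unusable cells. `build_path_pairs` is a hypothetical helper, and the folder names in the demo are placeholders.

```python
from pathlib import Path


def build_path_pairs(filenames, input_dir, output_dir):
    """Return (input_path, output_path) pairs for each usable file name.

    Non-string cells (for example NaN from an empty spreadsheet row) and
    blank names are skipped instead of raising a TypeError.
    """
    pairs = []
    for name in filenames:
        if not isinstance(name, str) or not name.strip():
            continue
        name = name.strip()
        pairs.append((Path(input_dir) / name, Path(output_dir) / name))
    return pairs


# Illustrative input: the NaN entry stands in for an empty spreadsheet cell
for src, dst in build_path_pairs(["a.pdf", float("nan"), " b.pdf "], "in", "out"):
    print(src, "->", dst)
```

In the script, `filenames` would be `df.iloc[:, 0]`, and each pair could be passed on as `my_search_and_highlight(str(src), search_texts, str(dst))`.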
