
Modules Available in Python for Converting PDF to Text
Python offers several libraries for converting PDF documents to plain text. PyPDF2, PDFMiner, and PyMuPDF are three popular modules for extracting text from PDFs.
Some of the common approaches (modules) for converting PDF to text are as follows:
Using 'PyPDF2' Module
PyPDF2 is a versatile library used for manipulating PDF files, focusing on functions such as merging, splitting, rotating pages, and extracting text. It offers a simple approach for performing basic PDF operations.
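To extract text with PyPDF2, create a PdfReader object, access the pages through its pages attribute, and call extract_text() on each page. Older releases exposed this functionality as getPage() and extractText(), which recent versions have deprecated in favor of the names above.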
If you need to pull text out of an existing PDF that already has a text layer, PyPDF2 is a reasonable first choice because it is written in pure Python and is simple to use.
Installation of PyPDF2
Launch the Command Prompt on your system and enter the following pip command to begin the installation of the library.
pip install PyPDF2
The steps involved in converting the PDF into text using the PyPDF2 module are as follows:
- Open the PDF file in binary mode.
- Create a PdfReader object to navigate the structure of the PDF.
- Iterate through each page and use the extract_text() method to collect the text.
- Write the gathered text to a new text file for easier access and readability.
Example
The following code demonstrates how to use the PyPDF2 library to convert a PDF file into text. It defines a function, `convert_pdf_to_text`, which opens a PDF file, reads each page, extracts the text, and writes the extracted content to a designated text file.
import PyPDF2

def convert_pdf_to_text(input_pdf_path, output_text_file):
    # Open the PDF file in binary read mode
    with open(input_pdf_path, "rb") as pdf_input:
        # Create a PdfReader object to read the PDF
        pdf_reader = PyPDF2.PdfReader(pdf_input)
        # Initialize a string to hold the extracted text
        extracted_text = ""
        # Iterate through each page in the PDF
        for page_index in range(len(pdf_reader.pages)):
            # Get the current page
            page = pdf_reader.pages[page_index]
            # Extract text from the page and add it to the extracted_text
            extracted_text += page.extract_text()
    # Save the extracted text to a text file
    with open(output_text_file, "w", encoding="utf-8") as text_output:
        text_output.write(extracted_text)

if __name__ == "__main__":
    input_pdf_path = "Text.pdf"
    output_text_file = "Text.txt"
    convert_pdf_to_text(input_pdf_path, output_text_file)
    print("PDF has been successfully converted to text!")
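Following is the output for the above code:
PDF has been successfully converted to text!
The generated Text.txt file contains the text extracted from the PDF, for example:
Hello, this is the text inside the file.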
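Pages that consist only of scanned images have no text layer, so extract_text() may return nothing for them. The sketch below shows one possible way to make such pages visible in the output. The helper names join_page_texts and pdf_to_labelled_text and the placeholder string are illustrative assumptions, not part of PyPDF2.

```python
def join_page_texts(page_texts):
    """Combine per-page strings, labelling each page and tolerating empty pages."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Treat an empty or missing result as "no text" (assumed to mean an image-only page)
        body = (text or "").strip() or "[no extractable text]"
        parts.append(f"--- Page {number} ---\n{body}")
    return "\n\n".join(parts)


def pdf_to_labelled_text(input_pdf_path):
    # Imported here so join_page_texts can be reused without PyPDF2 installed
    import PyPDF2

    with open(input_pdf_path, "rb") as pdf_input:
        reader = PyPDF2.PdfReader(pdf_input)
        # The pages are consumed while the file is still open
        return join_page_texts(page.extract_text() for page in reader.pages)
```

Calling pdf_to_labelled_text("Text.pdf") would return a single string with one labelled section per page, which can then be written to a file as in the example above.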
Using 'PDFMiner' Module
PDFMiner is a text extraction tool for PDF documents. It can determine exactly where text is located on a page, gather layout details such as fonts, and convert PDFs into other formats such as HTML or XML.
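Installation of PDFMiner
The actively maintained version of PDFMiner is published on PyPI as pdfminer.six. It provides the pdfminer.high_level module used in the example in this section, and can be installed with:
pip install pdfminer.six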
Some of the additional features provided by the PDFMiner tool are as follows:
- Automatic Layout Analysis: The tool can automatically analyze the layout of the PDF file.
- Outline Extraction: It can extract the table of contents (TOC) from the PDF.
- Basic Encryption Support: It can handle basic encryption types, including RC4 and AES.
- CJK Languages and Vertical Scripts: It can process CJK (Chinese, Japanese, Korean) languages and supports vertical writing scripts.
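As a rough illustration of the layout analysis and vertical-script features listed above, extract_text() accepts a laparams argument and a page_numbers argument. The sketch assumes the pdfminer.six package is installed; the helper names to_zero_based and extract_selected_pages are hypothetical and exist only for this example.

```python
def to_zero_based(page_numbers):
    """Convert 1-based page numbers to the 0-based indices that extract_text expects."""
    return [number - 1 for number in page_numbers]


def extract_selected_pages(pdf_path, pages, detect_vertical=False):
    # Imported here so to_zero_based can be used without pdfminer.six installed
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # LAParams configures the automatic layout analysis; detect_vertical
    # asks the analyzer to also look for vertically written text
    layout = LAParams(detect_vertical=detect_vertical)
    return extract_text(
        pdf_path,
        page_numbers=to_zero_based(pages),
        laparams=layout,
    )
```

For instance, extract_selected_pages("article.pdf", [1, 3], detect_vertical=True) would return the text of the first and third pages only, where "article.pdf" is a placeholder path.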
Example
The following code extracts the text from a PDF file using PDFMiner. The extract_text() function from the pdfminer.high_level module reads the file and returns its text as a single string.
from pdfminer.high_level import extract_text

# Specify the path to your PDF file
pdf_file_path = 'Users/Tutorispoint/Downloads/article.pdf'

# Extract text from the PDF
text = extract_text(pdf_file_path)

# Print the extracted text
print(text)
Following is the output for the above code:
Hello This is the text inside the file. This is an another line.
Using 'PyMuPDF' Module
PyMuPDF, which is traditionally imported under the name fitz, is a high-performance Python library for extracting, analyzing, converting, and manipulating PDFs and other document types.
One of its key features is the ability to render the document types it supports. Rendering means creating an image (such as PNG) from each page of a document at a specified DPI resolution. This capability is essential for displaying documents in a graphical user interface (GUI) window.
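Rendering can be sketched as follows. PDF page sizes are measured in points (72 per inch), so a target DPI translates into a zoom factor of dpi / 72, which is passed to get_pixmap() as a fitz.Matrix. The function names zoom_for_dpi and render_pages_to_png, the default of 150 DPI, and the output file naming are assumptions made for this example.

```python
from pathlib import Path

PDF_BASE_DPI = 72  # PDF coordinates use points, 72 per inch


def zoom_for_dpi(dpi):
    """Return the scale factor that renders a page at the requested DPI."""
    return dpi / PDF_BASE_DPI


def render_pages_to_png(pdf_path, output_dir, dpi=150):
    # Imported here so zoom_for_dpi can be used without PyMuPDF installed
    import fitz

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale equally in x and y

    written = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Rasterize the page and save it as page_1.png, page_2.png, ...
            pixmap = page.get_pixmap(matrix=matrix)
            target = out_dir / f"page_{page.number + 1}.png"
            pixmap.save(str(target))
            written.append(target)
    return written
```

The returned list of paths can be handed to a GUI toolkit or image viewer to display the rendered pages.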
Installation of PyMuPDF
Install PyMuPDF using pip:
pip install PyMuPDF
The steps involved in extracting and converting PDF to text using PyMuPDF are as follows:
- Open the PDF file using fitz.open(), which provides access to the document.
- Loop through each page, using the get_text() method to extract text efficiently.
- Print or save the extracted text, ensuring ease of access to the information contained in the PDF.
- Close the PDF document to release resources after processing.
Example
In the following example, fitz.open() opens the specified PDF file, and the text of each page is extracted using the get_text() method and printed under a page header.
import fitz  # PyMuPDF library

# Open the PDF file
pdf_document = "path_to_pdf_file.pdf"  # Replace with the actual PDF file path
doc = fitz.open(pdf_document)

# Iterate through all pages in the PDF
for page_number in range(len(doc)):
    # Select the page
    page = doc[page_number]
    # Extract text from the page
    text = page.get_text()
    # Print the extracted text
    print(f"--- Page {page_number + 1} ---")
    print(text)

# Close the document
doc.close()
When the above code is executed, the output will look like this:
--- Page 1 ---
Welcome to PyMuPDF!
This library is great for working with PDFs.
--- Page 2 ---
You can extract text, images, and more.
Enjoy using PyMuPDF!
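Besides plain text, get_text() can also return the page as a list of blocks when called with "blocks". Each block is a tuple of the form (x0, y0, x1, y1, text, block_no, block_type), where a block_type of 0 marks a text block. The sketch below orders the text blocks from top to bottom before joining them, which is one possible approach for pages whose text is not stored in reading order. The helper names blocks_to_text and page_texts_by_block are hypothetical.

```python
def blocks_to_text(blocks):
    """Join text blocks in top-to-bottom, left-to-right order, skipping image blocks."""
    text_blocks = [block for block in blocks if block[6] == 0]
    # Sort by the top edge (y0) first, then by the left edge (x0)
    text_blocks.sort(key=lambda block: (block[1], block[0]))
    return "\n".join(block[4].strip() for block in text_blocks)


def page_texts_by_block(pdf_path):
    # Imported here so blocks_to_text can be used without PyMuPDF installed
    import fitz

    # The with statement closes the document automatically
    with fitz.open(pdf_path) as doc:
        return [blocks_to_text(page.get_text("blocks")) for page in doc]
```

This simple ordering is only a heuristic: it does not handle multi-column layouts, where blocks in different columns share similar vertical positions.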