Modules Available in Python for Converting PDF to Text



Python offers several libraries for converting PDF documents to plain text. PyPDF2, PDFMiner, and PyMuPDF are three popular modules for extracting text from PDFs.

Some common approaches (modules) for converting PDF to text are as follows:

Using 'PyPDF2' Module

PyPDF2 is a versatile library used for manipulating PDF files, focusing on functions such as merging, splitting, rotating pages, and extracting text. It offers a simple approach for performing basic PDF operations.

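To extract text with PyPDF2, create a PdfReader object, access the pages through its pages attribute, and call extract_text() on each page. Older releases used getPage() and extractText() for the same purpose; those names were deprecated and then removed in PyPDF2 3.0.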

PyPDF2 is a reasonable choice when you need to parse existing PDF files that contain a real text layer, since it handles text set in a variety of fonts. It does not perform OCR, so scanned pages that consist only of images yield little or no text.

Installation of PyPDF2

Open a command prompt or terminal and run the following pip command to install the library.

pip install PyPDF2

The steps involved in converting a PDF into text using the PyPDF2 module are as follows:

  • Open the PDF file in binary mode.

  • Create a PdfReader object to navigate the structure of the PDF.

  • Iterate through each page and use the extract_text method to collect the text.

  • Write the gathered text to a new text file for easier access and readability.

Example

The following code demonstrates how to use the PyPDF2 library to convert a PDF file into text. It defines a function, `convert_pdf_to_text`, which opens a PDF file, reads each page, extracts the text, and writes the extracted content to a designated text file.

import PyPDF2

def convert_pdf_to_text(input_pdf_path, output_text_file):
    # Open the PDF file in binary read mode
    with open(input_pdf_path, "rb") as pdf_input:
        # Create a PdfReader object to read the PDF
        pdf_reader = PyPDF2.PdfReader(pdf_input)

        # Initialize a string to hold the extracted text
        extracted_text = ""

        # Iterate through each page in the PDF
        for page in pdf_reader.pages:
            # Extract text from the page and add it to extracted_text
            # (fall back to an empty string if no text is returned)
            extracted_text += page.extract_text() or ""

    # Save the extracted text to a text file
    with open(output_text_file, "w", encoding="utf-8") as text_output:
        text_output.write(extracted_text)
        
if __name__ == "__main__":
    input_pdf_path = "Text.pdf"
    output_text_file = "Text.txt"

    convert_pdf_to_text(input_pdf_path, output_text_file)

    print("PDF has been successfully converted to text!")

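When the above code is executed, it prints the following message:

PDF has been successfully converted to text!

The extracted text is written to Text.txt: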

Hello, this is the text inside the file.
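The function above appends page text back to back, so the last line of one page runs into the first line of the next. The following is a minimal sketch of one way to separate pages and skip empty ones. The helper names join_page_texts and pdf_to_text are hypothetical, and the sketch assumes a PyPDF2 release that provides PdfReader.

```python
def join_page_texts(page_texts, separator="\n\n"):
    """Join per-page strings, skipping pages that are None or only whitespace."""
    cleaned = [(text or "").strip() for text in page_texts]
    return separator.join(text for text in cleaned if text)


def pdf_to_text(input_pdf_path):
    """Return the text of every page in the PDF, separated by blank lines."""
    # Imported here so join_page_texts can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(input_pdf_path, "rb") as pdf_input:
        reader = PdfReader(pdf_input)
        # The generator is fully consumed inside join_page_texts,
        # while the file is still open
        return join_page_texts(page.extract_text() for page in reader.pages)
```

Keeping the joining logic in its own function makes it easy to check with plain strings, without needing a PDF file.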

Using 'PDFMiner' Module

PDFMiner is a text extraction tool for PDF documents. It can determine where text is located on the page, gather layout details such as fonts, and convert PDFs into other formats such as HTML or XML.

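Installation of PDFMiner

The actively maintained version of PDFMiner is published on PyPI as pdfminer.six, which provides the pdfminer.high_level module used in this section.

pip install pdfminer.six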

Some additional features provided by the PDFMiner tool are as follows:

  • Automatic Layout Analysis: The tool can automatically analyze the layout of the PDF file.

  • Outline Extraction: It can extract the table of contents (TOC) from the PDF.

  • Basic Encryption Support: It can handle basic encryption types, including RC4 and AES.

  • CJK Languages and Vertical Scripts: It can process CJK (Chinese, Japanese, Korean) text and supports vertical writing.

Example

The following code extracts the text from a PDF file using PDFMiner. The extract_text() function from the pdfminer.high_level module reads the whole document and returns its text as a single string.

from pdfminer.high_level import extract_text  

# Specify the path to your PDF file  
pdf_file_path = 'Users/Tutorispoint/Downloads/article.pdf'  

# Extract text from the PDF  
text = extract_text(pdf_file_path)  

# Print the extracted text  
print(text)

Following is the output of the above code:

Hello
This is the text inside the file.
This is an another line.
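The layout analysis listed among the features can also be used directly. The following is a rough sketch, assuming pdfminer.six is installed, that collects each text block together with its bounding box using extract_pages() and LTTextContainer. The helper names sort_top_to_bottom and text_blocks are hypothetical and not part of the library.

```python
def sort_top_to_bottom(blocks):
    """Order (x0, y0, x1, y1, text) tuples top-to-bottom, then left-to-right.

    PDF coordinates start at the bottom-left corner of the page,
    so a larger y1 means the block sits higher on the page.
    """
    return sorted(blocks, key=lambda block: (-block[3], block[0]))


def text_blocks(pdf_file_path):
    """Yield (page_number, blocks) for each page of the PDF."""
    # Imported lazily so the sorting helper works without pdfminer.six installed
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_file_path), start=1):
        blocks = [
            (*element.bbox, element.get_text().strip())
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        yield page_number, sort_top_to_bottom(blocks)
```

Sorting by the top edge is only a simple heuristic. Multi-column pages generally need more careful handling.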

Using 'PyMuPDF' Module

PyMuPDF, which is imported under the name fitz, is a high-performance Python library for extracting, analyzing, converting, and manipulating PDFs and other document types.

One of its key features is the ability to render pages of any document type it supports. Rendering means creating an image (such as a PNG) from each page of a document at a specified DPI resolution. This capability is essential for displaying documents in a graphical user interface (GUI) window.
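As an illustration of rendering, the sketch below saves each page as a PNG file at a chosen resolution. It assumes PyMuPDF is installed and relies on the PDF convention of 72 points per inch to turn a DPI value into a zoom factor. The helper names zoom_for_dpi and render_pages_to_png, and the output file naming, are hypothetical.

```python
def zoom_for_dpi(dpi, base_dpi=72):
    """Return the scale factor needed to render at the requested DPI."""
    return dpi / base_dpi


def render_pages_to_png(pdf_path, output_prefix, dpi=150):
    """Render every page of the document to '<output_prefix>-<n>.png'."""
    # Imported lazily so zoom_for_dpi works without PyMuPDF installed
    import fitz  # PyMuPDF

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale both axes equally

    with fitz.open(pdf_path) as doc:
        for page in doc:
            pixmap = page.get_pixmap(matrix=matrix)
            # page.number is zero-based, so add 1 for a readable file name
            pixmap.save(f"{output_prefix}-{page.number + 1}.png")
```

For example, render_pages_to_png("path_to_pdf_file.pdf", "page", dpi=144) would render each page at twice the default size.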

Installation of PyMuPDF

PyMuPDF can be installed using pip:

pip install PyMuPDF

The steps involved in extracting and converting PDF to text using PyMuPDF are as follows:

  • Open the PDF file using fitz.open(), which provides access to the document.

  • Loop through each page, using the get_text() method to extract text efficiently.

  • Print or save the extracted text, ensuring ease of access to the information contained in the PDF.

  • Close the PDF document to release resources after processing.
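The steps above can be sketched as a small function that saves the text to a file instead of printing it. This is only one possible arrangement, assuming PyMuPDF is installed. The names format_pages and save_pdf_text are hypothetical helpers, not part of the library.

```python
def format_pages(page_texts):
    """Prefix each page's text with a '--- Page N ---' header."""
    sections = []
    for page_number, text in enumerate(page_texts, start=1):
        sections.append(f"--- Page {page_number} ---\n{text.strip()}")
    return "\n\n".join(sections)


def save_pdf_text(pdf_path, output_path):
    """Extract the text of every page and write it to output_path."""
    # Imported lazily so format_pages works without PyMuPDF installed
    import fitz  # PyMuPDF

    # Using the document as a context manager closes it
    # even if extraction raises an error
    with fitz.open(pdf_path) as doc:
        page_texts = [page.get_text() for page in doc]

    with open(output_path, "w", encoding="utf-8") as text_output:
        text_output.write(format_pages(page_texts))
```

The with statement takes care of the final step in the list, so no explicit close() call is needed.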

Example

The following example opens a specified PDF file with fitz.open(), extracts the text of each page with the get_text() method, and prints it under a page header.

import fitz  # PyMuPDF library

# Open the PDF file
pdf_document = "path_to_pdf_file.pdf"  # Replace with the actual PDF file path
doc = fitz.open(pdf_document)

# Iterate through all pages in the PDF
for page_number in range(len(doc)):
    # Select the page
    page = doc[page_number]
    
    # Extract text from the page
    text = page.get_text()
    
    # Print the extracted text
    print(f"--- Page {page_number + 1} ---")
    print(text)

# Close the document
doc.close()

When the above code is executed, the output will look like this:

--- Page 1 ---
Welcome to PyMuPDF!
This library is great for working with PDFs.

--- Page 2 ---
You can extract text, images, and more.
Enjoy using PyMuPDF!

Updated on: 2025-01-07T11:24:38+05:30
