Exporting PDF Data using Python

Last Updated : 10 May, 2020

Sometimes we need to extract data from a PDF, and copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. In this article, we will learn how to extract data from PDFs.

Extracting Text With PDFMiner

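PDFMiner is a text extraction tool for PDF documents. Its community-maintained fork, pdfminer.six, is imported under the same pdfminer name, so the code in this article applies to either. You can install PDFMiner with pip: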

pip install pdfminer
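To confirm that the installation worked before running the examples, a quick check along these lines can help. This is a minimal sketch that uses only the standard library; the helper name is_installed is hypothetical and not part of PDFMiner.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # 'pdfminer' is the import name used by the code in this article
    print('pdfminer available:', is_installed('pdfminer'))
```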

Let's start by extracting all the text of a PDF, page by page. This takes the following steps:

Create a resource manager instance.
Create a file-like object via Python’s io module.
Create a converter.
Create a PDF interpreter object that takes the resource manager and converter objects and extracts the text.
Open the PDF and loop through each page.
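The file-like object in the second step is an in-memory text buffer that the converter writes into, in place of a file on disk. As a small illustration of how such a buffer behaves on its own (the strings are made-up sample data, and no PDFMiner calls are involved):

```python
import io

# An in-memory text buffer: it supports write() like a file opened in text mode
buffer = io.StringIO()
buffer.write('first line\n')
buffer.write('second line\n')

# getvalue() returns everything written so far as a single string;
# it must be called before the buffer is closed
contents = buffer.getvalue()
buffer.close()

print(contents)
```

Reading the value before closing matters, because calling getvalue() on a closed buffer raises a ValueError.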

Below is the implementation. The code reads a sample PDF named GFG.pdf.


import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


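def extract_text_by_page(pdf_path):
    """Yield the extracted text of each page in the PDF, one page at a time."""
    with open(pdf_path, 'rb') as fh:
        # The resource manager stores shared resources such as fonts,
        # so a single instance can serve every page
        resource_manager = PDFResourceManager()

        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            # In-memory text buffer that receives the converter's output
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager,
                                      fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)

            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()

            # Close open handles before yielding, so they are released
            # even if the caller stops iterating early
            converter.close()
            fake_file_handle.close()

            yield text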
            
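def extract_text(pdf_path):
    """Print the text of every page, followed by a blank line."""
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()

# Driver code
if __name__ == '__main__':
    # extract_text prints as it goes and returns None,
    # so call it directly instead of printing its result
    extract_text('GFG.pdf')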
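Once the text is available page by page, exporting it to another format is mostly a matter of serialization. The following is a minimal sketch of a JSON export, not a definitive implementation: the helper name export_pages_as_json, the output file name GFG.json, and the JSON layout are all assumptions made for illustration. It accepts any iterable of page strings, so the generator returned by extract_text_by_page can be passed in directly.

```python
import json


def export_pages_as_json(pages, json_path):
    """Write an iterable of page texts to a JSON file and return the data."""
    data = {
        'pages': [
            {'number': number, 'text': text}
            for number, text in enumerate(pages, start=1)
        ]
    }
    with open(json_path, 'w', encoding='utf-8') as fh:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(data, fh, ensure_ascii=False, indent=2)
    return data


if __name__ == '__main__':
    # Stand-in strings; in practice pass extract_text_by_page('GFG.pdf')
    sample_pages = ['Text of page one', 'Text of page two']
    export_pages_as_json(sample_pages, 'GFG.json')
```

Keeping the export function independent of PDFMiner means the same helper works for text obtained from any source.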