
Exporting PDF Data using Python

Last Updated : 10 May, 2020
Sometimes we need to extract data from a PDF. Copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. This article shows how to extract text from PDFs.

Extracting Text With PDFMiner

PDFMiner is a text extraction tool for PDF documents. The maintained community fork is published on PyPI as pdfminer.six and uses the same pdfminer import name. You can install PDFMiner with pip:
pip install pdfminer
Let's start by extracting all the text of a PDF, page by page. This requires the following steps:
  • create a resource manager instance.
  • create a file-like object via Python’s io module.
  • create a converter.
  • create a PDF interpreter object that will take our resource manager and converter objects and extract the text.
  • open the PDF and loop through each page.
Below is the implementation in Python 3. The PDF file used is GFG.pdf.
import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):

    with open(pdf_path, 'rb') as fh:
        
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            
            converter = TextConverter(resource_manager, 
                                      fake_file_handle)
            
            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)
            
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            
            # close open handles before yielding, so they are released
            # even if the caller stops iterating early
            converter.close()
            fake_file_handle.close()
            
            yield text
            
def extract_text(pdf_path):
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()
        
# Driver code
if __name__ == '__main__':
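    # extract_text prints each page itself and returns None,
    # so it is called directly rather than wrapped in print()
    extract_text('GFG.pdf')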
Output: the text of each page of GFG.pdf, printed one page at a time. In this example, extract_text_by_page is a generator function that yields the text of each page, and extract_text prints the text of each page followed by a blank line.
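The introduction mentions exporting the extracted data in a different format. As a minimal sketch of that step, the function below writes page texts to a JSON file using only the standard library. The name export_pages_to_json and the record layout (page number plus text) are assumptions made for illustration and are not part of PDFMiner. It accepts any iterable of strings, so the generator extract_text_by_page from the example above could be passed in directly.

```python
import json


def export_pages_to_json(pages, json_path):
    """Write page texts to a JSON file as a list of records.

    `pages` is any iterable of strings, one string per page.
    The function name and the record layout are illustrative choices.
    """
    # Number pages from 1 so the records match how PDF pages are counted
    records = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]

    # ensure_ascii=False keeps non-ASCII characters readable in the file
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)

    return records
```

With PDFMiner installed, a call such as export_pages_to_json(extract_text_by_page('GFG.pdf'), 'GFG.json') would write one record per page. The output file name GFG.json is only an example.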
