Extract text from PDF File using Python
Last Updated: 09 Aug, 2024
PDF (Portable Document Format) is one of the most widely used formats for digital documents. PDF files use the .pdf extension and are designed to present and exchange documents reliably, independent of software, hardware, or operating system.
In this article, we will extract text from PDF files using two Python libraries: pypdf and PyMuPDF.
Extracting text from a PDF file using the pypdf library.
The pypdf package can do what we need here (text extraction), and it can do more besides: it can also be used to generate, decrypt, and merge PDF files. Note: For more information, refer to Working with PDF files in Python.
Installation
To install this package, run the following command in the terminal:
pip install pypdf
Example: The code below assumes an input file named example.pdf in the current working directory.
Python
# importing required modules
from pypdf import PdfReader
# creating a pdf reader object
reader = PdfReader('example.pdf')
# printing number of pages in pdf file
print(len(reader.pages))
# getting a specific page from the pdf file
page = reader.pages[0]
# extracting text from page
text = page.extract_text()
print(text)
Output: The script prints the number of pages in example.pdf, followed by the text of its first page.
Let us try to understand the above code in chunks:
reader = PdfReader('example.pdf')
- We created an object of the PdfReader class from the pypdf module.
- The PdfReader class takes one required positional argument: the path to the PDF file.
print(len(reader.pages))
- The pages property gives a list-like sequence of PageObject instances, so we can use Python's built-in len() function to get the number of pages in the PDF file.
page = reader.pages[0]
- Because reader.pages behaves like a list of PageObject instances, we can get a specific page of the PDF by its index. Indexing in Python starts from 0, so reader.pages[0] gives us the first page of the PDF file.
text = page.extract_text()
print(text)
- The page object provides the extract_text() method, which returns the text of that page as a string.
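The example above reads only the first page. As a minimal sketch of how the same calls could be extended to a whole document, the helper below loops over every page and joins the results. The function names extract_all_text and extract_pdf_text are hypothetical (they are not part of pypdf), and the pypdf import is placed inside the second function so that the first one works with any objects that offer an extract_text() method.

```python
def extract_all_text(pages, separator="\n"):
    """Join the text of every page in `pages`.

    `pages` is any iterable of objects with an extract_text() method,
    for example reader.pages from pypdf.
    """
    parts = []
    for page in pages:
        # A page without extractable text (such as a scanned image)
        # may yield an empty result, so fall back to an empty string.
        parts.append(page.extract_text() or "")
    return separator.join(parts)


def extract_pdf_text(path):
    """Open the PDF at `path` with pypdf and return the text of all pages."""
    # Imported here so extract_all_text() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return extract_all_text(reader.pages)
```

Assuming example.pdf exists, print(extract_pdf_text('example.pdf')) would print the text of every page, with pages separated by a newline.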
Extracting text from a PDF file using the PyMuPDF library.
PyMuPDF is a Python library that supports file formats like XPS, PDF, CBR, and CBZ. In this article, we concentrate on PDF (Portable Document Format) files only.
Installation
pip install pymupdf
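Note: Do not run pip install fitz. The fitz package on PyPI is an unrelated project, and installing it alongside PyMuPDF can break import fitz. Installing pymupdf is enough to make the fitz module available.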
To extract the text from the PDF, we follow these steps:
- Importing the library
- Opening document
- Extracting text
Note: The examples below use a file named sample.pdf.
1. Importing the library
Python
import fitz
2. Opening document
Python
doc = fitz.open('sample.pdf')
Here, we created a document object named doc. The filename is passed as a Python string containing the path to the PDF file.
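Since the filename has to point at a real file, it can help to validate it before opening the document. The helper below is a hypothetical sketch that uses only the standard library; checked_pdf_path is an illustrative name and not part of PyMuPDF, and the .pdf suffix check is an assumption that fits this article's focus on PDF files.

```python
from pathlib import Path


def checked_pdf_path(filename):
    """Return `filename` as a string after basic sanity checks."""
    path = Path(filename)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    # Return a plain string, which is what the open call is given above.
    return str(path)
```

The result could then be passed on, for example doc = fitz.open(checked_pdf_path('sample.pdf')).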
3. Extracting text
Python
for page in doc:
    text = page.get_text()
    print(text)
Here, we iterated over the pages of the PDF and used the get_text() method to extract the text of each page.
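When the text of each page needs to stay separate, the same iteration can fill a dictionary keyed by page number. This is a minimal sketch: text_by_page and read_pdf_pages are hypothetical helper names, numbering pages from 1 is a choice made here for readability, and the with statement assumes a PyMuPDF version in which the document object can be used as a context manager.

```python
def text_by_page(doc):
    """Map 1-based page numbers to the text of each page.

    `doc` is any iterable of objects with a get_text() method,
    for example a document opened with fitz.open().
    """
    return {number: page.get_text() for number, page in enumerate(doc, start=1)}


def read_pdf_pages(path):
    """Open the PDF at `path` with PyMuPDF and return its text per page."""
    # Imported here so text_by_page() can be used without PyMuPDF installed.
    import fitz

    # The with block closes the document once the text has been read.
    with fitz.open(path) as doc:
        return text_by_page(doc)
```

Assuming sample.pdf exists, read_pdf_pages('sample.pdf')[1] would give the text of the first page.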
Complete code to extract the text
Python
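import fitz  # the fitz module is provided by the pymupdf package

doc = fitz.open('sample.pdf')
text = ""
# append the text of every page to a single string
for page in doc:
    text += page.get_text()
print(text)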
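Printing is fine for a quick check, but the extracted text is often easier to reuse once it is saved to a file. The sketch below writes the text to a .txt file with the standard library. The names save_text and pdf_to_txt, the UTF-8 encoding, and the output filename in the usage note are assumptions made for illustration.

```python
from pathlib import Path


def save_text(text, out_path, encoding="utf-8"):
    """Write `text` to `out_path` and return the path as a Path object."""
    out = Path(out_path)
    out.write_text(text, encoding=encoding)
    return out


def pdf_to_txt(pdf_path, txt_path):
    """Extract all text from `pdf_path` with PyMuPDF and save it to `txt_path`."""
    # Imported here so save_text() can be used without PyMuPDF installed.
    import fitz

    with fitz.open(pdf_path) as doc:
        text = "".join(page.get_text() for page in doc)
    return save_text(text, txt_path)
```

Assuming sample.pdf exists, pdf_to_txt('sample.pdf', 'sample.txt') would create sample.txt next to the script.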
Output: The script prints the text of every page in sample.pdf.
Conclusion
In this article, we used two Python libraries, pypdf and PyMuPDF, to extract text from a PDF file.