PDF Redaction using Python
Last Updated: 22 Jun, 2021
Let's start with what redaction means. In the editorial sense, redaction is a form of editing in which multiple source texts are combined and altered slightly to make a single document. In everyday use, it means blacking out parts of a document to hide sensitive information. Doing the same on a PDF is known as PDF redaction.
Anyone who has worked on data extraction from PDFs knows how painful they can be to handle. Consider a scenario where you want to share a PDF with someone, but certain parts of it must not be leaked. In that case you can redact the text. Redacting text by hand is easy with a tool like Adobe Acrobat, but what if you want the process to be automated?
Suppose you work at a company that shares its users' purchase records with the Income Tax Department but, under strict privacy policies, must first remove users' Personally Identifiable Information (PII) from the transaction receipts. If the user base is large, this cannot be done manually, so you need some kind of automation. This is where Python comes in. PyMuPDF is a library for handling PDFs and performing various operations on them. Let's see how to use it for redaction.
First, you need Python 3 and PyMuPDF installed. To install PyMuPDF, open a terminal and run:
pip3 install PyMuPDF
For this demonstration, we will redact only email IDs from a PDF. You can apply the same logic to any other PII.
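As a rough sketch of how the same logic could extend to other PII, the matching step can be driven by a table of named patterns. Only the email pattern is taken from this article; the `PII_PATTERNS` name, the `find_pii` helper, and the phone pattern (which assumes an NNN-NNN-NNNN format) are hypothetical and would need adapting to your data.

```python
import re

# Hypothetical pattern table: only the email pattern comes from the article.
PII_PATTERNS = {
    "email": r"[\w\.\d]+\@[\w\d]+\.[\w\d]+",
    # Assumed phone format NNN-NNN-NNNN; adjust for your own data.
    "phone": r"\b\d{3}-\d{3}-\d{4}\b",
}


def find_pii(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in re.finditer(pattern, text):
            hits.append((label, match.group(0)))
    return hits


print(find_pii("Call 555-010-1234 or write to sam@example.com"))
# [('email', 'sam@example.com'), ('phone', '555-010-1234')]
```

Each matched string can then be searched for and redacted in the PDF in exactly the way the email IDs are handled in the rest of this article.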
Approach:
- Read the PDF file.
- Iterate through the text of each page line by line and look for every occurrence of an email ID. Email IDs follow a pattern, so we use a regular expression (regex) to identify them.
- Collect each email ID we encounter. The code below does this with a generator that yields the matches one at a time.
- Search for each fetched email ID in the PDF. PyMuPDF makes it very easy to find text on a page: the search returns a list of rectangles, each described by four coordinates, that enclose the occurrences of the text.
- Once we have all the rectangles, iterate over them and redact each one from the PDF.
Below is the implementation of the above approach, with inline comments to explain each step.
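The matching steps above can be tried on plain strings before any PDF is involved. This is a minimal sketch that reuses the article's email regex; the `extract_emails` helper and the sample lines are made up for illustration.

```python
import re

# Same email regex as used in the article's implementation.
EMAIL_REG = r"([\w\.\d]+\@[\w\d]+\.[\w\d]+)"


def extract_emails(lines):
    """Return every email-like string found in the given lines, in order."""
    found = []
    for line in lines:
        for match in re.finditer(EMAIL_REG, line):
            found.append(match.group(1))
    return found


# Made-up sample text standing in for the lines of a PDF page.
sample_lines = [
    "Name: Jane Doe",
    "Email: jane.doe@example.com",
    "Backup: jdoe99@mail.org, billing@shop.net",
]
emails = extract_emails(sample_lines)
print(emails)
# ['jane.doe@example.com', 'jdoe99@mail.org', 'billing@shop.net']
```

Note that this simple pattern matches only one dot in the domain part, so an address on a domain such as `mail.example.co.uk` would be matched only partially.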
PDF file used: [Figure: the PDF before redaction]
# imports
import fitz  # PyMuPDF
import re


class Redactor:

    # static methods work independently of any class instance
    @staticmethod
    def get_sensitive_data(lines):
        """ Yield every email address found in the given lines """

        # email regex
        EMAIL_REG = r"([\w\.\d]+\@[\w\d]+\.[\w\d]+)"
        for line in lines:

            # finditer returns every match in the line, so a
            # line holding several emails is fully covered
            for match in re.finditer(EMAIL_REG, line, re.IGNORECASE):

                # yield makes this function a generator that
                # hands back one email at a time
                yield match.group(1)

    # constructor
    def __init__(self, path):
        self.path = path

    def redaction(self):
        """ main redactor code """

        # opening the pdf
        doc = fitz.open(self.path)

        # iterating through pages
        for page in doc:

            # wrap_contents (named _wrapContents in older
            # PyMuPDF versions) fixes alignment issues with
            # rect boxes in some cases
            page.wrap_contents()

            # getting the emails on this page that match the regex
            sensitive = self.get_sensitive_data(
                page.get_text("text").split('\n'))
            for data in sensitive:

                # rect boxes enclosing each occurrence of the email
                areas = page.search_for(data)

                # marking each box for redaction with a black fill
                for area in areas:
                    page.add_redact_annot(area, fill=(0, 0, 0))

            # applying the redaction
            page.apply_redactions()

        # saving it to a new pdf
        doc.save('redacted.pdf')
        print("Successfully redacted")


# driver code for testing
if __name__ == "__main__":

    # replace it with the name of your pdf file
    path = 'testing.pdf'
    redactor = Redactor(path)
    redactor.redaction()
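After running the script, it can be worth checking that the redacted text is really gone from the output rather than just hidden. The following is a minimal sketch of such a check; `find_leftovers` and `extract_page_texts` are hypothetical helper names, and `extract_page_texts` assumes PyMuPDF is installed.

```python
def find_leftovers(page_texts, sensitive_strings):
    """Return (page_number, string) pairs for sensitive strings still present."""
    leftovers = []
    for page_number, text in enumerate(page_texts, start=1):
        for item in sensitive_strings:
            if item in text:
                leftovers.append((page_number, item))
    return leftovers


def extract_page_texts(path):
    """Return the plain text of every page in the PDF at path."""
    # Imported here so find_leftovers can be used without PyMuPDF.
    import fitz

    with fitz.open(path) as doc:
        return [page.get_text("text") for page in doc]


# Made-up page texts standing in for extract_page_texts("redacted.pdf").
pages = ["Order 1042 shipped", "Contact: sam@example.com"]
print(find_leftovers(pages, ["sam@example.com", "jane@example.com"]))
# [(2, 'sam@example.com')]
```

Passing `extract_page_texts('redacted.pdf')` together with the list of emails found in the original file should give an empty list if the redaction removed the underlying text as intended.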
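Output: the script writes redacted.pdf, in which every matched email ID is covered by a black box. [Figure: the PDF after redaction]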