pdfgrep Command in Linux

Last Updated : 21 Nov, 2022

grep is a powerful tool for searching text files for a pattern or regular expression, but it cannot search PDF files. That is where pdfgrep comes in: it is a simple command that searches PDF files for a regular expression. This article covers the pdfgrep command and its usage.

Syntax:

Usage: pdfgrep [OPTION]... PATTERN FILE...

Installation of pdfgrep command

Unlike grep, pdfgrep is not pre-installed, but it is available from the package repositories of most Linux distributions.

1. For Ubuntu/Debian:

sudo apt-get install pdfgrep

2. For CentOS/Fedora:

sudo yum install pdfgrep
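
To confirm that the installation worked, a quick check like the one below can help. This is only a minimal sketch: have_pdfgrep is a helper name made up for this example and is not part of pdfgrep itself.

```shell
# have_pdfgrep is a hypothetical helper: it succeeds when a command
# named pdfgrep can be found by the shell.
have_pdfgrep() {
    command -v pdfgrep >/dev/null 2>&1
}

if have_pdfgrep; then
    # Print the installed version.
    pdfgrep --version
else
    echo "pdfgrep not found; install it with your package manager"
fi
```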

Working with pdfgrep:

The pdfgrep command is largely compatible with GNU grep and adds some PDF-specific options. If you are familiar with grep, most of the options will look familiar.

1. Basic Search

Let's do a basic search for the string "General Linux" in a PDF file.

Example:

pdfgrep "General Linux" intro-linux.pdf

2. Print filename

Use the --with-filename or -H option to print the PDF file name with each match, even when only one file is searched.

Example:

pdfgrep -H dns intro-linux.pdf

When more than one file is searched, pdfgrep prints the filename by default, as if -H had been given.
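
In a script, it can be convenient to pass -H explicitly so that the output has the same shape whether one file or several are searched. The following is a sketch under that assumption; search_pdfs is a hypothetical wrapper name, and the second file name in the usage comment is a placeholder.

```shell
# search_pdfs PATTERN FILE...
# Hypothetical wrapper: always pass -H so that every match is prefixed
# with its file name, regardless of how many files are given.
search_pdfs() {
    pattern=$1
    shift
    pdfgrep -H -- "$pattern" "$@"
}

# Usage (file names are placeholders):
#   search_pdfs dns intro-linux.pdf
#   search_pdfs dns intro-linux.pdf another-guide.pdf
```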

3. Case-insensitive search

Use --ignore-case or -i to do a case-insensitive search. Let's search for the word dns.

Example:

pdfgrep -i dns intro-linux.pdf

The output includes matches for both dns and DNS.

4. Get the match count

Use --count or -c to print the number of matches instead of the matching lines.

Example:

pdfgrep -ic dns intro-linux.pdf

Ignoring case, dns appears 28 times in the file.

5. Show the page number

Use --page-number or -n to show the page number. This option prefixes each match with the number of the page on which the pattern was found.

Example:

pdfgrep -in dns intro-linux.pdf

6. Show match-count per page

Use the --page-count or -p option to print the number of matches per page. This option implies -n (page numbers).

Example:

pdfgrep -ip dns intro-linux.pdf

The output has the form 'page number: match count'. On page 53, dns appears once, while on page 169 it appears 5 times.
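
Because every output line ends with a match count, the per-page counts can be added up with standard tools. The sketch below assumes colon-separated output lines whose last field is the count; sum_page_counts is a hypothetical helper name, and the sample input is built only from the two pages mentioned above.

```shell
# sum_page_counts reads "page:count" lines on standard input and prints
# the sum of the last colon-separated field of every line.
sum_page_counts() {
    awk -F: '{ total += $NF } END { print total + 0 }'
}

# Sample input using the two pages discussed above (53 and 169).
printf '53:1\n169:5\n' | sum_page_counts   # prints 6

# With a real file, the call would look like:
#   pdfgrep -ip dns intro-linux.pdf | sum_page_counts
```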

7. Limit the match count

Use the --max-count or -m option to stop reading a file after NUM matches. This is useful when only the first few matches are needed and the rest of the file does not have to be read.

Example:

pdfgrep -inm 10 dns intro-linux.pdf

The output shows only the first 10 matches for the dns pattern; pdfgrep stops reading the file after that.

8. Context control

The following options print the lines before, after, or around each match.

8.1 Context after the match

Use the --after-context or -A option to print NUM lines of context after each match.

Example:

pdfgrep -A 2 dns intro-linux.pdf

Here, 2 lines are printed after each match, and contiguous groups of matches are separated by --.

8.2 Context before the match

Use --before-context or -B to print NUM lines of context before the match.

Example:

pdfgrep -B 2 dns intro-linux.pdf

8.3 Context around match

Use --context or -C to print NUM lines of context before and after the match.

Example:

pdfgrep -C 2 dns intro-linux.pdf

9. Caching

A PDF file contains images and layout information along with its text, so pdfgrep has to extract the text before it can search it. For a large file this takes time, which becomes frustrating when the same file is searched repeatedly. The --cache option caches the rendered text so that later searches of the same file run faster.

Example:

time pdfgrep --cache -iq dns intro-linux.pdf

Here the pattern dns is searched both with and without the cache enabled; the command that includes --cache finishes faster than the one that does not. The -q option suppresses the normal output so that only the timing is shown.
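
Since -q prints nothing, its result is read from the exit status, which makes it a natural fit for shell conditions. The sketch below assumes the documented grep-like behaviour in which pdfgrep exits with status 0 when at least one match is found; pdf_mentions is a hypothetical helper name.

```shell
# pdf_mentions PATTERN FILE
# Hypothetical helper: succeeds when FILE contains PATTERN
# (case-insensitive), printing nothing and reusing cached text.
pdf_mentions() {
    pdfgrep --cache -iq -- "$1" "$2"
}

# Usage:
#   if pdf_mentions dns intro-linux.pdf; then
#       echo "dns is mentioned"
#   fi
```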

10. Password protected file

With the --password option, pdfgrep can also search a password-protected file.

Usage:

pdfgrep --password [password] [pattern] [pdf_file]

Example:

pdfgrep --password "ndey" dns intro-linux-protected.pdf
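
Typing the password directly on the command line leaves it in the shell history. One possible workaround, shown here only as a sketch, is to read it from an environment variable. PDF_PASSWORD and search_protected are names invented for this example, and the password is still visible in the process list while pdfgrep runs.

```shell
# search_protected PATTERN FILE
# Hypothetical wrapper: takes the password from the PDF_PASSWORD
# environment variable and aborts with a message if it is not set.
search_protected() {
    pdfgrep --password "${PDF_PASSWORD:?set PDF_PASSWORD first}" -- "$1" "$2"
}

# Usage:
#   PDF_PASSWORD=ndey search_protected dns intro-linux-protected.pdf
```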

Differences between grep and pdfgrep

grep	pdfgrep
It searches plain text files.	It searches PDF files.
It is installed by default.	It is not installed by default but is available from the repositories.
It reports matches by line.	It reports matches by page.
The -n option shows the line number.	The -n option shows the page number.

strive_to_learn
