Java Program to Extract Content from a PDF

Last Updated : 16 Jul, 2021

This article shows how to extract the content of a PDF file in Java using the Apache Tika library. Apache Tika detects document types and extracts content from many file formats by combining existing document parsers with document type detection techniques. It provides a single generic API for parsing different file formats: all of the underlying parser libraries are encapsulated behind one interface, the Parser interface.
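To make the idea of document type detection concrete, here is a minimal sketch that uses only the JDK. It relies on the fact that a PDF file begins with the ASCII bytes "%PDF-". The class MagicByteSketch and the method detect are hypothetical names made up for this illustration; this is not how Tika implements detection, which draws on many more signals.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MagicByteSketch {

    // A PDF file begins with the ASCII bytes "%PDF-"
    private static final byte[] PDF_MAGIC
        = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical helper: guesses a media type from the
    // leading bytes of a file only
    public static String detect(byte[] head)
    {
        if (head.length >= PDF_MAGIC.length
            && Arrays.equals(
                Arrays.copyOf(head, PDF_MAGIC.length),
                PDF_MAGIC)) {
            return "application/pdf";
        }

        // Fallback for anything this sketch does not recognise
        return "application/octet-stream";
    }

    public static void main(String[] args)
    {
        byte[] pdf
            = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] other
            = "hello".getBytes(StandardCharsets.US_ASCII);

        System.out.println(detect(pdf));
        System.out.println(detect(other));
    }
}
```

A generic parsing API can use a check of this kind to choose a format-specific parser before any content is read.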

Apache Tika provides several classes for extracting and accessing the content of a PDF document. The following classes are used in the extraction of the content:

BodyContentHandler: This Tika class (package org.apache.tika.sax) creates a handler for the document text. It receives the XHTML body character events and, when created with the no-argument constructor, stores them in an internal string buffer. It inherits from the parent class ContentHandlerDecorator. The collected text can be retrieved using the toString() method provided by the parent class.
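The way such a handler collects text can be sketched with the SAX classes that ship with the JDK. The classes BodyTextSketch and BodyTextCollector and the method extractBody are hypothetical names for this illustration only. The sketch mimics the idea of buffering the character events seen inside the body element and is not Tika's actual implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    // Hypothetical handler: buffers the character events that
    // occur between <body> and </body>
    static class BodyTextCollector extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts)
        {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName,
                               String qName)
        {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length)
        {
            if (inBody) {
                buffer.append(ch, start, length);
            }
        }

        // The collected text is exposed through toString()
        @Override
        public String toString() { return buffer.toString(); }
    }

    // Parses an XHTML string and returns the text of its body
    public static String extractBody(String xhtml)
    {
        BodyTextCollector handler = new BodyTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                handler);
        }
        catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.toString();
    }

    public static void main(String[] args)
    {
        String xhtml
            = "<html><head><title>Ignored</title></head>"
              + "<body><p>Hello PDF</p></body></html>";

        // Prints only the body text: Hello PDF
        System.out.println(extractBody(xhtml));
    }
}
```

The title text is skipped because it arrives outside the body element, which is the same reason a body handler returns only the document text and not its header data.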

PDFParser: Apache Tika provides the class PDFParser in the package org.apache.tika.parser.pdf, which parses the contents of PDF documents. It extracts the text of a PDF document stored within paragraphs, strings, and tables (without preserving the table boundaries). It can also parse encrypted documents if the password is supplied with the parse request, for example through the Metadata or ParseContext object.

ParseContext: This class belongs to the package org.apache.tika.parser and is used to pass context information on to the Tika parsers.
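One way to picture a parse context is as a small container in which each setting is stored and looked up by its class. Tika's ParseContext exposes set(Class, value) and get(Class) methods of this shape. The sketch below uses only the JDK, and MiniContextSketch, MiniContext and PasswordSource are hypothetical names invented for the illustration, not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

public class MiniContextSketch {

    // Hypothetical stand-in for a parse context: values are
    // stored and retrieved by their Class key
    static class MiniContext {
        private final Map<Class<?>, Object> entries
            = new HashMap<>();

        <T> void set(Class<T> key, T value)
        {
            entries.put(key, value);
        }

        // Returns null when nothing was stored for the key
        <T> T get(Class<T> key)
        {
            return key.cast(entries.get(key));
        }
    }

    // Hypothetical setting that a parser might look up
    interface PasswordSource {
        String password();
    }

    public static String demo()
    {
        MiniContext context = new MiniContext();

        // The caller registers a setting under its type
        context.set(PasswordSource.class, () -> "secret");

        // A parser later asks the context for that type
        PasswordSource source
            = context.get(PasswordSource.class);
        return source == null ? "no password"
                              : source.password();
    }

    public static void main(String[] args)
    {
        System.out.println(demo());
    }
}
```

Keying by class lets callers hand optional, parser-specific settings to a parser without changing the signature of the parse method.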

Procedure:

Create a content handler.
Create a File object that points to a PDF file in a local directory on the system.
Create a FileInputStream on that PDF file.
Create a Metadata object and a ParseContext object for the PDF document.
Parse the PDF document using the PDFParser class.
Print the text collected by the content handler to show the content extracted from the PDF.

Implementation: The following Java program illustrates the extraction of content from a PDF document. It needs the Apache Tika libraries on the classpath and an existing PDF file at the path used in the code.

Java

// Java Program to Extract Content from a PDF

// Importing java input/output classes
import java.io.File;
import java.io.FileInputStream;
// Importing Apache Tika classes
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

// Class 
public class GFG {

    // Main driver method
    public static void main(String[] args) throws Exception
    {

        // Create a content handler
        BodyContentHandler contenthandler
            = new BodyContentHandler();

        // Reference an existing PDF file in a local directory
        File f = new File("C:/extractcontent.pdf");

        // Create a file input stream on the specified file;
        // try-with-resources closes it after parsing
        try (FileInputStream fstream = new FileInputStream(f)) {

            // Create an object of type Metadata to use
            Metadata data = new Metadata();

            // Create a parse context for the pdf document
            ParseContext context = new ParseContext();

            // PDF document can be parsed using the PDFParser
            // class
            PDFParser pdfparser = new PDFParser();

            // Method parse invoked on the PDFParser object
            pdfparser.parse(fstream, contenthandler, data,
                            context);
        }

        // Printing the contents of the pdf document
        // using toString() method in java
        System.out.println("Extracting contents :"
                           + contenthandler.toString());
    }
}