Extract Text From PDF, Office Files (.doc, .ppt, .xls), Open Office Files, .rtf, and text/plain Files in Java - Codezrule's Blog
The following code extracts text from .pdf, .doc, .ppt, .xls, .odt, .ods, .odp, .rtf and all
text/plain files. The program uses the following jar files:
FontBox-0.1.0-dev.jar
jdom.jar
log4j-1.2.15.jar
PDFBox-0.7.3-dev.jar
poi-2.5.1-final-20040804.jar
poi-contrib-2.5.1-final-20040804.jar
poi-scratchpad-2.5.1-final-20040804.jar
If you are unable to find these jars, leave a comment and I’ll send them to you.
https://codezrule.wordpress.com/2010/03/24/extract-text-from-pdf-office-files-doc-ppt-xls-open-office-files-rtf-and-textplain-files-in-java/ 1/10
01/03/2019 Extract Text From pdf, office files(.doc, .ppt, .xls), open office files, .rtf, and text/plain files in Java | Codezrule's Blog
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;
import org.apache.poi.hdf.extractor.WordDocument;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.util.LittleEndian;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Text;
import org.jdom.input.SAXBuilder;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileReader;
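Of the imports above, RTFEditorKit and DefaultStyledDocument ship with the JDK, so RTF text can be read without any of the listed jars. The following is only a minimal sketch of that idea, not the original program's method: the class name RtfTextSketch and the helper rtfToText are hypothetical, and the sample RTF string is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical class name, used only for this sketch.
class RtfTextSketch {

    // Hypothetical helper: parses RTF from a stream and returns the plain text.
    static String rtfToText(InputStream in) throws IOException, BadLocationException {
        DefaultStyledDocument doc = new DefaultStyledDocument();
        // RTFEditorKit fills the styled document with the RTF content.
        new RTFEditorKit().read(in, doc, 0);
        // The document's text is the RTF content without its markup.
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input; a real caller would pass a FileInputStream.
        String rtf = "{\\rtf1\\ansi Hello RTF}";
        InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(rtfToText(in).trim());
    }
}
```

The helper takes an InputStream rather than a file name so that it can be exercised with in-memory data as well as with files.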
return null;
}
try {
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
parsedText = pdfStripper.getText(pdDoc);
cosDoc.close();
pdDoc.close();
} catch (Exception e) {
System.out.println("An exception occurred in parsing the PDF Document");
e.printStackTrace();
try {
if (cosDoc != null) {
cosDoc.close();
}
if (pdDoc != null) {
pdDoc.close();
}
} catch (Exception e1) {
e1.printStackTrace();
}
return null;
}
System.out.println("Done");
return parsedText;
}
while (iterator.hasNext()) {
Object child = iterator.next();
//If Child is a Text Node, then append the text
if (child instanceof Text) {
Text t = (Text) child;
TextBuffer.append(t.getValue());
} else {
processElement(child); // Recursively process
}
}
}
if (elementName.equals("text:p")) {
TextBuffer.append("\n");
}
} else {
List non_text_list = e.getContent();
Iterator it = non_text_list.iterator();
while (it.hasNext()) {
Object non_text_child = it.next();
processElement(non_text_child);
}
}
}
}
}
return sbuff.toString();
}
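The recursive walk above (append every text node, add a newline after each text:p element) can be sketched with the JDK's built-in DOM parser instead of JDOM. This is a simplified illustration, not the original code: the class OdfTextSketch and its methods are hypothetical names, and it assumes the usual OpenDocument layout in which the file is a zip archive with the body in an entry called content.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

// Hypothetical class name, used only for this sketch.
class OdfTextSketch {

    // Recursively collects text nodes; each text:p paragraph ends with a newline.
    static void collect(Node node, StringBuilder out) {
        if (node instanceof Text) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
        // Namespace awareness is off by default, so the name keeps its prefix.
        if ("text:p".equals(node.getNodeName())) {
            out.append('\n');
        }
    }

    // Parses the XML of content.xml and returns its text.
    static String contentXmlToText(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString();
    }

    // Opens an .odt/.ods/.odp file as a zip and extracts text from content.xml.
    static String odfToText(File file) throws Exception {
        try (ZipFile zip = new ZipFile(file)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                return null; // not an OpenDocument package
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return contentXmlToText(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample; with a file argument the zip path is used instead.
        String xml = "<office:body xmlns:office=\"urn:o\" xmlns:text=\"urn:t\">"
                + "<text:p>Hello <text:span>bold</text:span></text:p>"
                + "<text:p>World</text:p></office:body>";
        String text = args.length > 0
                ? odfToText(new File(args[0]))
                : contentXmlToText(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.print(text);
    }
}
```

Unlike the listing above, this sketch treats every element the same way and only special-cases text:p, which keeps it short but may pick up text from elements the original code handles separately.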
If you are on a Linux machine with a bash shell, place the code in ReadFileFormat.java beside
the above-mentioned jar files and run the following commands:
1. export CLASSPATH=.:FontBox-0.1.0-dev.jar:jdom.jar:log4j-1.2.15.jar:PDFBox-0.7.3-dev.jar:poi-2.5.1-final-20040804.jar:poi-contrib-2.5.1-final-20040804.jar:poi-scratchpad-2.5.1-final-20040804.jar:$CLASSPATH
2. javac ReadFileFormat.java
3. java ReadFileFormat
If you are on a Windows machine and javac is on your PATH, run the
following commands from the command line:
1. set classpath=%classpath%;.;FontBox-0.1.0-dev.jar;jdom.jar;log4j-1.2.15.jar;PDFBox-0.7.3-dev.jar;poi-2.5.1-final-20040804.jar;poi-contrib-2.5.1-final-20040804.jar;poi-scratchpad-2.5.1-final-20040804.jar
2. javac ReadFileFormat.java
3. java ReadFileFormat
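If compilation fails with missing-class errors, the classpath is a likely culprit. As a rough aid, the sketch below lists classpath entries that do not exist on disk. It is not part of the original program: the class name ClasspathCheck and the method missingEntries are hypothetical, and it relies only on the standard java.class.path system property and File.pathSeparator.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class ClasspathCheck {

    // Returns the classpath entries that do not exist on disk.
    static List<String> missingEntries(String classPath) {
        List<String> missing = new ArrayList<>();
        // File.pathSeparator is ":" on Linux and ";" on Windows.
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            if (entry.isEmpty()) {
                continue; // skip empty entries left by stray separators
            }
            if (!new File(entry).exists()) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The JVM exposes the effective classpath through this property.
        String classPath = System.getProperty("java.class.path");
        List<String> missing = missingEntries(classPath);
        if (missing.isEmpty()) {
            System.out.println("All classpath entries exist.");
        } else {
            for (String entry : missing) {
                System.out.println("Missing: " + entry);
            }
        }
    }
}
```

Compile and run it from the same directory and shell session in which the classpath was set, so that it sees the same value as javac and java do.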